Basic Utilities

This section shows how to use the APIs for dediacritization, normalization, and transliteration.

Dediacritization

The function to use is dediac, which works on Arabic, Buckwalter and custom transliterated characters.

julia> using Yunir
julia> ar_basmala = "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ";
julia> dediac(ar_basmala)
"بسم ٱلله ٱلرحمن ٱلرحيم"

Or using Buckwalter as follows:

julia> bw_basmala = "bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi";
julia> dediac(bw_basmala; isarabic=false)
"bsm {llh {lrHmn {lrHym"

With Julia's broadcasting feature, the above dediacritization can be applied to every element of an array by appending a dot (.) to the function name.

julia> sentence0 = ["بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ",
           "إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ"
       ]
2-element Vector{String}:
 "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
 "إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ"
julia> dediac.(sentence0)
2-element Vector{String}:
 "بسم ٱلله ٱلرحمن ٱلرحيم"
 "إياك نعبد وإياك نستعين"
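
To make the operation concrete, here is a minimal sketch of what dediacritization amounts to at the character level. It is not Yunir's implementation: is_toy_diacritic and strip_diacritics are hypothetical names, and the set of marks treated as diacritics (U+064B to U+0652 plus the dagger alif U+0670) is an assumption made for illustration only.

```julia
# Hypothetical sketch, not Yunir's implementation.
# Assumed diacritic set: tanween, short vowels, shadda and sukun
# (U+064B..U+0652), plus the dagger alif (U+0670).
is_toy_diacritic(c::Char) = ('\u064B' <= c <= '\u0652') || c == '\u0670'

# Keep every character that is not in the assumed diacritic set.
strip_diacritics(text::AbstractString) = filter(c -> !is_toy_diacritic(c), text)

# Two diacritized words written with escapes so the code points are explicit.
words = [
    "\u0628\u0650\u0633\u0652\u0645\u0650",              # ba, sin, mim with marks
    "\u0646\u064E\u0639\u0652\u0628\u064F\u062F\u064F",  # nun, ain, ba, dal with marks
]

# The dot applies the function to each element, as with dediac. above.
stripped = strip_diacritics.(words)
```

The same dot syntax works for any Julia function, which is why no array-specific version of dediac is needed.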

Normalization

The function to use is normalize, which works on Arabic, Buckwalter and custom transliterated characters. For example, using the ar_basmala and bw_basmala defined above, the normalized version would be

julia> normalize(ar_basmala)
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"
julia> normalize(bw_basmala; isarabic=false)
"bisomi All~ahi Alr~aHomaAni Alr~aHiymi"

You can also normalize specific characters, for example:

julia> normalize(ar_basmala, :alif_khanjareeya)
"بِسْمِ ٱللَّهِ ٱلرَّحْمَانِ ٱلرَّحِيمِ"
julia> normalize(ar_basmala, :hamzat_wasl)
"بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"
julia> sentence1 = "وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ أُنزِلَ إِلَيْكَ وَمَآ أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ";
julia> normalize(sentence1, :alif_maddah)
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَا أُنزِلَ إِلَيْكَ وَمَا أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(sentence1, :alif_hamza_above)
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ اُنزِلَ إِلَيْكَ وَمَآ اُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> sentence2 = "إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ";
julia> normalize(sentence2, :alif_hamza_below)
"اِيَّاكَ نَعْبُدُ وَاِيَّاكَ نَسْتَعِينُ"
julia> sentence3 = "ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰةَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ";
julia> normalize(sentence3, :waw_hamza_above)
"ٱلَّذِينَ يُوْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰةَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ"
julia> normalize(sentence3, :ta_marbuta)
"ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰهَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ"
julia> sentence4 = "ٱللَّهُ يَسْتَهْزِئُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ";
julia> normalize(sentence4, :ya_hamza_above)
"ٱللَّهُ يَسْتَهْزِيُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ"
julia> sentence5 = "ذَٰلِكَ ٱلْكِتَٰبُ لَا رَيْبَ فِيهِ هُدًى لِّلْمُتَّقِينَ";
julia> normalize(sentence5, :alif_maksura)
"ذَٰلِكَ ٱلْكِتَٰبُ لَا رَيْبَ فِيهِ هُدًي لِّلْمُتَّقِينَ"
julia> sentence6 = "ﷺ"
"ﷺ"
julia> normalize(sentence6) === "صلى الله عليه وسلم"
true
julia> sentence7 = "ﷻ"
"ﷻ"
julia> normalize(sentence7) === "جل جلاله"
true
julia> sentence8 = "﷽"
"﷽"
julia> normalize(sentence8) === ar_basmala
true

Or a combination of characters, passed as a vector:

julia> normalize(ar_basmala, [:alif_khanjareeya, :hamzat_wasl])
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"
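
One way to picture selective and combined normalization is as a table of named replacement rules applied one after another. The sketch below is a hypothetical illustration and not how Yunir implements normalize: toy_rules and toy_normalize are made-up names, and only two rules are included, each mapping a single code point to the bare alif (U+0627).

```julia
# Hypothetical rule table, not Yunir's: each name maps one code point to another.
toy_rules = Dict(
    :hamzat_wasl      => ('\u0671' => '\u0627'),  # alif wasla -> bare alif
    :alif_hamza_below => ('\u0625' => '\u0627'),  # alif with hamza below -> bare alif
)

# Apply a single named rule.
toy_normalize(text::AbstractString, rule::Symbol) = replace(text, toy_rules[rule])

# Apply several rules in order by folding the single-rule method over the list.
toy_normalize(text::AbstractString, rules::Vector{Symbol}) =
    foldl(toy_normalize, rules; init=text)

sample = "\u0671\u0644 \u0625\u0644"
only_wasl = toy_normalize(sample, :hamzat_wasl)
both = toy_normalize(sample, [:hamzat_wasl, :alif_hamza_below])
```

Dispatching on Symbol versus Vector{Symbol} mirrors the two call forms shown above, where a single character name or a vector of names is accepted.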

Broadcasting also applies to the normalize function.

julia> normalize.(sentence0)
2-element Vector{String}:
 "بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"
 "اِيَّاكَ نَعْبُدُ وَاِيَّاكَ نَسْتَعِينُ"
julia> normalize.(sentence0, [:alif_khanjareeya, :alif_hamza_below])
2-element Vector{String}:
 "بِسْمِ ٱللَّهِ ٱلرَّحْمَانِ ٱلرَّحِيمِ"
 "اِيَّاكَ نَعْبُدُ وَاِيَّاكَ نَسْتَعِينُ"
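
Note that broadcasting over two vectors pairs their elements: in the last call the first sentence is normalized only for :alif_khanjareeya and the second only for :alif_hamza_below, which matches the output. The sketch below shows this pairing, and how wrapping the vector in Ref passes the whole vector to every element instead. It uses a hypothetical stand-in function, describe, rather than normalize itself; whether the Ref form behaves the same way with normalize is an assumption that the text above does not demonstrate.

```julia
# Hypothetical stand-in for a function with a (text, rule) method and a
# (text, rules) method; it only records which rules it received.
describe(text::String, rule::Symbol) = "$text/$rule"
describe(text::String, rules::Vector{Symbol}) = "$text/" * join(rules, "+")

texts = ["a", "b"]
rules = [:x, :y]

# Elementwise pairing: texts[1] with rules[1], texts[2] with rules[2].
paired = describe.(texts, rules)

# Ref marks the vector as a single value, so each text receives all rules.
all_rules = describe.(texts, Ref(rules))
```

Here paired is ["a/x", "b/y"], while all_rules is ["a/x+y", "b/x+y"].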

Transliteration

By default, Yunir.jl uses extended Buckwalter transliteration. The functions to use are encode (Arabic -> Roman) and arabic (Roman -> Arabic). The following are some examples:

julia> arabic(bw_basmala)
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
julia> arabic(bw_basmala) === ar_basmala
true
julia> encode(ar_basmala)
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> encode(ar_basmala) === bw_basmala
true

Custom Transliteration

For custom transliteration, the user must specify the character mapping in a dictionary whose keys and values are both of type Symbol. By default, the Buckwalter mapping used in Yunir.jl is stored in the constant BW_ENCODING.

julia> BW_ENCODING
Dict{Symbol, Symbol} with 77 entries:
  Symbol("ۣ")  => Symbol(";")
  :ة          => :p
  :ذ          => :*
  :ۥ          => Symbol(",")
  Symbol("؍") => :c
  :ء          => Symbol("'")
  Symbol("ۜ")  => :(:)
  Symbol("َ")  => :a
  Symbol("٦") => Symbol("6")
  :ي          => :y
  Symbol("ٰ")  => Symbol("`")
  :ت          => :t
  :ن          => :n
  :ب          => :b
  :ص          => :S
  :ا          => :A
  :ث          => :v
  :إ          => :<
  :ج          => :j
  ⋮           => ⋮
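
As a rough illustration of how a Symbol-to-Symbol table like this can drive transliteration, the sketch below looks up each character of a string in a three-entry toy table. It is an assumption about the general idea, not Yunir's actual encode: toy_encoding and toy_encode are hypothetical names, and passing unmapped characters through unchanged is a choice made here for simplicity.

```julia
# Hypothetical toy table in the same Symbol => Symbol shape as BW_ENCODING.
# Only three letters are mapped: ba, sin and mim.
toy_encoding = Dict(
    Symbol('\u0628') => :b,
    Symbol('\u0633') => :s,
    Symbol('\u0645') => :m,
)

# Look up each character as a Symbol; characters missing from the table
# (such as the space below) are passed through unchanged.
function toy_encode(text::AbstractString, table::Dict{Symbol,Symbol})
    return join(String(get(table, Symbol(c), Symbol(c))) for c in text)
end

encoded = toy_encode("\u0628\u0633\u0645 \u0628", toy_encoding)
```

With this table, encoded is "bsm b".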

Suppose we want to create a custom transliteration by simply reversing the order of the dictionary's values, so that each key is paired with a different value:

julia> old_keys = collect(keys(BW_ENCODING));
julia> new_vals = reverse(collect(values(BW_ENCODING)));

The new dictionary would be:

julia> my_encoder = Dict(old_keys .=> new_vals)
Dict{Symbol, Symbol} with 77 entries:
  Symbol("ۣ")  => :q
  :ة          => Symbol("7")
  :ذ          => Symbol("(")
  :ۥ          => :l
  Symbol("؍") => :u
  :ء          => :D
  Symbol("ۜ")  => :f
  Symbol("َ")  => Symbol("1")
  Symbol("٦") => Symbol("[")
  :ي          => Symbol("]")
  Symbol("ٰ")  => Symbol("\"")
  :ت          => :z
  :ن          => Symbol("#")
  :ب          => :r
  :ص          => :k
  :ا          => :Z
  :ث          => :~
  :إ          => Symbol("9")
  :ج          => :i
  ⋮           => ⋮
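
Reversing the values only permutes them, so the new mapping stays one-to-one and can still be inverted for decoding. The sketch below checks this on a small toy dictionary; toy, shuffled and decoder are hypothetical names, and the check illustrates the property rather than anything Yunir runs internally.

```julia
# Toy stand-in for BW_ENCODING with three entries.
toy = Dict(:a => :x, :b => :y, :c => :z)

# Same construction as above: keep the keys, reverse the order of the values.
# keys and values iterate in the same order as long as the Dict is not modified.
old_keys = collect(keys(toy))
new_vals = reverse(collect(values(toy)))
shuffled = Dict(old_keys .=> new_vals)

# A permutation of distinct values is still distinct, so no two keys collide.
is_one_to_one = length(unique(values(shuffled))) == length(shuffled)

# Because the mapping is one-to-one, swapping each pair gives a valid decoder.
decoder = Dict(v => k for (k, v) in shuffled)
roundtrip_ok = all(decoder[shuffled[k]] == k for k in keys(toy))
```

If two keys shared a value, the inverted dictionary would silently drop one of them, and decoding would no longer recover the original text.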

The next step is to declare this new transliteration so that the functions for dediacritization and normalization can use the new mapping. This is done using the macro @transliterator, which takes two arguments: the dictionary and the type name of the mapping.

julia> @transliterator my_encoder "MyEncoder"

Using this new transliteration, we now have an updated mapping for the basmala above:

julia> encode(ar_basmala)
"rjsFKj 3,,v1Yj 3,bv1}FK1\"#j 3,bv1}j]Kj"

Converting this back to Arabic characters should give us the appropriate decoding:

julia> arabic(encode(ar_basmala))
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"

Dediacritization and Normalization on Custom Transliteration

As mentioned above, dediacritization and normalization also work on a new custom transliteration. For example, dediacritizing the encoded ar_basmala gives us:

julia> dediac(encode(ar_basmala); isarabic=false)
"rsK 3,,Y 3,b}K# 3,b}]K"
julia> dediac(encode(ar_basmala); isarabic=false) |> arabic
"بسم ٱلله ٱلرحمن ٱلرحيم"

And for normalization,

julia> normalize(encode(ar_basmala); isarabic=false)
"rjsFKj Z,,v1Yj Z,bv1}FK1Z#j Z,bv1}j]Kj"
julia> normalize(encode(ar_basmala); isarabic=false) |> arabic
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"

Reset Transliteration

To reset the transliteration back to Buckwalter, simply specify :default as the argument for the macro @transliterator as follows:

julia> @transliterator :default

With this, all functions that depend on the transliteration are updated as well.

julia> encode(ar_basmala)
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> encode(ar_basmala) === bw_basmala
true
julia> dediac(encode(ar_basmala); isarabic=false)
"bsm {llh {lrHmn {lrHym"
julia> normalize(encode(ar_basmala); isarabic=false)
"bisomi All~ahi Alr~aHomaAni Alr~aHiymi"