Basic Utilities
In this section, we are going to discuss how to use the APIs for dediacritization, normalization, and transliteration.
Dediacritization
Dediacritization is the process of removing diacritics from an Arabic word. These diacritics are mostly vowels but also include sukuun سُكُون and shaddah شَدّة. The function to use for dediacritization is dediac, which works on Arabic, Buckwalter, or custom transliterated characters.
julia> using Yunir

julia> @transliterator :default

julia> ar_basmala = Ar("بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ");

julia> dediac(ar_basmala)
ERROR: MethodError: no method matching dediac(::Ar)
The function `dediac` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  dediac(::Union{Ar, Bw}, ::String)
   @ Yunir ~/work/Yunir.jl/Yunir.jl/src/utils/dediac.jl:15
Or using Buckwalter as follows:
julia> bw_basmala = Bw("bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi");

julia> dediac(bw_basmala)
ERROR: MethodError: no method matching dediac(::Bw)
The function `dediac` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  dediac(::Union{Ar, Bw}, ::String)
   @ Yunir ~/work/Yunir.jl/Yunir.jl/src/utils/dediac.jl:15
Passing false for the isarabic parameter indicates that the dediac function takes a Buckwalter-encoded input, bw_basmala, and returns its output in Buckwalter form as well, rather than in Arabic as in the previous example.
With Julia's broadcasting feature, the above dediacritization can be applied to arrays by simply adding . to the name of the function.
julia> sentence0 = Ar.(["بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ", "إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ" ])
2-element Vector{Ar}:
 Ar("بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ")
 Ar("إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ")

julia> dediac.(sentence0)
ERROR: MethodError: no method matching dediac(::Ar)
The function `dediac` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  dediac(::Union{Ar, Bw}, ::String)
   @ Yunir ~/work/Yunir.jl/Yunir.jl/src/utils/dediac.jl:15
Broadcasting applies the dediac function to each element of the vector sentence0. Because sentence0 has two entries, dediac is called on each of them, and two outputs are returned.
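To make the idea concrete without relying on any particular Yunir.jl method signature, here is a minimal sketch of dediacritization on plain Julia strings using only Base. The helper names is_diacritic and strip_diacritics are hypothetical, and the set of code points treated as diacritics (U+064B to U+0652 plus the dagger alif U+0670) is an assumption made for this sketch; it is not taken from the Yunir.jl source.

```julia
# Hypothetical helpers, not part of Yunir.jl: strip common Arabic diacritics
# from a plain String using only Julia's Base library.

# Assumption for this sketch: diacritics are U+064B..U+0652 (tanwin, short
# vowels, shaddah, sukun) and U+0670 (dagger alif).
is_diacritic(c::Char) = ('\u064B' <= c <= '\u0652') || c == '\u0670'

# Keep every character that is not a diacritic.
strip_diacritics(s::AbstractString) = filter(!is_diacritic, s)

# The word "bismi": ba, kasra, sin, sukun, mim, kasra.
word = "\u0628\u0650\u0633\u0652\u0645\u0650"
bare = strip_diacritics(word)   # ba, sin, mim only

# Adding a dot broadcasts the helper over each element of a vector.
bare_words = strip_diacritics.([word, "\u0642\u064F\u0644\u0652"])
```

The broadcast call returns one stripped string per input element, which mirrors the one-output-per-entry behavior described above.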
Normalization
Arabic script is calligraphic by design. Its free-flowing design makes it flexible enough to form unique ligatures, which may need to be normalized for consistency in natural language processing. To normalize, use the normalize function, which works on Arabic, Buckwalter, or custom transliterated characters. For example, using the ar_basmala and bw_basmala defined above, the normalized versions are:
julia> normalize(ar_basmala)
Ar("بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ")

julia> normalize(bw_basmala)
Bw("bisomi All~ahi Alr~aHomaAni Alr~aHiymi")
Again, the isarabic=false parameter simply disables Arabic output and encodes the result as Buckwalter instead. You can also normalize specific characters, for example:
julia> normalize(ar_basmala, :alif_khanjareeya)
Ar("بِسْمِ ٱللَّهِ ٱلرَّحْمَانِ ٱلرَّحِيمِ")

julia> normalize(ar_basmala, :hamzat_wasl)
Ar("بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ")

julia> sentence1 = Ar("وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ أُنزِلَ إِلَيْكَ وَمَآ أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ");

julia> normalize(sentence1, :alif_maddah)
Ar("وَٱلَّذِينَ يُؤْمِنُونَ بِمَا أُنزِلَ إِلَيْكَ وَمَا أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ")

julia> normalize(sentence1, :alif_hamza_above)
Ar("وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ اُنزِلَ إِلَيْكَ وَمَآ اُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ")

julia> sentence2 = Ar("إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ");

julia> normalize(sentence2, :alif_hamza_below)
Ar("اِيَّاكَ نَعْبُدُ وَاِيَّاكَ نَسْتَعِينُ")

julia> sentence3 = Ar("ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰةَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ");

julia> normalize(sentence3, :waw_hamza_above)
Ar("ٱلَّذِينَ يُوْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰةَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ")

julia> normalize(sentence3, :ta_marbuta)
Ar("ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰهَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ")

julia> sentence4 = Ar("ٱللَّهُ يَسْتَهْزِئُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ");

julia> normalize(sentence4, :ya_hamza_above)
Ar("ٱللَّهُ يَسْتَهْزِيُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ")

julia> sentence5 = Ar("ذَٰلِكَ ٱلْكِتَٰبُ لَا رَيْبَ فِيهِ هُدًى لِّلْمُتَّقِينَ");

julia> normalize(sentence5, :alif_maksura)
Ar("ذَٰلِكَ ٱلْكِتَٰبُ لَا رَيْبَ فِيهِ هُدًي لِّلْمُتَّقِينَ")

julia> sentence6 = Ar("ﷺ")
Ar("ﷺ")

julia> normalize(sentence6) === Ar("صلى الله عليه وسلم")
true

julia> sentence7 = Ar("ﷻ")
Ar("ﷻ")

julia> normalize(sentence7) === Ar("جل جلاله")
true

julia> sentence8 = Ar("﷽")
Ar("﷽")

julia> normalize(sentence8) === ar_basmala
true
Or a combination,
julia> normalize(ar_basmala, [:alif_khanjareeya, :hamzat_wasl])
Ar("بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ")
Broadcasting also applies to the normalize function.
julia> normalize.(sentence0)
2-element Vector{Ar}:
 Ar("بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ")
 Ar("اِيَّاكَ نَعْبُدُ وَاِيَّاكَ نَسْتَعِينُ")

julia> normalize.(sentence0, [:alif_khanjareeya, :alif_hamza_below])
2-element Vector{Ar}:
 Ar("بِسْمِ ٱللَّهِ ٱلرَّحْمَانِ ٱلرَّحِيمِ")
 Ar("اِيَّاكَ نَعْبُدُ وَاِيَّاكَ نَسْتَعِينُ")
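The per-character rules above can be pictured as a small substitution table. The following is a rough sketch on plain strings, assuming each rule is a single character-to-character replacement; the rule names are borrowed from the examples above, but normalize_sketch and NORMALIZATION_RULES are hypothetical names and this is not how Yunir.jl is necessarily implemented.

```julia
# Hypothetical sketch, not Yunir.jl's implementation: normalization as a
# table of single-character substitutions keyed by rule name.
const NORMALIZATION_RULES = Dict(
    :hamzat_wasl      => ('\u0671' => '\u0627'),  # alif wasla -> plain alif
    :alif_khanjareeya => ('\u0670' => '\u0627'),  # dagger alif -> plain alif
    :alif_hamza_below => ('\u0625' => '\u0627'),  # alif with hamza below -> plain alif
)

# Apply the selected rules one after another (all rules by default).
function normalize_sketch(s::AbstractString, rules=collect(keys(NORMALIZATION_RULES)))
    for rule in rules
        s = replace(s, NORMALIZATION_RULES[rule])
    end
    return s
end

# Convenience method for a single rule.
normalize_sketch(s::AbstractString, rule::Symbol) = normalize_sketch(s, [rule])

# "ar-rahmani" with alif wasla at the start and a dagger alif after the mim.
text = "\u0671\u0644\u0631\u0651\u064E\u062D\u0652\u0645\u064E\u0670\u0646\u0650"
only_wasl = normalize_sketch(text, :hamzat_wasl)
both = normalize_sketch(text, [:hamzat_wasl, :alif_khanjareeya])
```

Because each rule in this table rewrites a different source character, the order in which the rules are applied does not change the result.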
Transliteration
By default, Yunir.jl uses extended Buckwalter transliteration. The functions to use are encode (Arabic -> Roman) and arabic (Roman -> Arabic). The following are some examples:
julia> arabic(bw_basmala)
Ar("بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ")

julia> arabic(bw_basmala) === ar_basmala
true

julia> encode(ar_basmala)
Bw("bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi")

julia> encode(ar_basmala) === bw_basmala
true
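Conceptually, a transliteration of this kind is a lookup table applied character by character, with the inverse table used for decoding. The sketch below illustrates that idea on plain strings with a five-character subset of the Buckwalter mapping; TO_ROMAN, TO_ARABIC, to_roman and to_arabic are hypothetical names chosen for this example and are not part of Yunir.jl.

```julia
# Hypothetical sketch: a tiny subset of the Buckwalter table as a Dict{Char,Char}.
const TO_ROMAN = Dict(
    '\u0628' => 'b',  # ba
    '\u0633' => 's',  # sin
    '\u0645' => 'm',  # mim
    '\u0650' => 'i',  # kasra
    '\u0652' => 'o',  # sukun
)

# Invert the table for decoding (valid because the mapping is one-to-one).
const TO_ARABIC = Dict(v => k for (k, v) in TO_ROMAN)

# Map each character through the table; unknown characters pass through unchanged.
to_roman(s::AbstractString) = map(c -> get(TO_ROMAN, c, c), s)
to_arabic(s::AbstractString) = map(c -> get(TO_ARABIC, c, c), s)

# The word "bismi": ba, kasra, sin, sukun, mim, kasra.
word = "\u0628\u0650\u0633\u0652\u0645\u0650"
roman = to_roman(word)        # "bisomi"
restored = to_arabic(roman)   # identical to word
```

The round trip recovers the original string only because every Arabic character in the table maps to a distinct Roman character.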
Custom Transliteration
For custom transliteration, the user must specify the character mapping in a dictionary with Symbol type for both keys and values. By default, the Buckwalter mapping used in Yunir.jl is stored in the constant BW_ENCODING.
julia> BW_ENCODING
Dict{Symbol, Symbol} with 77 entries:
  Symbol("ۣ") => Symbol(";")
  :ة => :p
  :ذ => :*
  :ۥ => Symbol(",")
  Symbol("؍") => :c
  :ء => Symbol("'")
  Symbol("ۜ") => :(:)
  Symbol("َ") => :a
  Symbol("٦") => Symbol("6")
  :ي => :y
  Symbol("ٰ") => Symbol("`")
  :ت => :t
  :ن => :n
  :ب => :b
  :ص => :S
  :ا => :A
  :ث => :v
  :إ => :<
  :ج => :j
  ⋮ => ⋮
Suppose we want to create a custom transliteration by simply reversing the values of the dictionary:
julia> old_keys = collect(keys(BW_ENCODING));

julia> new_vals = reverse(collect(values(BW_ENCODING)));
The new dictionary would be:
julia> my_encoder = Dict(old_keys .=> new_vals)
Dict{Symbol, Symbol} with 77 entries:
  Symbol("ۣ") => :q
  :ة => Symbol("7")
  :ذ => Symbol("(")
  :ۥ => :l
  Symbol("؍") => :u
  :ء => :D
  Symbol("ۜ") => :f
  Symbol("َ") => Symbol("1")
  Symbol("٦") => Symbol("[")
  :ي => Symbol("]")
  Symbol("ٰ") => Symbol("\"")
  :ت => :z
  :ن => Symbol("#")
  :ب => :r
  :ص => :k
  :ا => :Z
  :ث => :~
  :إ => Symbol("9")
  :ج => :i
  ⋮ => ⋮
The next step is to declare this new transliteration so that the functions for dediacritization and normalization can use the new mapping. This is done using the macro @transliterator, which takes two arguments: the dictionary and the type name of the mapping.
julia> @transliterator my_encoder "MyEncoder"

Using this new transliteration, we now have an updated mapping for the basmala above:
julia> encode(ar_basmala)
Bw("rjsFKj 3,,v1Yj 3,bv1}FK1"#j 3,bv1}j]Kj")
Converting this back to Arabic characters should give us the appropriate decoding:
julia> arabic(encode(ar_basmala))
Ar("بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ")
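A round trip like this can only succeed when the custom mapping is one-to-one, since two Arabic symbols sharing the same Roman symbol could not be told apart when decoding. Reversing the values, as done above, merely permutes them and so preserves that property. The sketch below checks this on a toy dictionary standing in for BW_ENCODING; is_invertible and toy_encoding are hypothetical names used only for illustration.

```julia
# Toy mapping standing in for BW_ENCODING (hypothetical data for illustration).
toy_encoding = Dict(:a => :x, :b => :y, :c => :z)

# Same construction as in the text: keep the keys, reverse the values.
# keys and values of the same Dict iterate in matching order.
old_keys = collect(keys(toy_encoding))
new_vals = reverse(collect(values(toy_encoding)))
custom = Dict(old_keys .=> new_vals)

# A mapping can be decoded unambiguously only if no value is repeated.
is_invertible(d::AbstractDict) = allunique(values(d))

ok = is_invertible(custom)                        # the reversed mapping is still one-to-one
clash = is_invertible(Dict(:a => :x, :b => :x))   # two keys share a value
```

Running a check of this kind before registering a hand-written mapping is a cheap way to catch accidental duplicates.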
Dediacritization and Normalization on Custom Transliteration
As mentioned above, dediacritization and normalization also work on a new custom transliteration. For example, dediacritizing the encoded ar_basmala would give us:
julia> dediac(encode(ar_basmala))
ERROR: MethodError: no method matching dediac(::Bw)
The function `dediac` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  dediac(::Union{Ar, Bw}, ::String)
   @ Yunir ~/work/Yunir.jl/Yunir.jl/src/utils/dediac.jl:15

julia> dediac(encode(ar_basmala)) |> arabic
ERROR: MethodError: no method matching dediac(::Bw)
The function `dediac` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  dediac(::Union{Ar, Bw}, ::String)
   @ Yunir ~/work/Yunir.jl/Yunir.jl/src/utils/dediac.jl:15
And for normalization,
julia> normalize(encode(ar_basmala))
Bw("rjsFKj Z,,v1Yj Z,bv1}FK1Z#j Z,bv1}j]Kj")

julia> normalize(encode(ar_basmala)) |> arabic
Ar("بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ")
Reset Transliteration
To reset the transliteration back to Buckwalter, simply specify :default as the argument for the macro @transliterator as follows:
julia> @transliterator :default

With this, all functions dependent on transliteration will also get updated.
julia> encode(ar_basmala)
Bw("bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi")

julia> encode(ar_basmala) === bw_basmala
true

julia> dediac(encode(ar_basmala))
ERROR: MethodError: no method matching dediac(::Bw)
The function `dediac` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  dediac(::Union{Ar, Bw}, ::String)
   @ Yunir ~/work/Yunir.jl/Yunir.jl/src/utils/dediac.jl:15

julia> normalize(encode(ar_basmala))
Bw("bisomi All~ahi Alr~aHomaAni Alr~aHiymi")