Basic Utilities
This section shows how to use the APIs for dediacritization, normalization, and transliteration.
Dediacritization
The function to use is dediac, which works on Arabic, Buckwalter, and custom transliterated characters.
julia> using Yunir
julia> ar_basmala = "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ";
julia> dediac(ar_basmala)
"بسم ٱلله ٱلرحمن ٱلرحيم"
Or using Buckwalter as follows:
julia> bw_basmala = "bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi";
julia> dediac(bw_basmala; isarabic=false)
"bsm {llh {lrHmn {lrHym"
With Julia's broadcasting feature, the above dediacritization can be applied to arrays by appending a dot (.) to the name of the function.
julia> sentence0 = ["بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ", "إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ" ]
2-element Vector{String}:
 "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
 "إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ"
julia> dediac.(sentence0)
2-element Vector{String}:
 "بسم ٱلله ٱلرحمن ٱلرحيم"
 "إياك نعبد وإياك نستعين"
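To make concrete what dediacritization removes, here is a minimal sketch in Base Julia only. It is not how Yunir implements dediac; the helper names is_diacritic and strip_marks are hypothetical, and the set of code points (U+064B to U+0652, plus the superscript alif U+0670) is an assumption that covers the marks appearing in the examples above.

```julia
# Hypothetical helpers, not part of Yunir: drop common Arabic diacritics
# (fathatan..sukun, U+064B..U+0652, plus superscript alif U+0670).
is_diacritic(c::Char) = ('\u064b' <= c <= '\u0652') || c == '\u0670'
strip_marks(s) = filter(!is_diacritic, s)

# Two diacritized words written with explicit code points:
# "bismi" (ba+kasra, sin+sukun, mim+kasra) and "na'budu".
words = [
    "\u0628\u0650\u0633\u0652\u0645\u0650",
    "\u0646\u064e\u0639\u0652\u0628\u064f\u062f\u064f",
]

# Broadcasting with a dot applies the helper to every element.
stripped = strip_marks.(words)
```

Each element of stripped keeps only the base letters, which is the same effect the dediac examples above show on full sentences.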
Normalization
The function to use is normalize, which works on Arabic, Buckwalter, and custom transliterated characters. For example, using the ar_basmala and bw_basmala defined above, the normalized versions are:
julia> normalize(ar_basmala)
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"
julia> normalize(bw_basmala; isarabic=false)
"bisomi All~ahi Alr~aHomaAni Alr~aHiymi"
You can also normalize specific characters, for example:
julia> normalize(ar_basmala, :alif_khanjareeya)
"بِسْمِ ٱللَّهِ ٱلرَّحْمَانِ ٱلرَّحِيمِ"
julia> normalize(ar_basmala, :hamzat_wasl)
"بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"
julia> sentence1 = "وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ أُنزِلَ إِلَيْكَ وَمَآ أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ";
julia> normalize(sentence1, :alif_maddah)
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَا أُنزِلَ إِلَيْكَ وَمَا أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(sentence1, :alif_hamza_above)
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ اُنزِلَ إِلَيْكَ وَمَآ اُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> sentence2 = "إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ";
julia> normalize(sentence2, :alif_hamza_below)
"اِيَّاكَ نَعْبُدُ وَاِيَّاكَ نَسْتَعِينُ"
julia> sentence3 = "ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰةَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ";
julia> normalize(sentence3, :waw_hamza_above)
"ٱلَّذِينَ يُوْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰةَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ"
julia> normalize(sentence3, :ta_marbuta)
"ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰهَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ"
julia> sentence4 = "ٱللَّهُ يَسْتَهْزِئُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ";
julia> normalize(sentence4, :ya_hamza_above)
"ٱللَّهُ يَسْتَهْزِيُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ"
julia> sentence5 = "ذَٰلِكَ ٱلْكِتَٰبُ لَا رَيْبَ فِيهِ هُدًى لِّلْمُتَّقِينَ";
julia> normalize(sentence5, :alif_maksura)
"ذَٰلِكَ ٱلْكِتَٰبُ لَا رَيْبَ فِيهِ هُدًي لِّلْمُتَّقِينَ"
julia> sentence6 = "ﷺ"
"ﷺ"
julia> normalize(sentence6) === "صلى الله عليه وسلم"
true
julia> sentence7 = "ﷻ"
"ﷻ"
julia> normalize(sentence7) === "جل جلاله"
true
julia> sentence8 = "﷽"
"﷽"
julia> normalize(sentence8) === ar_basmala
true
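Judging from the outputs above, each named rule behaves like a character substitution: :hamzat_wasl turns the alif wasla into a plain alif, and :alif_khanjareeya turns the superscript alif into a plain alif. The sketch below mimics that idea in Base Julia. The names TOY_RULES and toy_normalize are hypothetical, the code points are inferred from the examples, and nothing here reflects Yunir's actual implementation.

```julia
# Hypothetical rule table: each rule name maps one code point to another.
TOY_RULES = Dict(
    :hamzat_wasl      => ('\u0671' => '\u0627'),  # alif wasla -> plain alif
    :alif_khanjareeya => ('\u0670' => '\u0627'),  # superscript alif -> plain alif
)

# Apply a single named rule with Base.replace.
toy_normalize(s, rule::Symbol) = replace(s, TOY_RULES[rule])

# Alif wasla, lam, lam, ha, written with explicit code points.
word = "\u0671\u0644\u0644\u0647"
plain = toy_normalize(word, :hamzat_wasl)
```

Characters that the selected rule does not mention are left untouched, which matches how the single-rule calls above change only one kind of character at a time.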
Or a combination of characters, passed as a vector of symbols:
julia> normalize(ar_basmala, [:alif_khanjareeya, :hamzat_wasl])
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"
Broadcasting also applies to the normalize function.
julia> normalize.(sentence0)
2-element Vector{String}:
 "بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"
 "اِيَّاكَ نَعْبُدُ وَاِيَّاكَ نَسْتَعِينُ"
julia> normalize.(sentence0, [:alif_khanjareeya, :alif_hamza_below])
2-element Vector{String}:
 "بِسْمِ ٱللَّهِ ٱلرَّحْمَانِ ٱلرَّحِيمِ"
 "اِيَّاكَ نَعْبُدُ وَاِيَّاكَ نَسْتَعِينُ"
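In the last call, the vector of symbols is itself broadcast, so each string is paired with the symbol at the same index rather than receiving both rules. This is visible in the output, where the first string keeps its hamzat wasl. The sketch below shows the same Base Julia behavior with a toy function; tag is a hypothetical stand-in for any function that accepts either one rule or a vector of rules.

```julia
# Toy stand-in: append the rule name(s) to the string.
tag(s::String, rule::Symbol) = string(s, "+", rule)
tag(s::String, rules::Vector{Symbol}) = foldl(tag, rules; init=s)

words = ["a", "b"]
rules = [:x, :y]

# Both arguments are broadcast: "a" is paired with :x, "b" with :y.
paired = tag.(words, rules)

# Ref makes `rules` behave as a scalar, so every word gets the whole vector.
all_rules = tag.(words, Ref(rules))
```

Under the same Base rule, wrapping the symbol vector in Ref would pass the whole vector to each call; whether that gives the result you want from normalize should be checked against Yunir's own documentation.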
Transliteration
By default, Yunir.jl uses extended Buckwalter transliteration. The functions to use are encode (Arabic -> Roman) and arabic (Roman -> Arabic). The following are some examples:
julia> arabic(bw_basmala)
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
julia> arabic(bw_basmala) === ar_basmala
true
julia> encode(ar_basmala)
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> encode(ar_basmala) === bw_basmala
true
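The round trip above works because the transliteration is a one-to-one character mapping, so it can be inverted. A minimal sketch of that idea in Base Julia follows; enc, dec, toy_encode, and toy_decode are hypothetical names with a three-letter mapping chosen for illustration, and the code is unrelated to Yunir's internals.

```julia
# Toy mapping from Arabic letters (ba, sin, mim) to Roman characters.
enc = Dict('\u0628' => 'b', '\u0633' => 's', '\u0645' => 'm')

# Invert the mapping; this is only safe because no two keys share a value.
dec = Dict(v => k for (k, v) in enc)

# Characters missing from the mapping pass through unchanged.
toy_encode(s) = map(c -> get(enc, c, c), s)
toy_decode(s) = map(c -> get(dec, c, c), s)

word  = "\u0628\u0633\u0645"
roman = toy_encode(word)
back  = toy_decode(roman)
```

If two Arabic characters shared the same Roman value, the inverted dictionary would lose one of them and decoding would no longer recover the original text.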
Custom Transliteration
For custom transliteration, the user must specify the character mapping in a dictionary with Symbol type for both keys and values. By default, the Buckwalter mapping used in Yunir.jl is stored in the constant BW_ENCODING.
julia> BW_ENCODING
Dict{Symbol, Symbol} with 77 entries:
  Symbol("ۣ") => Symbol(";")
  :ة => :p
  :ذ => :*
  :ۥ => Symbol(",")
  Symbol("؍") => :c
  :ء => Symbol("'")
  Symbol("ۜ") => :(:)
  Symbol("َ") => :a
  Symbol("٦") => Symbol("6")
  :ي => :y
  Symbol("ٰ") => Symbol("`")
  :ت => :t
  :ن => :n
  :ب => :b
  :ص => :S
  :ا => :A
  :ث => :v
  :إ => :<
  :ج => :j
  ⋮ => ⋮
Suppose we want to create a custom transliteration by reversing the order of the dictionary's values while keeping the keys in place. We then have the following:
julia> old_keys = collect(keys(BW_ENCODING));
julia> new_vals = reverse(collect(values(BW_ENCODING)));
The new dictionary would be:
julia> my_encoder = Dict(old_keys .=> new_vals)
Dict{Symbol, Symbol} with 77 entries:
  Symbol("ۣ") => :q
  :ة => Symbol("7")
  :ذ => Symbol("(")
  :ۥ => :l
  Symbol("؍") => :u
  :ء => :D
  Symbol("ۜ") => :f
  Symbol("َ") => Symbol("1")
  Symbol("٦") => Symbol("[")
  :ي => Symbol("]")
  Symbol("ٰ") => Symbol("\"")
  :ت => :z
  :ن => Symbol("#")
  :ب => :r
  :ص => :k
  :ا => :Z
  :ث => :~
  :إ => Symbol("9")
  :ج => :i
  ⋮ => ⋮
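The construction above relies on keys and values iterating a Dict in the same order, so reversing the values only reassigns existing values to different keys. The sketch below repeats the steps on a small stand-in dictionary (toy is hypothetical and much smaller than BW_ENCODING) and checks that the result is still a one-to-one mapping.

```julia
# Small stand-in mapping; the real BW_ENCODING has 77 entries.
toy = Dict(:ب => :b, :س => :s, :م => :m)

# Same steps as above: keep the keys, reverse the order of the values.
old_keys = collect(keys(toy))
new_vals = reverse(collect(values(toy)))
shuffled = Dict(old_keys .=> new_vals)

# The keys and the set of values are unchanged, and no value is reused.
same_keys  = Set(keys(shuffled)) == Set(keys(toy))
same_vals  = Set(values(shuffled)) == Set(values(toy))
one_to_one = length(unique(values(shuffled))) == length(shuffled)
```

Because the iteration order of a Dict is not something to depend on across versions or sessions, the exact pairing produced this way may differ from the one printed above; only the one-to-one property is guaranteed.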
The next step is to declare this new transliteration so that the functions for dediacritization and normalization can use the new mapping. This is done with the macro @transliterator, which takes two arguments: the dictionary and the type name of the mapping.
julia> @transliterator my_encoder "MyEncoder"
Using this new transliteration, the basmala above is now encoded as follows:
julia> encode(ar_basmala)
"rjsFKj 3,,v1Yj 3,bv1}FK1\"#j 3,bv1}j]Kj"
Converting this back to Arabic characters should recover the original text:
julia> arabic(encode(ar_basmala))
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
Dediacritization and Normalization on Custom Transliteration
As mentioned above, dediacritization and normalization also work on the new custom transliteration. For example, dediacritizing the encoded ar_basmala gives:
julia> dediac(encode(ar_basmala); isarabic=false)
"rsK 3,,Y 3,b}K# 3,b}]K"
julia> dediac(encode(ar_basmala); isarabic=false) |> arabic
"بسم ٱلله ٱلرحمن ٱلرحيم"
And for normalization:
julia> normalize(encode(ar_basmala); isarabic=false)
"rjsFKj Z,,v1Yj Z,bv1}FK1Z#j Z,bv1}j]Kj"
julia> normalize(encode(ar_basmala); isarabic=false) |> arabic
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"
Reset Transliteration
To reset the transliteration back to Buckwalter, specify :default as the argument for the macro @transliterator as follows:
julia> @transliterator :default
With this, all functions that depend on the transliteration are updated as well.
julia> encode(ar_basmala)
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> encode(ar_basmala) === bw_basmala
true
julia> dediac(encode(ar_basmala); isarabic=false)
"bsm {llh {lrHmn {lrHym"
julia> normalize(encode(ar_basmala); isarabic=false)
"bisomi All~ahi Alr~aHomaAni Alr~aHiymi"