Transliteration
For transliteration, we will use Yunir.jl, a lightweight Arabic NLP toolkit. By default, Yunir.jl uses Buckwalter transliteration, following the encoding of the Quranic Arabic Corpus. Transliteration is done with the encode function. For example, the following transliterates the first verse of Chapter 1:
julia> using QuranTree
julia> using Yunir
julia> crps, tnzl = load(QuranData());
julia> crpsdata = table(crps);
julia> tnzldata = table(tnzl);
julia> vrs = verses(tnzldata[1][1])
1-element Vector{String}: "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
julia> encode(vrs[1])
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
You need to install Yunir.jl to run the above code. To install it, run:
using Pkg
Pkg.add("Yunir")
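To illustrate the idea behind character-level transliteration, here is a minimal sketch in plain Julia. It does not use Yunir.jl: the table TOY_ENCODING and the helper toy_encode are hypothetical names introduced only for this illustration, and the table holds just the five Buckwalter-style pairs needed for the first word of the verse above. How encode is actually implemented inside Yunir.jl is not covered here.

```julia
# Toy character-level table: a hand-picked subset of Buckwalter-style pairs.
# This is an illustration only, NOT Yunir.jl's BW_ENCODING.
const TOY_ENCODING = Dict{Char,Char}(
    '\u0628' => 'b',   # ba
    '\u0633' => 's',   # seen
    '\u0645' => 'm',   # meem
    '\u0650' => 'i',   # kasra
    '\u0652' => 'o',   # sukun
)

# Hypothetical helper: map every character through the table and leave
# characters without an entry (for example, spaces) unchanged.
toy_encode(s::AbstractString) = map(c -> get(TOY_ENCODING, c, c), s)

# First word of the verse, written with explicit code points.
word = "\u0628\u0650\u0633\u0652\u0645\u0650"
toy_result = toy_encode(word)
println(toy_result)   # prints bisomi
```

The result matches the first word of the encode output shown earlier, which suggests the same one-character-to-one-character idea, although the real table is much larger.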
The verses function above is used to extract the corresponding verse from the Qur'an data of type AbstractQuran.
By default, verses returns only the verse form of the table, but it can also return the corresponding verse number instead of the form. For example:
verses(tnzldata, number=true, start_end=true)
verses(tnzldata, number=true, start_end=false)
To extract the words of the corpus, use the function words instead.
The function verses always returns an Array, so multiple verses can be encoded with Julia's . (dot) broadcasting operation. For example, the following transliterates all verses of Chapter 114:
julia> vrs = verses(tnzldata[114])
6-element Vector{String}: "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ قُلْ أَعُوذُ بِرَبِّ ٱلنَّاسِ" "مَلِكِ ٱلنَّاسِ" "إِلَٰهِ ٱلنَّاسِ" "مِن شَرِّ ٱلْوَسْوَاسِ ٱلْخَنَّاسِ" "ٱلَّذِى يُوَسْوِسُ فِى صُدُورِ ٱلنَّاسِ" "مِنَ ٱلْجِنَّةِ وَٱلنَّاسِ"
julia> encode.(vrs)
6-element Vector{String}: "bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi qulo >aEuw*u birab~i {ln~aAsi" "maliki {ln~aAsi" "<ila`hi {ln~aAsi" "min \$ar~i {lowasowaAsi {loxan~aAsi" "{l~a*iY yuwasowisu fiY Suduwri {ln~aAsi" "mina {lojin~api wa{ln~aAsi"
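The dot syntax used above is ordinary Julia broadcasting and is not specific to Yunir.jl. The following sketch shows it with a hypothetical stand-in function, shout, in place of encode, together with two equivalent ways of writing the same element-wise call.

```julia
# Hypothetical stand-in for a per-string function such as encode.
shout(s::AbstractString) = uppercase(s)

verses_demo = ["maliki", "mina"]

# The dot applies shout to each element and returns a new Vector.
broadcasted = shout.(verses_demo)

# Equivalent spellings of the same element-wise call.
mapped = map(shout, verses_demo)
comprehended = [shout(v) for v in verses_demo]

println(broadcasted)
```

All three forms return a Vector{String} of the same length as the input, so the choice between them is mostly a matter of style.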
Decoding
To decode transliterated text back to its Arabic form, Yunir.jl provides the arabic function. For example, the following decodes the transliterated verses of Chapter 114 above back to Arabic:
julia> arabic.(encode.(vrs))
6-element Vector{String}: "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ قُلْ أَعُوذُ بِرَبِّ ٱلنَّاسِ" "مَلِكِ ٱلنَّاسِ" "إِلَٰهِ ٱلنَّاسِ" "مِن شَرِّ ٱلْوَسْوَاسِ ٱلْخَنَّاسِ" "ٱلَّذِى يُوَسْوِسُ فِى صُدُورِ ٱلنَّاسِ" "مِنَ ٱلْجِنَّةِ وَٱلنَّاسِ"
Or, using the CorpusData:
julia> vrs = verses(crpsdata[114])
6-element Vector{String}: "qulo >aEuw*u birab~i {ln~aAsi" "maliki {ln~aAsi" "<ila`hi {ln~aAsi" "min \$ar~i {lowasowaAsi {loxan~aAsi" "{l~a*iY yuwasowisu fiY Suduwri {ln~aAsi" "mina {lojin~api wa{ln~aAsi"
julia> avrs = arabic.(vrs)
6-element Vector{String}: "قُلْ أَعُوذُ بِرَبِّ ٱلنَّاسِ" "مَلِكِ ٱلنَّاسِ" "إِلَٰهِ ٱلنَّاسِ" "مِن شَرِّ ٱلْوَسْوَاسِ ٱلْخَنَّاسِ" "ٱلَّذِى يُوَسْوِسُ فِى صُدُورِ ٱلنَّاسِ" "مِنَ ٱلْجِنَّةِ وَٱلنَّاسِ"
The . (dot) broadcasting is only used for arrays. For a single String input (not an array of String), use arabic(...) without the dot. For example:
arabic(vrs[1])
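Decoding can be pictured as applying the inverse of the encoding table. The sketch below uses plain Julia and a four-entry toy table; toy_table, toy_inverse, toy_enc, and toy_dec are hypothetical names for this illustration and are not part of Yunir.jl. It also contrasts the plain call for one String with the dotted call for a Vector.

```julia
# Toy encoding table (illustrative subset; not Yunir.jl's BW_ENCODING).
toy_table = Dict(
    '\u0645' => 'm',   # meem
    '\u0646' => 'n',   # noon
    '\u0650' => 'i',   # kasra
    '\u064E' => 'a',   # fatha
)

# The inverse mapping exists because every value in the table is unique.
toy_inverse = Dict(v => k for (k, v) in toy_table)

toy_enc(s::AbstractString) = map(c -> get(toy_table, c, c), s)
toy_dec(s::AbstractString) = map(c -> get(toy_inverse, c, c), s)

single = "\u0645\u0650\u0646\u064E"       # the word encoded as "mina"
many = [single, "\u0645\u0650\u0646"]     # adds the word encoded as "min"

roundtrip_one = toy_dec(toy_enc(single))   # plain call for one String
roundtrip_all = toy_dec.(toy_enc.(many))   # dotted calls for a Vector
```

The round trip returns the original text only when the table is one-to-one, which is why each target character appears exactly once in the toy table.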
Custom Transliteration
Creating a custom transliteration requires only an input encoding in the form of a dictionary (Dict). For example, Yunir.jl's Buckwalter encoding is provided by the constant BW_ENCODING, as shown below:
julia> BW_ENCODING
Dict{Symbol, Symbol} with 61 entries: Symbol("ۣ") => Symbol(";") :ة => :p :ذ => :* :ۥ => Symbol(",") :ء => Symbol("'") Symbol("ۜ") => :(:) Symbol("َ") => :a Symbol("ٰ") => Symbol("`") :ي => :y :ت => :t :ن => :n :ب => :b :ص => :S :ا => :A :ث => :v :إ => :< :ج => :j :ى => :Y Symbol("ٍ") => :K ⋮ => ⋮
Suppose we want to create a new transliteration by simply reversing the order of the dictionary's values. This is done as follows:
julia> old_keys = collect(keys(BW_ENCODING));
julia> new_vals = reverse(collect(values(BW_ENCODING)));
julia> my_encoder = Dict(old_keys .=> new_vals)
Dict{Symbol, Symbol} with 61 entries: Symbol("ۣ") => :q :ة => Symbol("(") :ذ => :l :ۥ => :u :ء => :D Symbol("ۜ") => :f Symbol("َ") => Symbol("[") Symbol("ٰ") => :z :ي => Symbol("#") :ت => :r :ن => :k :ب => :Z :ص => Symbol("]") :ا => Symbol("\"") :ث => :~ :إ => :i :ج => :m :ى => :_ Symbol("ٍ") => :h ⋮ => ⋮
julia> @transliterator my_encoder "MyEncoder"
The macro @transliterator is used for updating the transliteration, and it takes two inputs: the dictionary (my_encoder) and the name of the encoding ("MyEncoder"). Using this new encoding, the avrs above will have a new transliteration:
julia> new_vrs = encode.(avrs);
julia> new_vrs
6-element Vector{String}: ";,*g ^[},-l, Z<t[Zv< H*kv[\"s<" "j[*<n< H*kv[\"s<" "i<*[zK< H*kv[\"s<" "j<k %[tv< H*g-[sg-[\"s< H*g+[kv[\"s<" "H*v[l<_ #,-[sg-<s, :<_ ],!,-t< H*kv[\"s<" "j<k[ H*gm<kv[(< -[H*kv[\"s<"
To confirm this new transliteration, decoding it back to Arabic should reproduce the original verses:
julia> arabic.(new_vrs)
6-element Vector{String}: "قُلْ أَعُوذُ بِرَبِّ ٱلنَّاسِ" "مَلِكِ ٱلنَّاسِ" "إِلَٰهِ ٱلنَّاسِ" "مِن شَرِّ ٱلْوَسْوَاسِ ٱلْخَنَّاسِ" "ٱلَّذِى يُوَسْوِسُ فِى صُدُورِ ٱلنَّاسِ" "مِنَ ٱلْجِنَّةِ وَٱلنَّاسِ"
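The round trip above works because reversing the values only permutes them, so the custom dictionary remains one-to-one. The following sketch checks this on a four-entry toy dictionary shaped like BW_ENCODING. The names toy_bw and toy_custom are hypothetical, and the check is a sanity test written for this illustration, not something Yunir.jl is known to perform.

```julia
# Toy encoding with Symbol keys and values, mirroring the shape of BW_ENCODING
# (the four entries are an illustrative subset, not the real table).
toy_bw = Dict(:ب => :b, :ت => :t, :ن => :n, :ص => :S)

# Same construction as in the text: keep the keys, reverse the values.
# keys and values iterate a Dict in the same order, so the pairing is well defined.
old_keys = collect(keys(toy_bw))
new_vals = reverse(collect(values(toy_bw)))
toy_custom = Dict(old_keys .=> new_vals)

# A usable encoder must stay one-to-one so that decoding is unambiguous.
is_one_to_one = length(unique(values(toy_custom))) == length(toy_custom)

# Reversal only permutes the values, so the set of output symbols is unchanged.
same_symbols = Set(values(toy_custom)) == Set(values(toy_bw))

println((is_one_to_one, same_symbols))   # prints (true, true)
```

A dictionary that maps two Arabic characters to the same symbol would fail the first check, and decoding with it could not recover the original text.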
To reset the transliteration, simply run the following:
julia> @transliterator :default
This falls back to the Buckwalter transliteration, as shown below:
julia> bw_vrs = encode.(avrs);
julia> bw_vrs
6-element Vector{String}: "qulo >aEuw*u birab~i {ln~aAsi" "maliki {ln~aAsi" "<ila`hi {ln~aAsi" "min \$ar~i {lowasowaAsi {loxan~aAsi" "{l~a*iY yuwasowisu fiY Suduwri {ln~aAsi" "mina {lojin~api wa{ln~aAsi"
julia> arabic.(bw_vrs)
6-element Vector{String}: "قُلْ أَعُوذُ بِرَبِّ ٱلنَّاسِ" "مَلِكِ ٱلنَّاسِ" "إِلَٰهِ ٱلنَّاسِ" "مِن شَرِّ ٱلْوَسْوَاسِ ٱلْخَنَّاسِ" "ٱلَّذِى يُوَسْوِسُ فِى صُدُورِ ٱلنَّاسِ" "مِنَ ٱلْجِنَّةِ وَٱلنَّاسِ"
Simple Encoding
Another feature supported in QuranTree.jl is the Simple Encoding. For example, the following loads the first five verses of Chapter 1, applies the Simple Encoding to the first verse, and then applies it to all five verses with broadcasting:
julia> vrs = verses(tnzldata[1][1:5])
5-element Vector{String}: "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ" "ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ" "ٱلرَّحْمَٰنِ ٱلرَّحِيمِ" "مَٰلِكِ يَوْمِ ٱلدِّينِ" "إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ"
julia> parse(SimpleEncoding, vrs[1])
"Ba+Kasra | Seen+Sukun | Meem+Kasra | <space> | AlifHamzatWasl | Lam | Lam+Shadda+Fatha | Ha+Kasra | <space> | AlifHamzatWasl | Lam | Ra+Shadda+Fatha | HHa+Sukun | Meem+Fatha+AlifKhanjareeya | Noon+Kasra | <space> | AlifHamzatWasl | Lam | Ra+Shadda+Fatha | HHa+Kasra | Ya | Meem+Kasra"
julia> parse.(SimpleEncoding, vrs)
5-element Vector{String}: "Ba+Kasra | Seen+Sukun | Meem+Kasra | <space> | AlifHamzatWasl | Lam | Lam+Shadda+Fatha | Ha+Kasra | <space> | AlifHamzatWasl | Lam | Ra+Shadda+Fatha | HHa+Sukun | Meem+Fatha+AlifKhanjareeya | Noon+Kasra | <space> | AlifHamzatWasl | Lam | Ra+Shadda+Fatha | HHa+Kasra | Ya | Meem+Kasra" "AlifHamzatWasl | Lam+Sukun | HHa+Fatha | Meem+Sukun | Dal+Damma | <space> | Lam+Kasra | Lam+Shadda+Fatha | Ha+Kasra | <space> | Ra+Fatha | Ba+Shadda+Kasra | <space> | AlifHamzatWasl | Lam+Sukun | Ain+Fatha+AlifKhanjareeya | Lam+Fatha | Meem+Kasra | Ya | Noon+Fatha" "AlifHamzatWasl | Lam | Ra+Shadda+Fatha | HHa+Sukun | Meem+Fatha+AlifKhanjareeya | Noon+Kasra | <space> | AlifHamzatWasl | Lam | Ra+Shadda+Fatha | HHa+Kasra | Ya | Meem+Kasra" "Meem+Fatha+AlifKhanjareeya | Lam+Kasra | Kaf+Kasra | <space> | Ya+Fatha | Waw+Sukun | Meem+Kasra | <space> | AlifHamzatWasl | Lam | Dal+Shadda+Kasra | Ya | Noon+Kasra" "AlifHamzaBelow+Kasra | Ya+Shadda+Fatha | Alif | Kaf+Fatha | <space> | Noon+Fatha | Ain+Sukun | Ba+Damma | Dal+Damma | <space> | Waw+Fatha | AlifHamzaBelow+Kasra | Ya+Shadda+Fatha | Alif | Kaf+Fatha | <space> | Noon+Fatha | Seen+Sukun | Ta+Fatha | Ain+Kasra | Ya | Noon+Damma"
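The Simple Encoding output is a plain String, so it can be split back into letters and diacritics with Base Julia. The sketch below assumes the separators seen in the output above (" | " between letters and "+" between a letter and its diacritics) and uses the first six items of the first verse as input. The variable names are introduced here for illustration only.

```julia
# First six items of the Simple Encoding output for verse 1, copied from above.
simple = "Ba+Kasra | Seen+Sukun | Meem+Kasra | <space> | AlifHamzatWasl | Lam"

# Each " | "-separated item is one letter; "+" joins the letter to its diacritics.
items = split(simple, " | ")
parts = [split(item, "+") for item in items]

# The first piece of each item is the letter name, the rest are its diacritics.
letters = [String(first(p)) for p in parts]
diacritics = [String.(p[2:end]) for p in parts]

for (l, d) in zip(letters, diacritics)
    println(l, " => ", isempty(d) ? "(none)" : join(d, ", "))
end
```

Items without a "+" (such as <space>, AlifHamzatWasl, and Lam here) end up with an empty diacritics list.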