Arabic Disambiguation

In this section, we apply a model estimated by Maximum Likelihood Estimation (MLE) to disambiguate Arabic text that has no diacritics. As always, load the data as follows:

julia> using QuranTree

julia> crps, tnzl = load(QuranData());

julia> crpsdata = table(crps);

julia> tnzldata = table(tnzl);

For this task, we are going to use the last verse of Chapter 1.

julia> avrs1 = verses(tnzldata[1][7])[1]
"صِرَٰطَ ٱلَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ ٱلْمَغْضُوبِ عَلَيْهِمْ وَلَا ٱلضَّآلِّينَ"

The input must have no diacritics, so remove them with dediac:

julia> avrs1 = avrs1 |> dediac
"صرٰط ٱلذين أنعمت عليهم غير ٱلمغضوب عليهم ولا ٱلضالين"
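To make the stripping step concrete, here is a minimal sketch of removing short-vowel marks from a string. It is not QuranTree's dediac implementation: the helper names is_haraka and strip_harakat, and the choice of the Unicode range U+064B to U+0652 (fathatan through sukun), are assumptions made for illustration, and dediac may treat other marks differently.

```julia
# Hypothetical helpers, not part of QuranTree: drop the Arabic harakat
# in the Unicode range U+064B (fathatan) to U+0652 (sukun).
is_haraka(c::Char) = '\u064b' <= c <= '\u0652'

strip_harakat(s::AbstractString) = filter(!is_haraka, s)

# Three consonants (kaf, ta, ba), each followed by a fatha (U+064E)
word = "\u0643\u064e\u062a\u064e\u0628\u064e"
bare = strip_harakat(word)

println(length(word), " characters before, ", length(bare), " after")
```

Writing the test word with escape sequences keeps the code points explicit, which helps when combining marks are hard to see in an editor.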

Inferring

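To infer the diacritics, run the following. The disambiguator comes from the Python package camel_tools, which is called through PyCall and must be installed in the Python environment that PyCall uses: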

julia> using Pkg

julia> Pkg.add("PyCall")
  Resolving package versions...
No Changes to `~/work/QuranTree.jl/QuranTree.jl/docs/Project.toml`
No Changes to `~/work/QuranTree.jl/QuranTree.jl/docs/Manifest.toml`

julia> using PyCall

julia> camel_disambig = pyimport("camel_tools.disambig.mle");

julia> mled = camel_disambig.MLEDisambiguator.pretrained()
PyObject <camel_tools.disambig.mle.MLEDisambiguator object at 0x7fc562dc17f0>

julia> disambig = mled.disambiguate(split(avrs1))
9-element Array{Tuple{String,Array{Tuple{Float64,Dict{Any,Any}},1}},1}:
 ("صرٰط", [(1.0, Dict("form_num" => "s","root" => "ص.ر.ط","prc1" => "0","pos_lex_logprob" => -99.0,"vox" => "na","diac" => "صُرُط","cas" => "u","bw" => "صُرُط/NOUN","ud" => "NOUN","rat" => "i"…))])
 ("ٱلذين", [(1.0, Dict("form_num" => "p","root" => "#.ل","prc1" => "0","pos_lex_logprob" => -1.941913,"vox" => "na","diac" => "الَّذِينَ","cas" => "u","bw" => "الَّذِينَ/REL_PRON","ud" => "PRON","rat" => "y"…))])
 ("أنعمت", [(1.0, Dict("form_num" => "s","root" => "ن.ع.م","prc1" => "0","pos_lex_logprob" => -99.0,"vox" => "a","diac" => "أَنْعَمْتُ","cas" => "na","bw" => "أَنْعَم/PV+تُ/PVSUFF_SUBJ:1S","ud" => "VERB","rat" => "n"…))])
 ("عليهم", [(1.0, Dict("form_num" => "na","root" => "ع.ل.#","prc1" => "0","pos_lex_logprob" => -1.819512,"vox" => "na","diac" => "عَلَيهِم","cas" => "na","bw" => "عَلَي/PREP+هِم/PRON_3MP","ud" => "ADP+PRON","rat" => "na"…))])
 ("غير", [(1.0, Dict("form_num" => "s","root" => "غ.#.ر","prc1" => "0","pos_lex_logprob" => -2.845457,"vox" => "na","diac" => "غَيْرِ","cas" => "g","bw" => "غَيْر/NOUN+ِ/CASE_DEF_GEN","ud" => "NOUN","rat" => "i"…))])
 ("ٱلمغضوب", [(1.0, Dict("form_num" => "s","root" => "غ.ض.ب","prc1" => "0","pos_lex_logprob" => -99.0,"vox" => "na","diac" => "المَغْضُوب","cas" => "u","bw" => "ال/DET+مَغْضُوب/ADJ","ud" => "DET+ADJ","rat" => "n"…))])
 ("عليهم", [(1.0, Dict("form_num" => "na","root" => "ع.ل.#","prc1" => "0","pos_lex_logprob" => -1.819512,"vox" => "na","diac" => "عَلَيهِم","cas" => "na","bw" => "عَلَي/PREP+هِم/PRON_3MP","ud" => "ADP+PRON","rat" => "na"…))])
 ("ولا", [(1.0, Dict("form_num" => "na","root" => "ل.#","prc1" => "0","pos_lex_logprob" => -2.435585,"vox" => "na","diac" => "وَلا","cas" => "na","bw" => "وَ/CONJ+لا/NEG_PART","ud" => "CONJ+PART","rat" => "na"…))])
 ("ٱلضالين", [(1.0, Dict("form_num" => "d","root" => "ض.ل.ل","prc1" => "0","pos_lex_logprob" => -5.400551,"vox" => "na","diac" => "الضالَّيْنِ","cas" => "a","bw" => "ال/DET+ضالّ/NOUN+َيْنِ/NSUFF_MASC_DU_ACC","ud" => "DET+NOUN","rat" => "i"…))])
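The nesting of this result can be hard to read from the printed output, so the sketch below walks through it on a small hand-written stand-in. The variable mock and the helper top_features are hypothetical; the two entries copy only the "diac" and "ud" fields from the output above and omit everything else.

```julia
# Hand-written stand-in with the same shape as `disambig`:
# each entry is (input word, [(score, feature dictionary), ...]).
mock = [
    ("غير", [(1.0, Dict("diac" => "غَيْرِ", "ud" => "NOUN"))]),
    ("ولا", [(1.0, Dict("diac" => "وَلا", "ud" => "CONJ+PART"))]),
]

# Hypothetical helper: feature dictionary of the first-listed analysis.
top_features(entry) = entry[2][1][2]

for entry in mock
    word, analyses = entry          # unpack the (word, analyses) pair
    score = analyses[1][1]          # score of the first-listed analysis
    feats = top_features(entry)
    println(word, " => ", feats["diac"], " (", feats["ud"], ", score ", score, ")")
end
```

The same indexing pattern, entry[2][1][2], is what selects the feature dictionary in the real result.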

Extracting Diacritized Output

Finally, take the diac field of the first analysis of each word and join the results into a single string:

julia> join([d[2][1][2]["diac"] for d in disambig], " ")
"صُرُط الَّذِينَ أَنْعَمْتُ عَلَيهِم غَيْرِ المَغْضُوب عَلَيهِم وَلا الضالَّيْنِ"
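The predicted string is not identical to the original verse; for example, the third word comes out as أَنْعَمْتُ rather than أَنْعَمْتَ. One simple way to see where the two differ is an exact word-by-word comparison, sketched below. The helper word_agreement is hypothetical, and the toy input is a transliteration rather than real model output. Because the comparison is exact, purely orthographic differences such as ٱ versus ا also count as mismatches.

```julia
# Hypothetical helper: compare two space-separated strings word by word
# and return one Bool per word position.
function word_agreement(reference::AbstractString, predicted::AbstractString)
    ref_words = split(reference)
    pred_words = split(predicted)
    length(ref_words) == length(pred_words) ||
        throw(ArgumentError("inputs must contain the same number of words"))
    return [r == p for (r, p) in zip(ref_words, pred_words)]
end

# Toy transliterated example (not real model output)
matches = word_agreement("ghayri al-maghDuubi", "ghayri al-maghDuub")
println(count(matches), " of ", length(matches), " words agree")
```

With the variables from this section, the reference would be the original diacritized verse and the prediction would be the joined string above.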