Data Processing
Yunir.jl provides utilities for Arabic Natural Language Processing (ANLP) preprocessing tasks such as character dediacritization and character normalization.
Character Dediacritization
The dediac function removes diacritics. It works for Arabic, Buckwalter, and custom transliterations.
julia> using QuranTree
julia> using Yunir
julia> crps, tnzl = load(QuranData());
julia> crpsdata = table(crps);
julia> tnzldata = table(tnzl);
julia> avrs = verses(tnzldata[1][1])[1]
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
julia> dediac(avrs)
"بسم ٱلله ٱلرحمن ٱلرحيم"
julia> bvrs = verses(crpsdata[1][1])[1]
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> dediac(bvrs)
"bsm {llh {lrHmn {lrHym"
julia> dediac(avrs) === arabic(dediac(bvrs))
true
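To make the idea concrete, here is a minimal sketch of dediacritization on Arabic script using only Base Julia. It is not how Yunir.jl implements dediac; the helper names is_arabic_diacritic and strip_diacritics are hypothetical, and the set of code points treated as diacritics (U+064B to U+0652 plus the dagger alif U+0670) is an assumption chosen to match the output shown above.

```julia
# Hypothetical helper, NOT Yunir.jl's implementation.
# Assumption: a "diacritic" is any code point in U+064B..U+0652
# (tanween, short vowels, shadda, sukun) or the dagger alif U+0670.
is_arabic_diacritic(c::Char) = ('\u064b' <= c <= '\u0652') || c == '\u0670'

# Keep only the characters that are not diacritics.
strip_diacritics(s::AbstractString) = filter(!is_arabic_diacritic, s)

word = "بِسْمِ"                      # first word of avrs above
bare = strip_diacritics(word)
println(bare)
println(length(word), " -> ", length(bare))   # character counts before and after
```

In Buckwalter the same marks are ordinary ASCII characters (for example i, o, a, ~ and ` in bvrs), so a code point range test like this one would not apply there; a transliteration-aware function such as dediac is needed instead.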
Text in a custom transliteration can also be dediacritized, as shown below,
julia> old_keys = collect(keys(BW_ENCODING));
julia> new_vals = reverse(collect(values(BW_ENCODING)));
julia> my_encoder = Dict(old_keys .=> new_vals);
julia> @transliterator my_encoder "MyEncoder"
julia> encode(avrs)
"Z<sgj< H**v[K< H*tv[{gj[zk< H*tv[{<#j<"
julia> arabic(encode(avrs))
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
julia> dediac(encode(avrs))
"Zsj H**K H*t{jk H*t{#j"
julia> arabic(dediac(encode(avrs)))
"بسم ٱلله ٱلرحمن ٱلرحيم"
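The custom encoder above is a one-to-one remapping of characters, which is why arabic(encode(avrs)) recovers the original verse. The sketch below illustrates that round trip with a toy three-letter table in Base Julia. The names toy_encoding, toy_decoding, toy_encode and toy_decode are hypothetical and the table is not BW_ENCODING; it only assumes that a transliteration can be modeled as an invertible character map.

```julia
# Hypothetical three-entry table for illustration only;
# it is NOT Yunir.jl's BW_ENCODING.
toy_encoding = Dict('\u0628' => 'b',   # ba
                    '\u0633' => 's',   # sin
                    '\u0645' => 'm')   # mim

# Invert the table; this is only valid because the mapping is one-to-one.
toy_decoding = Dict(v => k for (k, v) in toy_encoding)

# Map each character through a table, leaving unknown characters untouched.
toy_encode(s::AbstractString) = map(c -> get(toy_encoding, c, c), s)
toy_decode(s::AbstractString) = map(c -> get(toy_decoding, c, c), s)

word = "\u0628\u0633\u0645"            # the three letters ba, sin, mim
encoded = toy_encode(word)
println(encoded)
println(toy_decode(encoded) == word)   # round trip check
```

If two Arabic characters were mapped to the same symbol, the inverted table would lose one of them and decoding would no longer be exact, so a custom encoder should keep its values distinct.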
To reset the transliteration,
julia> @transliterator :default
julia> encode(avrs)
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> dediac(encode(avrs))
"bsm {llh {lrHmn {lrHym"
Character Normalization
Normalization is done using the normalize function. It works for Arabic, Buckwalter, and other custom transliterations. For example, the following normalizes the avrs defined above:
julia> normalize(avrs)
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"
julia> normalize(dediac(avrs))
"بسم الله الرحمن الرحيم"
julia> dediac(normalize(avrs))
"بسم الله الرحمان الرحيم"
julia> # using pipe notation
julia> avrs |> dediac |> normalize |> encode
"bsm Allh AlrHmn AlrHym"
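Note that normalize(dediac(avrs)) and dediac(normalize(avrs)) above differ in the third word. Judging from the outputs, normalize turns the dagger alif into a plain alif, while dediac deletes it, so whichever runs first decides whether the alif survives. The Base Julia sketch below reproduces that effect. The functions toy_dediac and toy_normalize are hypothetical stand-ins that cover only the characters needed here; they are not Yunir.jl's implementations.

```julia
# Hypothetical stand-ins, NOT Yunir.jl's dediac/normalize.
const DAGGER_ALIF = '\u0670'   # superscript (dagger) alif
const ALIF        = '\u0627'   # plain alif
const ALIF_WASLA  = '\u0671'   # hamzat wasl

# Assumption: dediacritization deletes U+064B..U+0652 and the dagger alif.
toy_dediac(s::AbstractString) =
    filter(c -> !(('\u064b' <= c <= '\u0652') || c == DAGGER_ALIF), s)

# Assumption: normalization rewrites dagger alif and hamzat wasl to plain alif.
toy_normalize(s::AbstractString) =
    map(c -> (c == DAGGER_ALIF || c == ALIF_WASLA) ? ALIF : c, s)

# ra, fatha, ha, sukun, mim, fatha, dagger alif, nun
word = "\u0631\u064e\u062d\u0652\u0645\u064e\u0670\u0646"

a = toy_normalize(toy_dediac(word))   # dagger alif deleted before normalization
b = toy_dediac(toy_normalize(word))   # dagger alif promoted to alif first
println(a)
println(b)
println(a == b)
```

When the extra alif matters for downstream matching (for example, dictionary lookup), it is worth fixing one order and applying it consistently across the corpus.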
Specific characters can also be normalized by passing their names:
julia> avrs1 = verses(tnzldata[2][4])[1]
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ أُنزِلَ إِلَيْكَ وَمَآ أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(avrs1, :alif_maddah)
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَا أُنزِلَ إِلَيْكَ وَمَا أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(avrs1, :alif_hamza_above)
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ اُنزِلَ إِلَيْكَ وَمَآ اُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(avrs, [:alif_khanjareeya, :hamzat_wasl])
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"
Or, using CorpusData instead of TanzilData,
julia> avrs2 = arabic(verses(crpsdata[2][15])[1])
"ٱللَّهُ يَسْتَهْزِئُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ"
julia> normalize(avrs2, :ya_hamza_above)
"ٱللَّهُ يَسْتَهْزِيُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ"