Data Processing

Yunir.jl provides utilities for data preprocessing in Arabic Natural Language Processing (ANLP), including character dediacritization and character normalization.

Character Dediacritization

dediac works on Arabic text as well as on Buckwalter and custom transliterations.

julia> using QuranTree
julia> using Yunir
julia> crps, tnzl = load(QuranData());
julia> crpsdata = table(crps);
julia> tnzldata = table(tnzl);
julia> avrs = verses(tnzldata[1][1])[1]
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
julia> dediac(avrs)
"بسم ٱلله ٱلرحمن ٱلرحيم"
julia> bvrs = verses(crpsdata[1][1])[1]
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> dediac(bvrs)
"bsm {llh {lrHmn {lrHym"
julia> dediac(avrs) === arabic(dediac(bvrs))
true
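
To illustrate what dediacritization involves, here is a minimal sketch in plain Julia that does not use Yunir.jl. The names is_diacritic and strip_diacritics and the chosen code point range are assumptions made for illustration; dediac itself may cover a different set of marks.

```julia
# Hypothetical helpers for illustration only; not part of Yunir.jl.
# Assumption: the diacritics are the Unicode marks U+064B..U+0652
# (tanween, fatha, damma, kasra, shadda, sukun) plus U+0670 (dagger alif).
is_diacritic(c::Char) = ('\u064B' <= c <= '\u0652') || c == '\u0670'

# Keep every character that is not a diacritic.
strip_diacritics(s::AbstractString) = filter(!is_diacritic, s)

# "bismi" with its kasra and sukun marks, written with escapes for clarity.
word = "\u0628\u0650\u0633\u0652\u0645\u0650"
println(strip_diacritics(word))  # only the base letters ba, seen, meem remain
```

Characters outside the assumed range, such as the alif wasla (U+0671), pass through unchanged, which matches the dediac output shown above.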

Text in a custom transliteration can also be dediacritized, as shown below:

julia> old_keys = collect(keys(BW_ENCODING));
julia> new_vals = reverse(collect(values(BW_ENCODING)));
julia> my_encoder = Dict(old_keys .=> new_vals);
julia> @transliterator my_encoder "MyEncoder"
julia> encode(avrs)
"Z<sgj< H**v[K< H*tv[{gj[zk< H*tv[{<#j<"
julia> arabic(encode(avrs))
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
julia> dediac(encode(avrs))
"Zsj H**K H*t{jk H*t{#j"
julia> arabic(dediac(encode(avrs)))
"بسم ٱلله ٱلرحمن ٱلرحيم"
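
A round trip like the one above can only work if the encoding maps each Arabic character to a distinct transliteration character, so that the mapping can be inverted. The following plain-Julia sketch shows that idea with a hypothetical three-letter table. The names toy_encoding, toy_decoding, toy_encode and toy_decode are illustrative and are not part of Yunir.jl.

```julia
# Hypothetical toy table; the real BW_ENCODING covers many more characters.
const toy_encoding = Dict('\u0628' => 'b', '\u0633' => 's', '\u0645' => 'm')

# Invert the table; this is only well defined when the values are unique.
const toy_decoding = Dict(v => k for (k, v) in toy_encoding)

# Map each character through a table, leaving unknown characters untouched.
toy_encode(s::AbstractString) = map(c -> get(toy_encoding, c, c), s)
toy_decode(s::AbstractString) = map(c -> get(toy_decoding, c, c), s)

word = "\u0628\u0633\u0645"                    # ba, seen, meem
@assert toy_encode(word) == "bsm"
@assert toy_decode(toy_encode(word)) == word   # the round trip restores the input
```

Reassigning the same set of values to the keys in a different order, as my_encoder does, keeps the values unique, so the inverse table still exists.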

To reset the transliteration to the default:

julia> @transliterator :default
julia> encode(avrs)
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> dediac(encode(avrs))
"bsm {llh {lrHmn {lrHym"

Character Normalization

Normalization is done using the normalize function. It works for Arabic, Buckwalter and other custom transliterations. For example, the following normalizes the avrs above:

julia> normalize(avrs)
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"
julia> normalize(dediac(avrs))
"بسم الله الرحمن الرحيم"
julia> dediac(normalize(avrs))
"بسم الله الرحمان الرحيم"
julia> # using pipe notation
julia> avrs |> dediac |> normalize |> encode
"bsm Allh AlrHmn AlrHym"

Specific characters can also be normalized by passing their names:

julia> avrs1 = verses(tnzldata[2][4])[1]
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ أُنزِلَ إِلَيْكَ وَمَآ أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(avrs1, :alif_maddah)
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَا أُنزِلَ إِلَيْكَ وَمَا أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(avrs1, :alif_hamza_above)
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ اُنزِلَ إِلَيْكَ وَمَآ اُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(avrs, [:alif_khanjareeya, :hamzat_wasl])
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"

Or, using the CorpusData instead of the TanzilData:

julia> avrs2 = arabic(verses(crpsdata[2][15])[1])
"ٱللَّهُ يَسْتَهْزِئُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ"
julia> normalize(avrs2, :ya_hamza_above)
"ٱللَّهُ يَسْتَهْزِيُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ"
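
The selective normalization shown above can be thought of as a table of named character replacements, from which either all rules or only the requested ones are applied. The following is a minimal sketch of that idea in plain Julia, not Yunir.jl's implementation. The names toy_rules and toy_normalize are hypothetical, and the code points attached to each rule name are assumptions.

```julia
# Hypothetical rule table for illustration; the rule names mirror the symbols
# used above, but the code points are assumptions, not Yunir.jl's definitions.
const toy_rules = Dict(
    :hamzat_wasl      => ('\u0671' => '\u0627'),  # alif wasla -> bare alif
    :alif_hamza_above => ('\u0623' => '\u0627'),  # alif with hamza above -> bare alif
    :ya_hamza_above   => ('\u0626' => '\u064A'),  # ya with hamza above -> ya
)

# Apply only the selected rules, one after another.
function toy_normalize(s::AbstractString, rules::Vector{Symbol})
    for r in rules
        s = replace(s, toy_rules[r])
    end
    return s
end

# Convenience methods: a single rule, or every rule in the table.
toy_normalize(s::AbstractString, rule::Symbol) = toy_normalize(s, [rule])
toy_normalize(s::AbstractString) = toy_normalize(s, collect(keys(toy_rules)))

# Alif wasla followed by lam: only the first character is rewritten.
@assert toy_normalize("\u0671\u0644", :hamzat_wasl) == "\u0627\u0644"
```

Keeping each rule as a separate named entry is what makes it possible to request a single rule, a list of rules, or all of them through the same function.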