Data Processing

Yunir.jl provides utilities for data preprocessing in Arabic Natural Language Processing (ANLP), such as character dediacritization and character normalization.

Character Dediacritization

The dediac function works for Arabic, Buckwalter, and custom transliterations.

julia> using QuranTree
julia> using Yunir
julia> @transliterator :default
julia> crps, tnzl = load(QuranData());
julia> crpsdata = table(crps);
julia> tnzldata = table(tnzl);
julia> avrs = verses(tnzldata[1][1])[1]
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
julia> dediac(avrs)
"بسم ٱلله ٱلرحمن ٱلرحيم"
julia> bvrs = verses(crpsdata[1][1])[1]
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> dediac(bvrs)
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> dediac(avrs) === arabic(dediac(bvrs))
false
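
To make the operation concrete, the following sketch shows what dediacritization amounts to at the Unicode level: dropping the combining marks that encode short vowels and related signs. It uses only Julia's Base and is not Yunir.jl's implementation. The helper name strip_diacritics and the chosen set of code points (U+064B to U+0652 plus the superscript alef U+0670) are assumptions made for illustration.

```julia
# Illustrative only: `strip_diacritics` is a hypothetical helper, not a Yunir.jl function.
# Assumed diacritic set: fathatan through sukun (U+064B-U+0652) and superscript alef (U+0670).
const DIACRITICS = Set{Char}(vcat(collect('\u064B':'\u0652'), ['\u0670']))

# Keep every character that is not in the diacritic set.
strip_diacritics(s::AbstractString) = filter(c -> !(c in DIACRITICS), s)

word = "\u0628\u0650\u0633\u0652\u0645\u0650"   # ba, kasra, seen, sukun, meem, kasra
bare = strip_diacritics(word)
println(bare == "\u0628\u0633\u0645")            # prints true
```

Base letters such as hamzat wasl (U+0671) are not in this set, so they pass through unchanged, which matches the dediac(avrs) output above.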

A custom transliteration can also be dediacritized, as shown below:

julia> old_keys = collect(keys(BW_ENCODING));
julia> new_vals = reverse(collect(values(BW_ENCODING)));
julia> my_encoder = Dict(old_keys .=> new_vals);
julia> @transliterator my_encoder "MyEncoder"
julia> encode(avrs)
"rjsFKj 3,,v1Yj 3,bv1}FK1\"#j 3,bv1}j]Kj"
julia> arabic(encode(avrs))
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
julia> dediac(encode(avrs))
"rjsFKj 3,,v1Yj 3,bv1}FK1\"#j 3,bv1}j]Kj"
julia> arabic(dediac(encode(avrs)))
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
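
The custom encoder above pairs the existing keys with the same values in reversed order, so the result is a permutation of the original symbols and stays invertible. The sketch below illustrates that idea with a toy three-letter table in plain Julia. The names toy_encoding and translit are hypothetical, and nothing here reflects how Yunir.jl stores or applies its encodings.

```julia
# Illustrative only: a toy three-letter table, not Yunir.jl's BW_ENCODING.
toy_encoding = Dict('\u0628' => 'b', '\u0633' => 's', '\u0645' => 'm')

# Reversing the values against the same keys permutes the symbols,
# which keeps the mapping one-to-one and therefore invertible.
ks = collect(keys(toy_encoding))
vs = reverse(collect(values(toy_encoding)))
custom = Dict(ks .=> vs)
inverse = Dict(v => k for (k, v) in custom)

# Map each character through a table, leaving unknown characters untouched.
translit(table, s) = map(c -> get(table, c, c), s)

text = "\u0628\u0633\u0645"
println(translit(inverse, translit(custom, text)) == text)   # prints true
```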

To reset the transliteration to the default:

julia> @transliterator :default
julia> encode(avrs)
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> dediac(encode(avrs))
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"

Character Normalization

Normalization is done with the normalize function, which works for Arabic, Buckwalter, and custom transliterations. For example, the following normalizes avrs from above:

julia> normalize(avrs)
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"
julia> normalize(dediac(avrs))
"بسم الله الرحمن الرحيم"
julia> dediac(normalize(avrs))
"بسم الله الرحمان الرحيم"
julia> avrs |> dediac |> normalize |> encode # using pipe notation
"bsm Allh AlrHmn AlrHym"
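
Note that the order of the two steps changes the result above: normalize(dediac(avrs)) gives "الرحمن", while dediac(normalize(avrs)) gives "الرحمان". A plausible reading is that dediacritization drops the superscript (dagger) alif, whereas normalization rewrites it as a full alif. The sketch below reproduces that effect with two hypothetical stand-in functions in plain Julia. It is an assumption about the mechanism, not Yunir.jl's code.

```julia
# Illustrative only: hypothetical stand-ins, not Yunir.jl's dediac or normalize.
drop_marks(s) = filter(c -> c != '\u0670', s)                   # treat superscript alef as a diacritic
dagger_to_alif(s) = map(c -> c == '\u0670' ? '\u0627' : c, s)   # rewrite it as a full alif

word = "\u0631\u062D\u0645\u0670\u0646"   # ra, hah, meem, superscript alef, noon

a = dagger_to_alif(drop_marks(word))   # mark removed first, so nothing is left to rewrite
b = drop_marks(dagger_to_alif(word))   # mark rewritten first, so the alif survives

println(length(a), " ", length(b))     # prints 4 5
```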

Specific characters can be normalized by passing one or more character names:

julia> avrs1 = verses(tnzldata[2][4])[1]
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ أُنزِلَ إِلَيْكَ وَمَآ أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(avrs1, :alif_maddah)
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَا أُنزِلَ إِلَيْكَ وَمَا أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(avrs1, :alif_hamza_above)
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ اُنزِلَ إِلَيْكَ وَمَآ اُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(avrs, [:alif_khanjareeya, :hamzat_wasl])
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"

Or using the CorpusData instead of the TanzilData,

julia> avrs2 = arabic(verses(crpsdata[2][15])[1])
"ٱللَّهُ يَسْتَهْزِئُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ"
julia> normalize(avrs2, :ya_hamza_above)
"ٱللَّهُ يَسْتَهْزِيُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ"
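
Selective normalization of this kind can be thought of as a table of named replacement rules, from which only the requested ones are applied. The following is a minimal sketch of that idea in plain Julia. The function normalize_only and the RULES table are hypothetical. The rule names mirror the symbols used above, but the code points assigned to them are assumptions rather than Yunir.jl's internal mapping.

```julia
# Illustrative only: a hypothetical rule table, not Yunir.jl's internal mapping.
const RULES = Dict(
    :alif_maddah      => ('\u0622' => '\u0627'),   # alef with madda above -> alef
    :alif_hamza_above => ('\u0623' => '\u0627'),   # alef with hamza above -> alef
    :hamzat_wasl      => ('\u0671' => '\u0627'),   # alef wasla -> alef
    :ya_hamza_above   => ('\u0626' => '\u064A'),   # yeh with hamza above -> yeh
)

# Apply only the named rules; every other character is left as it is.
function normalize_only(s::AbstractString, names::AbstractVector{Symbol})
    table = Dict(RULES[n] for n in names)
    return map(c -> get(table, c, c), s)
end
normalize_only(s::AbstractString, name::Symbol) = normalize_only(s, [name])

word = "\u0623\u0646\u0632\u0644"   # alef with hamza above, noon, zain, lam
println(normalize_only(word, :alif_hamza_above) == "\u0627\u0646\u0632\u0644")   # prints true
```

Keeping the rules in a dictionary keyed by name means a single symbol and a vector of symbols can share one code path, as in the calls shown above.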