Data Processing
Yunir.jl provides utilities for data preprocessing in Arabic Natural Language Processing (ANLP), such as character dediacritization and character normalization.
Character Dediacritization
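The dediac function removes diacritics. It is intended to work for Arabic, Buckwalter, and custom transliterations. Note that in the session below the Arabic input is dediacritized, while the Buckwalter input is returned unchanged by a plain dediac call, so the final comparison evaluates to false.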
julia> using QuranTree
julia> using Yunir
julia> @transliterator :default
julia> crps, tnzl = load(QuranData());
julia> crpsdata = table(crps);
julia> tnzldata = table(tnzl);
julia> avrs = verses(tnzldata[1][1])[1]
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
julia> dediac(avrs)
"بسم ٱلله ٱلرحمن ٱلرحيم"
julia> bvrs = verses(crpsdata[1][1])[1]
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> dediac(bvrs)
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> dediac(avrs) === arabic(dediac(bvrs))
false
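Conceptually, dediacritizing Arabic text amounts to dropping the combining marks and keeping the base letters. The following is a minimal sketch in plain Julia, not Yunir.jl's implementation: the name strip_diacritics and the chosen code point set (U+064B to U+0652 plus the dagger alif U+0670) are assumptions made for illustration.

```julia
# Hypothetical helper, not part of Yunir.jl.
# Assumed diacritic set: fathatan..sukun (U+064B-U+0652) and the
# superscript (dagger) alif U+0670.
const SKETCH_DIACRITICS = Set(vcat(collect('\u064B':'\u0652'), ['\u0670']))

# Keep every character that is not in the diacritic set.
strip_diacritics(s::AbstractString) = filter(c -> !(c in SKETCH_DIACRITICS), s)

# First word of the verse above: ba+kasra, sin+sukun, mim+kasra.
word = "\u0628\u0650\u0633\u0652\u0645\u0650"
bare = strip_diacritics(word)   # only ba, sin, mim remain
println(length(word), " -> ", length(bare))
```

The hamzat wasl (U+0671) is deliberately left out of the set, which matches the dediac output above where that letter is preserved.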
Text in a custom transliteration can also be passed to dediac, as shown below:
julia> old_keys = collect(keys(BW_ENCODING));
julia> new_vals = reverse(collect(values(BW_ENCODING)));
julia> my_encoder = Dict(old_keys .=> new_vals);
julia> @transliterator my_encoder "MyEncoder"
julia> encode(avrs)
"rjsFKj 3,,v1Yj 3,bv1}FK1\"#j 3,bv1}j]Kj"
julia> arabic(encode(avrs))
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
julia> dediac(encode(avrs))
"rjsFKj 3,,v1Yj 3,bv1}FK1\"#j 3,bv1}j]Kj"
julia> arabic(dediac(encode(avrs)))
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
To reset the transliteration to the default:
julia> @transliterator :default
julia> encode(avrs)
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> dediac(encode(avrs))
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
Character Normalization
Normalization is done using the normalize function. It works for Arabic, Buckwalter and other custom transliterations. For example, the following normalizes the avrs above:
julia> normalize(avrs)
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"
julia> normalize(dediac(avrs))
"بسم الله الرحمن الرحيم"
julia> dediac(normalize(avrs))
"بسم الله الرحمان الرحيم"
julia> avrs |> dediac |> normalize |> encode # using pipe notation
"bsm Allh AlrHmn AlrHym"
Specific characters can be normalized:
julia> avrs1 = verses(tnzldata[2][4])[1]
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ أُنزِلَ إِلَيْكَ وَمَآ أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(avrs1, :alif_maddah)
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَا أُنزِلَ إِلَيْكَ وَمَا أُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(avrs1, :alif_hamza_above)
"وَٱلَّذِينَ يُؤْمِنُونَ بِمَآ اُنزِلَ إِلَيْكَ وَمَآ اُنزِلَ مِن قَبْلِكَ وَبِٱلْءَاخِرَةِ هُمْ يُوقِنُونَ"
julia> normalize(avrs, [:alif_khanjareeya, :hamzat_wasl])
"بِسْمِ اللَّهِ الرَّحْمَانِ الرَّحِيمِ"
Or using the CorpusData instead of the TanzilData:
julia> avrs2 = arabic(verses(crpsdata[2][15])[1])
"ٱللَّهُ يَسْتَهْزِئُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ"
julia> normalize(avrs2, :ya_hamza_above)
"ٱللَّهُ يَسْتَهْزِيُ بِهِمْ وَيَمُدُّهُمْ فِى طُغْيَٰنِهِمْ يَعْمَهُونَ"