CAMeL Tools
This section shows how to use CAMeL Tools, developed at New York University Abu Dhabi, from Julia. CAMeL Tools is a suite of tools for Arabic Natural Language Processing, and arguably the most feature-rich library to date for general-purpose Arabic NLP. To install the library, follow the instructions here.
Setting up
On macOS, however, it is enough to run the following in the terminal:
pip3 install camel-tools
Then, download the necessary data as follows:
camel_data light
This tutorial uses only the light version of the CAMeL data, which is around 19 MB.
Julia PyCall.jl
Julia can interoperate with Python through the library PyCall.jl. To install, run the following:
julia> using Pkg
julia> Pkg.add("PyCall")
Resolving package versions...
No Changes to `~/work/QuranTree.jl/QuranTree.jl/docs/Project.toml`
No Changes to `~/work/QuranTree.jl/QuranTree.jl/docs/Manifest.toml`
Character Dediacritization
At this point, Julia can connect to Python, and CAMeL Tools can be loaded via the @pyimport macro. For example, the following loads the dediac and normalize modules of the library:
julia> using PyCall
julia> @pyimport camel_tools.utils.dediac as camel_dediac
julia> @pyimport camel_tools.utils.normalize as camel_normalize
If Python is not found, specify the path to the Python executable you want to use through the PYTHON environment variable, then rebuild PyCall.jl. For example, after installing PyCall.jl:
ENV["PYTHON"] = "/usr/bin/python3"
Pkg.build("PyCall")
The last line rebuilds PyCall.jl, which then remembers the path.
Make sure the Python you point to is the one in which CAMeL Tools was installed.
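One way to confirm that PyCall.jl is pointing at the intended Python is to inspect the version and program name it was built against, and then try importing the package. The following is a minimal sketch, assuming PyCall.jl is already installed and built; the printed messages are illustrative only.

```julia
using PyCall

# Version and path of the Python that PyCall.jl was built against.
println("Python version: ", PyCall.pyversion)
println("Python program: ", PyCall.pyprogramname)

# Importing the package is the most direct check; a failure here suggests
# that CAMeL Tools was installed under a different Python.
try
    pyimport("camel_tools")
    println("camel_tools is importable")
catch err
    println("camel_tools could not be imported: ", err)
end
```

If the import fails, either install camel-tools with the pip that belongs to the reported Python, or set ENV["PYTHON"] to the interpreter that already has it and rebuild PyCall.jl.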
Let's use CAMeL's dediac module and compare its results with QuranTree.jl's built-in dediac function.
julia> using QuranTree
julia> crps, tnzl = load(QuranData());
julia> crpsdata = table(crps);
julia> tnzldata = table(tnzl);
julia> avrs1 = verses(tnzldata[1][1])[1]
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
julia> dediac(avrs1)
"بسم ٱلله ٱلرحمٰن ٱلرحيم"
Now, using CAMeL Tools, we get the following:
julia> camel_dediac.dediac_ar(avrs1)
"بسم ٱلله ٱلرحمن ٱلرحيم"
The difference is in the Alif Khanjareeya (dagger alif), which QuranTree.jl currently does not treat as a diacritic but as one of the characters to be normalized.
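The same difference can be reproduced at the character level. The following is a minimal sketch in plain Julia, assuming the vowel marks to strip are U+064B through U+0652 and the dagger alif is U+0670. The names `HARAKAT`, `DAGGER_ALIF` and `strip_chars` are hypothetical and chosen for illustration; this is not how either library is actually implemented.

```julia
# Illustrative only: hypothetical helpers, not the QuranTree.jl or CAMeL Tools code.
const HARAKAT = Set('\u064b':'\u0652')   # fathatan through sukun
const DAGGER_ALIF = '\u0670'             # Alif Khanjareeya

# Remove every character of `s` that belongs to `drop`.
strip_chars(s::AbstractString, drop) = filter(c -> !(c in drop), s)

# The word "ٱلرَّحْمَٰنِ" written with explicit code points.
word = "\u0671\u0644\u0631\u0651\u064e\u062d\u0652\u0645\u064e\u0670\u0646\u0650"

keep_dagger = strip_chars(word, HARAKAT)
drop_dagger = strip_chars(word, union(HARAKAT, Set([DAGGER_ALIF])))

println(keep_dagger)   # dagger alif is kept
println(drop_dagger)   # dagger alif is removed as well
```

The first result keeps the dagger alif, as in the QuranTree.jl output above, while the second drops it, as in the CAMeL Tools output.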
Let's try this on CorpusData as well, to see how it handles Buckwalter dediacritization:
julia> vrs1 = verses(crpsdata[1][1])[1]
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"
julia> dediac(vrs1)
"bsm {llh {lrHm`n {lrHym"
julia> camel_dediac.dediac_bw(vrs1)
"bsm {llh {lrHmn {lrHym"
Character Normalization
To normalize, QuranTree.jl takes an argument that specifies the character to normalize. In CAMeL Tools, however, the character is part of the function name:
julia> avrs2 = verses(tnzldata[2][3])[1]
"ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰةَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ"
julia> normalize(avrs2, :ta_marbuta)
"ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰهَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ"
julia> camel_normalize.normalize_teh_marbuta_ar(avrs2)
"ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰهَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ"
Another example, normalizing over the Buckwalter encoding:
julia> vrs2 = verses(crpsdata[2][3])[1]
"{l~a*iyna yu&ominuwna bi{logayobi wayuqiymuwna {lS~alaw`pa wamim~aA razaqona`humo yunfiquwna"
julia> normalize(vrs2, :ta_marbuta)
"{l~a*iyna yu&ominuwna bi{logayobi wayuqiymuwna {lS~alaw`ha wamim~aA razaqona`humo yunfiquwna"
julia> camel_normalize.normalize_teh_marbuta_bw(vrs2)
"{l~a*iyna yu&ominuwna bi{logayobi wayuqiymuwna {lS~alaw`ha wamim~aA razaqona`humo yunfiquwna"