CAMeL Tools

In this section, we will explore how to use CAMeL Tools, developed at New York University Abu Dhabi. CAMeL Tools is a suite of tools for Arabic Natural Language Processing, and arguably the most feature-rich library to date for general-purpose Arabic NLP. To install the library, follow the installation instructions in the CAMeL Tools documentation.

Setting up

On macOS, you can simply run the following in the terminal:

pip3 install camel-tools

Then, download the necessary data as follows:

camel_data light

For this tutorial, we are going to use only the light version of the CAMeL data, which is around 19 MB.

Julia PyCall.jl

Julia can interoperate with Python through the PyCall.jl package. To install it, run the following:

julia> using Pkg

julia> Pkg.add("PyCall")
  Resolving package versions...
No Changes to `~/work/QuranTree.jl/QuranTree.jl/docs/Project.toml`
No Changes to `~/work/QuranTree.jl/QuranTree.jl/docs/Manifest.toml`

Character Dediacritization

At this point, Julia can connect to Python, and CAMeL Tools can be loaded via the macro @pyimport. For example, the following loads the dediac and normalize modules of the library:

julia> using PyCall

julia> @pyimport camel_tools.utils.dediac as camel_dediac

julia> @pyimport camel_tools.utils.normalize as camel_normalize

Important

If Python is not found, you need to tell PyCall.jl which Python installation (and therefore which version) to use by setting the PYTHON environment variable. After installing PyCall.jl, specify the path, for example:

ENV["PYTHON"] = "/usr/bin/python3"
Pkg.build("PyCall")

The last line rebuilds PyCall.jl, which then remembers this path for future sessions.

Important

Make sure the Python installation you set up is the one in which CAMeL Tools was installed.
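One way to check the two notes above is to ask PyCall.jl which Python it is bound to and then try importing CAMeL Tools from it. This is a minimal sketch, assuming PyCall.jl has already been built; it only reads the PyCall.pyprogramname and PyCall.pyversion globals and calls pyimport, and the printed messages are illustrative.

```julia
using PyCall

# Which Python program and version did PyCall bind to when it was built?
println("Python program: ", PyCall.pyprogramname)
println("Python version: ", PyCall.pyversion)

# Try importing CAMeL Tools from that Python. A failure here usually means
# the package was installed into a different Python installation.
try
    pyimport("camel_tools")
    println("camel_tools is importable")
catch err
    println("camel_tools could not be imported: ", err)
end
```

If the import fails, either install camel-tools into the Python shown above, or point ENV["PYTHON"] at the installation that has it and rebuild PyCall.jl.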

Let's use the camel_dediac module and compare its results with QuranTree.jl's built-in dediac function.

julia> using QuranTree

julia> crps, tnzl = load(QuranData());

julia> crpsdata = table(crps);

julia> tnzldata = table(tnzl);

julia> avrs1 = verses(tnzldata[1][1])[1]
"بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"

julia> dediac(avrs1)
"بسم ٱلله ٱلرحمٰن ٱلرحيم"

Now, using CAMeL Tools, we get the following:

julia> camel_dediac.dediac_ar(avrs1)
"بسم ٱلله ٱلرحمن ٱلرحيم"

The difference lies in the Alif Khanjareeya (dagger alif): at the moment, QuranTree.jl does not treat it as a diacritic but as one of the characters to be normalized, so dediac leaves it in place, whereas CAMeL Tools removes it.
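To make the difference concrete, here is a small self-contained sketch in plain Julia. It is not the implementation used by QuranTree.jl or CAMeL Tools: strip_marks and its drop_dagger_alif keyword are hypothetical names introduced only for illustration, and the set of marks is assumed to be the Unicode range U+064B to U+0652.

```julia
# Tanween, short vowels, shadda and sukun: U+064B through U+0652.
const HARAKAT = '\u064B':'\u0652'
const DAGGER_ALIF = '\u0670'   # Alif Khanjareeya (superscript alif)

# Hypothetical helper: drop the harakat and, optionally, the dagger alif too.
function strip_marks(text::AbstractString; drop_dagger_alif::Bool=false)
    filter(text) do c
        !(c in HARAKAT) && !(drop_dagger_alif && c == DAGGER_ALIF)
    end
end

word = "ٱلرَّحْمَٰنِ"
println(strip_marks(word))                         # dagger alif kept
println(strip_marks(word; drop_dagger_alif=true))  # dagger alif removed
```

With the default setting, the dagger alif survives, as in the dediac output above; with drop_dagger_alif=true it is removed, as in the dediac_ar output.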

Let's try this on CorpusData as well, to see how it handles Buckwalter dediacritization:

julia> vrs1 = verses(crpsdata[1][1])[1]
"bisomi {ll~ahi {lr~aHoma`ni {lr~aHiymi"

julia> dediac(vrs1)
"bsm {llh {lrHm`n {lrHym"

julia> camel_dediac.dediac_bw(vrs1)
"bsm {llh {lrHmn {lrHym"

Character Normalization

To normalize, QuranTree.jl takes an argument specifying which character to normalize. In CAMeL Tools, by contrast, the character is part of the function name:

julia> avrs2 = verses(tnzldata[2][3])[1]
"ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰةَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ"

julia> normalize(avrs2, :ta_marbuta)
"ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰهَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ"

julia> camel_normalize.normalize_teh_marbuta_ar(avrs2)
"ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰهَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ"

Another example, normalizing over the Buckwalter encoding:

julia> vrs2 = verses(crpsdata[2][3])[1]
"{l~a*iyna yu&ominuwna bi{logayobi wayuqiymuwna {lS~alaw`pa wamim~aA razaqona`humo yunfiquwna"

julia> normalize(vrs2, :ta_marbuta)
"{l~a*iyna yu&ominuwna bi{logayobi wayuqiymuwna {lS~alaw`ha wamim~aA razaqona`humo yunfiquwna"

julia> camel_normalize.normalize_teh_marbuta_bw(vrs2)
"{l~a*iyna yu&ominuwna bi{logayobi wayuqiymuwna {lS~alaw`ha wamim~aA razaqona`humo yunfiquwna"
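In both encodings, the outputs above differ from the inputs in a single character: the teh marbuta becomes a heh (ة to ه in Arabic script, p to h in Buckwalter). The following plain-Julia sketch reproduces just that substitution. The two helper functions are hypothetical and written only for illustration; they are not how either library implements normalization.

```julia
# Hypothetical helpers, for illustration only (not part of either library).
teh_marbuta_to_heh_ar(s::AbstractString) = replace(s, '\u0629' => '\u0647')  # ة → ه
teh_marbuta_to_heh_bw(s::AbstractString) = replace(s, 'p' => 'h')            # Buckwalter p → h

ar = "ٱلصَّلَوٰةَ"
bw = "{lS~alaw`pa"

println(teh_marbuta_to_heh_ar(ar))
println(teh_marbuta_to_heh_bw(bw))
```

All other characters, including the diacritics, are left untouched, which matches the behaviour seen in the examples above.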