Getting Started
The library includes two datasets: the Quranic Arabic Corpus and the Tanzil data. To load them, run the following:
julia> using QuranTree
julia> data = QuranData()
QuranData(QuranTree.FilePaths("/home/runner/work/QuranTree.jl/QuranTree.jl/src/../data/quranic-corpus-morphology-0.4.txt", "/home/runner/work/QuranTree.jl/QuranTree.jl/src/../data/quran-uthmani-final.txt"))
julia> crps, tnzl = load(data);
QuranData() returns a struct containing the default file paths of the data. The load function returns a tuple holding both the Quranic Arabic Corpus and the Tanzil data. The loaded data is stored in an immutable (read-only) array, so users cannot change it. This is reflected in the type of each object, as shown below:
julia> crps
(CorpusRaw) 128276-element ReadOnlyArrays.ReadOnlyArray{String, 1, Vector{String}}:
 "# PLEASE DO NOT REMOVE OR CHANGE THIS COPYRIGHT BLOCK"
 "#===================================================================="
 "#"
 "# Quranic Arabic Corpus (morphology, version 0.4)"
 "# Copyright (C) 2011 Kais Dukes"
 "# License: GNU General Public License"
 "#"
 "# The Quranic Arabic Corpus includes syntactic and morphological"
 "# annotation of the Quran, and builds on the verified Arabic text"
 "# distributed by the Tanzil project."
 ⋮
 "(114:5:4:1)\tSuduwri\tN\tSTEM|POS:N|LEM:Sador|ROOT:Sdr|MP|GEN"
 "(114:5:5:1)\t{l\tDET\tPREFIX|Al+"
 "(114:5:5:2)\tn~aAsi\tN\tSTEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
 "(114:6:1:1)\tmina\tP\tSTEM|POS:P|LEM:min"
 "(114:6:2:1)\t{lo\tDET\tPREFIX|Al+"
 "(114:6:2:2)\tjin~api\tN\tSTEM|POS:N|LEM:jin~ap|ROOT:jnn|F|GEN"
 "(114:6:3:1)\twa\tCONJ\tPREFIX|w:CONJ+"
 "(114:6:3:2)\t{l\tDET\tPREFIX|Al+"
 "(114:6:3:3)\tn~aAsi\tN\tSTEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
julia> tnzl
(TanzilRaw) 6266-element ReadOnlyArrays.ReadOnlyArray{String, 1, Vector{String}}:
 "1|1|بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
 "1|2|ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ"
 "1|3|ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
 "1|4|مَٰلِكِ يَوْمِ ٱلدِّينِ"
 "1|5|إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ"
 "1|6|ٱهْدِنَا ٱلصِّرَٰطَ ٱلْمُسْتَقِيمَ"
 "1|7|صِرَٰطَ ٱلَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ ٱلْمَغْضُوبِ عَلَيْهِمْ وَلَا ٱلضَّآلِّينَ"
 "2|1|بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ الٓمٓ"
 "2|2|ذَٰلِكَ ٱلْكِتَٰبُ لَا رَيْبَ فِيهِ هُدًى لِّلْمُتَّقِينَ"
 "2|3|ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰةَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ"
 ⋮
 "# track of changes."
 "#"
 "# - This copyright notice shall be included in all verbatim copies "
 "# of the text, and shall be reproduced appropriately in all files "
 "# derived from or containing substantial portion of this text."
 "#"
 "# Please check updates at: http://tanzil.net/updates/"
 "# "
 "#===================================================================="
To parse this raw data, use the table function:
julia> crpsdata = table(crps);
julia> tnzldata = table(tnzl);
julia> crpsdata
Quranic Arabic Corpus (morphology) (C) 2011 Kais Dukes
128219×7 DataFrame
    Row │ chapter  verse  word   part   form        tag     features            ⋯
        │ Int64    Int64  Int64  Int64  String      String  String              ⋯
────────┼────────────────────────────────────────────────────────────────────────
      1 │       1      1      1      1  bi          P       PREFIX|bi+          ⋯
      2 │       1      1      1      2  somi        N       STEM|POS:N|LEM:{so
      3 │       1      1      2      1  {ll~ahi     PN      STEM|POS:PN|LEM:{l
      4 │       1      1      3      1  {l          DET     PREFIX|Al+
      5 │       1      1      3      2  r~aHoma`ni  ADJ     STEM|POS:ADJ|LEM:r  ⋯
      6 │       1      1      4      1  {l          DET     PREFIX|Al+
      7 │       1      1      4      2  r~aHiymi    ADJ     STEM|POS:ADJ|LEM:r
      8 │       1      2      1      1  {lo         DET     PREFIX|Al+
   ⋮    │    ⋮       ⋮      ⋮      ⋮        ⋮         ⋮             ⋮           ⋱
 128213 │     114      5      5      2  n~aAsi      N       STEM|POS:N|LEM:n~a  ⋯
 128214 │     114      6      1      1  mina        P       STEM|POS:P|LEM:min
 128215 │     114      6      2      1  {lo         DET     PREFIX|Al+
 128216 │     114      6      2      2  jin~api     N       STEM|POS:N|LEM:jin
 128217 │     114      6      3      1  wa          CONJ    PREFIX|w:CONJ+      ⋯
 128218 │     114      6      3      2  {l          DET     PREFIX|Al+
 128219 │     114      6      3      3  n~aAsi      N       STEM|POS:N|LEM:n~a
                                                1 column and 128204 rows omitted
julia> tnzldata
Tanzil Quran Text (Uthmani) (C) 2008-2010 Tanzil.net
6236×3 DataFrame
  Row │ chapter  verse  form
      │ Int64    Int64  String
──────┼───────────────────────────────────────────────────
    1 │       1      1  بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ
    2 │       1      2  ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ
    3 │       1      3  ٱلرَّحْمَٰنِ ٱلرَّحِيمِ
    4 │       1      4  مَٰلِكِ يَوْمِ ٱلدِّينِ
    5 │       1      5  إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ
    6 │       1      6  ٱهْدِنَا ٱلصِّرَٰطَ ٱلْمُسْتَقِيمَ
    7 │       1      7  صِرَٰطَ ٱلَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ ٱلْمَغْضُو…
    8 │       2      1  بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ الٓمٓ
  ⋮   │    ⋮       ⋮    ⋮
 6230 │     113      5  وَمِن شَرِّ حَاسِدٍ إِذَا حَسَدَ
 6231 │     114      1  بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ قُلْ أَعُوذُ بِ…
 6232 │     114      2  مَلِكِ ٱلنَّاسِ
 6233 │     114      3  إِلَٰهِ ٱلنَّاسِ
 6234 │     114      4  مِن شَرِّ ٱلْوَسْوَاسِ ٱلْخَنَّاسِ
 6235 │     114      5  ٱلَّذِى يُوَسْوِسُ فِى صُدُورِ ٱلنَّاسِ
 6236 │     114      6  مِنَ ٱلْجِنَّةِ وَٱلنَّاسِ
                                    6221 rows omitted
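To make the relation between a raw line and a table row concrete, here is a minimal sketch in plain Julia. It only illustrates the two line formats shown in the raw output above and is not QuranTree.jl's actual parser; all variable names are made up for the example.

```julia
# Illustration only: this is NOT how the table function is implemented.
# A raw corpus line holds location, form, tag and features, separated by tabs.
corpus_line = "(114:6:3:3)\tn~aAsi\tN\tSTEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
location, form, tag, features = split(corpus_line, '\t')

# The location "(114:6:3:3)" encodes chapter:verse:word:part.
chapter, verse, word, part = parse.(Int, split(strip(location, ['(', ')']), ':'))

# A raw Tanzil line holds chapter, verse and the verse text, separated by '|'.
tanzil_line = "1|3|ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
tchapter, tverse, ttext = split(tanzil_line, '|'; limit = 3)

println((chapter, verse, word, part, form, tag))
println((parse(Int, tchapter), parse(Int, tverse)))
```

The four integers and the three string fields correspond to the seven columns of the corpus table, and the Tanzil fields to its three columns.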
The resulting tables are of type CorpusData and TanzilData, respectively. Each wraps a DataFrames.jl DataFrame, which can be accessed with the @data macro or the .data property (for example, @data crpsdata or crpsdata.data).
Manipulating the Table
As mentioned above, the table is based on DataFrames.jl's DataFrame. Therefore, any data manipulation is done through the DataFrames.jl API. To access the underlying data, use the .data property or the @data macro:
julia> crpstbl = @data crpsdata; # or crpsdata.data
julia> tnzltbl = @data tnzldata; # or tnzldata.data
julia> crpstbl
128219×7 DataFrame
    Row │ chapter  verse  word   part   form        tag     features            ⋯
        │ Int64    Int64  Int64  Int64  String      String  String              ⋯
────────┼────────────────────────────────────────────────────────────────────────
      1 │       1      1      1      1  bi          P       PREFIX|bi+          ⋯
      2 │       1      1      1      2  somi        N       STEM|POS:N|LEM:{so
      3 │       1      1      2      1  {ll~ahi     PN      STEM|POS:PN|LEM:{l
      4 │       1      1      3      1  {l          DET     PREFIX|Al+
      5 │       1      1      3      2  r~aHoma`ni  ADJ     STEM|POS:ADJ|LEM:r  ⋯
      6 │       1      1      4      1  {l          DET     PREFIX|Al+
      7 │       1      1      4      2  r~aHiymi    ADJ     STEM|POS:ADJ|LEM:r
      8 │       1      2      1      1  {lo         DET     PREFIX|Al+
   ⋮    │    ⋮       ⋮      ⋮      ⋮        ⋮         ⋮             ⋮           ⋱
 128213 │     114      5      5      2  n~aAsi      N       STEM|POS:N|LEM:n~a  ⋯
 128214 │     114      6      1      1  mina        P       STEM|POS:P|LEM:min
 128215 │     114      6      2      1  {lo         DET     PREFIX|Al+
 128216 │     114      6      2      2  jin~api     N       STEM|POS:N|LEM:jin
 128217 │     114      6      3      1  wa          CONJ    PREFIX|w:CONJ+      ⋯
 128218 │     114      6      3      2  {l          DET     PREFIX|Al+
 128219 │     114      6      3      3  n~aAsi      N       STEM|POS:N|LEM:n~a
                                                1 column and 128204 rows omitted
julia> tnzltbl
6236×3 DataFrame
  Row │ chapter  verse  form
      │ Int64    Int64  String
──────┼───────────────────────────────────────────────────
    1 │       1      1  بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ
    2 │       1      2  ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ
    3 │       1      3  ٱلرَّحْمَٰنِ ٱلرَّحِيمِ
    4 │       1      4  مَٰلِكِ يَوْمِ ٱلدِّينِ
    5 │       1      5  إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ
    6 │       1      6  ٱهْدِنَا ٱلصِّرَٰطَ ٱلْمُسْتَقِيمَ
    7 │       1      7  صِرَٰطَ ٱلَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ ٱلْمَغْضُو…
    8 │       2      1  بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ الٓمٓ
  ⋮   │    ⋮       ⋮    ⋮
 6230 │     113      5  وَمِن شَرِّ حَاسِدٍ إِذَا حَسَدَ
 6231 │     114      1  بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ قُلْ أَعُوذُ بِ…
 6232 │     114      2  مَلِكِ ٱلنَّاسِ
 6233 │     114      3  إِلَٰهِ ٱلنَّاسِ
 6234 │     114      4  مِن شَرِّ ٱلْوَسْوَاسِ ٱلْخَنَّاسِ
 6235 │     114      5  ٱلَّذِى يُوَسْوِسُ فِى صُدُورِ ٱلنَّاسِ
 6236 │     114      6  مِنَ ٱلْجِنَّةِ وَٱلنَّاسِ
                                    6221 rows omitted
Note that crpsdata and crpstbl have different types (as do tnzldata and tnzltbl), as shown below:
julia> typeof(crpsdata)
CorpusData
julia> typeof(crpstbl)
DataFrames.DataFrame
From here, any data manipulation is done using the DataFrames.jl API. For example, the following selects the features column of crpstbl:
julia> using DataFrames
julia> crpstbl[!, :features]
128219-element Vector{String}:
 "PREFIX|bi+"
 "STEM|POS:N|LEM:{som|ROOT:smw|M|GEN"
 "STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN"
 "PREFIX|Al+"
 "STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN"
 "PREFIX|Al+"
 "STEM|POS:ADJ|LEM:r~aHiym|ROOT:rHm|MS|GEN"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:Hamod|ROOT:Hmd|M|NOM"
 "PREFIX|l:P+"
 ⋮
 "STEM|POS:N|LEM:Sador|ROOT:Sdr|MP|GEN"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
 "STEM|POS:P|LEM:min"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:jin~ap|ROOT:jnn|F|GEN"
 "PREFIX|w:CONJ+"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
julia> # or equivalent to crpsdata.data[!, :features]
DataFrames.jl must be installed to run the code above:
using Pkg
Pkg.add("DataFrames")
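The column selection above can be tried without loading the full corpus. The sketch below uses a small hand-built DataFrame (a hypothetical stand-in named toy, with values copied from the first rows of the table) to show the difference between the `!` and `:` row selectors in DataFrames.jl.

```julia
using DataFrames

# Toy stand-in for crpstbl; the name and the reduced column set are assumptions.
toy = DataFrame(
    form = ["bi", "somi", "{ll~ahi"],
    tag = ["P", "N", "PN"],
    features = [
        "PREFIX|bi+",
        "STEM|POS:N|LEM:{som|ROOT:smw|M|GEN",
        "STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN",
    ],
)

col_stored = toy[!, :features]  # the stored column itself, no copy is made
col_copy = toy[:, :features]    # an independent copy of the column
col_prop = toy.features         # property access, same object as toy[!, :features]

println(col_stored === col_prop)  # identical object
println(col_copy == col_stored)   # equal contents
println(col_copy === col_stored)  # but not the same object
```

Since `!` hands back the stored vector, mutating that vector also changes the table; use `:` when an independent copy is needed.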
To keep only the tokens whose features start with PREFIX, Julia's built-in occursin function can be used:
julia> filter(t -> occursin(r"^PREFIX", t.features), crpstbl)
28670×7 DataFrame
   Row │ chapter  verse  word   part   form    tag     features
       │ Int64    Int64  Int64  Int64  String  String  String
───────┼────────────────────────────────────────────────────────────────
     1 │       1      1      1      1  bi      P       PREFIX|bi+
     2 │       1      1      3      1  {l      DET     PREFIX|Al+
     3 │       1      1      4      1  {l      DET     PREFIX|Al+
     4 │       1      2      1      1  {lo     DET     PREFIX|Al+
     5 │       1      2      2      1  li      P       PREFIX|l:P+
     6 │       1      2      4      1  {lo     DET     PREFIX|Al+
     7 │       1      3      1      1  {l      DET     PREFIX|Al+
     8 │       1      3      2      1  {l      DET     PREFIX|Al+
   ⋮   │    ⋮       ⋮      ⋮      ⋮      ⋮       ⋮           ⋮
 28664 │     114      3      2      1  {l      DET     PREFIX|Al+
 28665 │     114      4      3      1  {lo     DET     PREFIX|Al+
 28666 │     114      4      4      1  {lo     DET     PREFIX|Al+
 28667 │     114      5      5      1  {l      DET     PREFIX|Al+
 28668 │     114      6      2      1  {lo     DET     PREFIX|Al+
 28669 │     114      6      3      1  wa      CONJ    PREFIX|w:CONJ+
 28670 │     114      6      3      2  {l      DET     PREFIX|Al+
                                                    28655 rows omitted
julia> # or equivalent to filter(t -> occursin(r"^PREFIX", t.features), crpsdata.data)
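The same kind of row filter can be written in other ways with DataFrames.jl. The sketch below is one possible variant on a small hand-built DataFrame (the name toy and its values are a stand-in for crpstbl, copied from rows of the table above); it uses startswith instead of a regular expression and compares filter with subset.

```julia
using DataFrames

# Toy stand-in for crpstbl; the name and the reduced column set are assumptions.
toy = DataFrame(
    form = ["bi", "somi", "{l", "r~aHoma`ni"],
    tag = ["P", "N", "DET", "ADJ"],
    features = [
        "PREFIX|bi+",
        "STEM|POS:N|LEM:{som|ROOT:smw|M|GEN",
        "PREFIX|Al+",
        "STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN",
    ],
)

# Column-wise filter: the predicate receives only the features value of each row.
by_filter = filter(:features => f -> startswith(f, "PREFIX"), toy)

# subset with ByRow applies the same condition and returns a new DataFrame.
by_subset = subset(toy, :features => ByRow(f -> startswith(f, "PREFIX")))

println(by_filter == by_subset)
println(nrow(by_subset))
```

Passing the column name to filter avoids building a row object for every row, which tends to matter on a table with more than 100,000 rows.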
The main point here is that any data manipulation on CorpusData and TanzilData is done through the DataFrames.jl API.
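As a further illustration of working through the DataFrames.jl API, the sketch below counts tokens per part-of-speech tag with groupby and combine. It runs on a hand-built DataFrame (a hypothetical stand-in named toy, with tags taken from the first seven rows of the corpus table); with the real data the same call would be applied to crpsdata.data.

```julia
using DataFrames

# Toy stand-in for the first seven rows of the corpus table (assumed name: toy).
toy = DataFrame(
    form = ["bi", "somi", "{ll~ahi", "{l", "r~aHoma`ni", "{l", "r~aHiymi"],
    tag = ["P", "N", "PN", "DET", "ADJ", "DET", "ADJ"],
)

# Group the rows by tag and count the rows in each group.
counts = combine(groupby(toy, :tag), nrow => :count)

# Most frequent tags first.
sort!(counts, :count, rev = true)

println(counts)
```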