Getting Started

The library includes two datasets: the Quranic Arabic Corpus and the Tanzil data. To load them, run the following:

julia> using QuranTree
julia> data = QuranData()
QuranData(QuranTree.FilePaths("/home/runner/work/QuranTree.jl/QuranTree.jl/src/../data/quranic-corpus-morphology-0.4.txt", "/home/runner/work/QuranTree.jl/QuranTree.jl/src/../data/quran-uthmani-final.txt"))
julia> crps, tnzl = load(data);

QuranData() constructs a struct holding the default file paths of the data. The load function returns a tuple containing the Quranic Arabic Corpus and the Tanzil data. The loaded data is stored in an immutable (read-only) array, so users cannot change it. This is reflected in the type of the object, as shown below:

julia> crps
(CorpusRaw)
128276-element ReadOnlyArrays.ReadOnlyArray{String, 1, Vector{String}}:
 "# PLEASE DO NOT REMOVE OR CHANGE THIS COPYRIGHT BLOCK"
 "#===================================================================="
 "#"
 "#  Quranic Arabic Corpus (morphology, version 0.4)"
 "#  Copyright (C) 2011 Kais Dukes"
 "#  License: GNU General Public License"
 "#"
 "#  The Quranic Arabic Corpus includes syntactic and morphological"
 "#  annotation of the Quran, and builds on the verified Arabic text"
 "#  distributed by the Tanzil project."
 ⋮
 "(114:5:4:1)\tSuduwri\tN\tSTEM|POS:N|LEM:Sador|ROOT:Sdr|MP|GEN"
 "(114:5:5:1)\t{l\tDET\tPREFIX|Al+"
 "(114:5:5:2)\tn~aAsi\tN\tSTEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
 "(114:6:1:1)\tmina\tP\tSTEM|POS:P|LEM:min"
 "(114:6:2:1)\t{lo\tDET\tPREFIX|Al+"
 "(114:6:2:2)\tjin~api\tN\tSTEM|POS:N|LEM:jin~ap|ROOT:jnn|F|GEN"
 "(114:6:3:1)\twa\tCONJ\tPREFIX|w:CONJ+"
 "(114:6:3:2)\t{l\tDET\tPREFIX|Al+"
 "(114:6:3:3)\tn~aAsi\tN\tSTEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
julia> tnzl
(TanzilRaw)
6266-element ReadOnlyArrays.ReadOnlyArray{String, 1, Vector{String}}:
 "1|1|بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
 "1|2|ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ"
 "1|3|ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"
 "1|4|مَٰلِكِ يَوْمِ ٱلدِّينِ"
 "1|5|إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ"
 "1|6|ٱهْدِنَا ٱلصِّرَٰطَ ٱلْمُسْتَقِيمَ"
 "1|7|صِرَٰطَ ٱلَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ ٱلْمَغْضُوبِ عَلَيْهِمْ وَلَا ٱلضَّآلِّينَ"
 "2|1|بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ الٓمٓ"
 "2|2|ذَٰلِكَ ٱلْكِتَٰبُ لَا رَيْبَ فِيهِ هُدًى لِّلْمُتَّقِينَ"
 "2|3|ٱلَّذِينَ يُؤْمِنُونَ بِٱلْغَيْبِ وَيُقِيمُونَ ٱلصَّلَوٰةَ وَمِمَّا رَزَقْنَٰهُمْ يُنفِقُونَ"
 ⋮
 "# track of changes."
 "#"
 "# - This copyright notice shall be included in all verbatim copies "
 "# of the text, and shall be reproduced appropriately in all files "
 "# derived from or containing substantial portion of this text."
 "#"
 "# Please check updates at: http://tanzil.net/updates/"
 "# "
 "#===================================================================="

To parse the raw data, use the table function:

julia> crpsdata = table(crps);
julia> tnzldata = table(tnzl);
julia> crpsdata
Quranic Arabic Corpus (morphology)
(C) 2011 Kais Dukes

128219×7 DataFrame
    Row │ chapter  verse  word   part   form        tag     features            ⋯
        │ Int64    Int64  Int64  Int64  String      String  String              ⋯
────────┼───────────────────────────────────────────────────────────────────────
      1 │       1      1      1      1  bi          P       PREFIX|bi+          ⋯
      2 │       1      1      1      2  somi        N       STEM|POS:N|LEM:{so
      3 │       1      1      2      1  {ll~ahi     PN      STEM|POS:PN|LEM:{l
      4 │       1      1      3      1  {l          DET     PREFIX|Al+
      5 │       1      1      3      2  r~aHoma`ni  ADJ     STEM|POS:ADJ|LEM:r  ⋯
      6 │       1      1      4      1  {l          DET     PREFIX|Al+
      7 │       1      1      4      2  r~aHiymi    ADJ     STEM|POS:ADJ|LEM:r
      8 │       1      2      1      1  {lo         DET     PREFIX|Al+
   ⋮    │    ⋮       ⋮      ⋮      ⋮        ⋮         ⋮             ⋮           ⋱
 128213 │     114      5      5      2  n~aAsi      N       STEM|POS:N|LEM:n~a  ⋯
 128214 │     114      6      1      1  mina        P       STEM|POS:P|LEM:min
 128215 │     114      6      2      1  {lo         DET     PREFIX|Al+
 128216 │     114      6      2      2  jin~api     N       STEM|POS:N|LEM:jin
 128217 │     114      6      3      1  wa          CONJ    PREFIX|w:CONJ+      ⋯
 128218 │     114      6      3      2  {l          DET     PREFIX|Al+
 128219 │     114      6      3      3  n~aAsi      N       STEM|POS:N|LEM:n~a
                                                 1 column and 128204 rows omitted
julia> tnzldata
Tanzil Quran Text (Uthmani)
(C) 2008-2010 Tanzil.net

6236×3 DataFrame
  Row │ chapter  verse  form
      │ Int64    Int64  String
──────┼───────────────────────────────────────────────────
    1 │       1      1  بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ
    2 │       1      2  ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ
    3 │       1      3  ٱلرَّحْمَٰنِ ٱلرَّحِيمِ
    4 │       1      4  مَٰلِكِ يَوْمِ ٱلدِّينِ
    5 │       1      5  إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ
    6 │       1      6  ٱهْدِنَا ٱلصِّرَٰطَ ٱلْمُسْتَقِيمَ
    7 │       1      7  صِرَٰطَ ٱلَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ ٱلْمَغْضُو…
    8 │       2      1  بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ الٓمٓ
  ⋮   │    ⋮       ⋮                            ⋮
 6230 │     113      5  وَمِن شَرِّ حَاسِدٍ إِذَا حَسَدَ
 6231 │     114      1  بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ قُلْ أَعُوذُ بِ…
 6232 │     114      2  مَلِكِ ٱلنَّاسِ
 6233 │     114      3  إِلَٰهِ ٱلنَّاسِ
 6234 │     114      4  مِن شَرِّ ٱلْوَسْوَاسِ ٱلْخَنَّاسِ
 6235 │     114      5  ٱلَّذِى يُوَسْوِسُ فِى صُدُورِ ٱلنَّاسِ
 6236 │     114      6  مِنَ ٱلْجِنَّةِ وَٱلنَّاسِ
                                         6221 rows omitted

The resulting tables are of type CorpusData and TanzilData, respectively, and are built on top of DataFrames.jl's DataFrame, which can be accessed through the data property or the @data macro (for example, crpsdata.data or @data crpsdata).
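
To make the raw-to-table step more concrete, the sketch below splits one raw line from each dataset into the columns that appear in the parsed tables. It uses only Base Julia. The names parse_corpus_line and parse_tanzil_line are hypothetical helpers chosen for illustration; they are not part of QuranTree.jl, and the table function may work differently internally (for instance, it also has to skip the comment lines that start with #).

```julia
# Hypothetical helpers for illustration only; not QuranTree.jl's implementation.

# A corpus line has four tab-separated fields: location, form, tag, features.
function parse_corpus_line(line::AbstractString)
    location, form, tag, features = split(line, '\t')
    # The location looks like "(114:6:1:1)": chapter, verse, word, part.
    chapter, verse, word, part = parse.(Int, split(strip(location, ['(', ')']), ':'))
    return (chapter=chapter, verse=verse, word=word, part=part,
            form=String(form), tag=String(tag), features=String(features))
end

# A Tanzil line has three pipe-separated fields: chapter, verse, text.
function parse_tanzil_line(line::AbstractString)
    chapter, verse, form = split(line, '|'; limit=3)
    return (chapter=parse(Int, chapter), verse=parse(Int, verse), form=String(form))
end

# Both input lines are copied from the raw output shown earlier.
corpus_row = parse_corpus_line("(114:6:1:1)\tmina\tP\tSTEM|POS:P|LEM:min")
tanzil_row = parse_tanzil_line("1|4|مَٰلِكِ يَوْمِ ٱلدِّينِ")

println(corpus_row)
println(tanzil_row)
```

Each helper returns a named tuple whose fields mirror the column names of the corresponding table, which is one simple way to picture what a parsed row holds.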

Manipulating the Table

As mentioned above, the table is based on DataFrames.jl's DataFrame, so any data manipulation is done through the DataFrames.jl API. To access the underlying DataFrame, use the .data property or the @data macro:

julia> crpstbl = @data crpsdata; # or crpsdata.data
julia> tnzltbl = @data tnzldata; # or tnzldata.data
julia> crpstbl
128219×7 DataFrame
    Row │ chapter  verse  word   part   form        tag     features            ⋯
        │ Int64    Int64  Int64  Int64  String      String  String              ⋯
────────┼───────────────────────────────────────────────────────────────────────
      1 │       1      1      1      1  bi          P       PREFIX|bi+          ⋯
      2 │       1      1      1      2  somi        N       STEM|POS:N|LEM:{so
      3 │       1      1      2      1  {ll~ahi     PN      STEM|POS:PN|LEM:{l
      4 │       1      1      3      1  {l          DET     PREFIX|Al+
      5 │       1      1      3      2  r~aHoma`ni  ADJ     STEM|POS:ADJ|LEM:r  ⋯
      6 │       1      1      4      1  {l          DET     PREFIX|Al+
      7 │       1      1      4      2  r~aHiymi    ADJ     STEM|POS:ADJ|LEM:r
      8 │       1      2      1      1  {lo         DET     PREFIX|Al+
   ⋮    │    ⋮       ⋮      ⋮      ⋮        ⋮         ⋮             ⋮           ⋱
 128213 │     114      5      5      2  n~aAsi      N       STEM|POS:N|LEM:n~a  ⋯
 128214 │     114      6      1      1  mina        P       STEM|POS:P|LEM:min
 128215 │     114      6      2      1  {lo         DET     PREFIX|Al+
 128216 │     114      6      2      2  jin~api     N       STEM|POS:N|LEM:jin
 128217 │     114      6      3      1  wa          CONJ    PREFIX|w:CONJ+      ⋯
 128218 │     114      6      3      2  {l          DET     PREFIX|Al+
 128219 │     114      6      3      3  n~aAsi      N       STEM|POS:N|LEM:n~a
                                                 1 column and 128204 rows omitted
julia> tnzltbl
6236×3 DataFrame
  Row │ chapter  verse  form
      │ Int64    Int64  String
──────┼───────────────────────────────────────────────────
    1 │       1      1  بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ
    2 │       1      2  ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ
    3 │       1      3  ٱلرَّحْمَٰنِ ٱلرَّحِيمِ
    4 │       1      4  مَٰلِكِ يَوْمِ ٱلدِّينِ
    5 │       1      5  إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ
    6 │       1      6  ٱهْدِنَا ٱلصِّرَٰطَ ٱلْمُسْتَقِيمَ
    7 │       1      7  صِرَٰطَ ٱلَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ ٱلْمَغْضُو…
    8 │       2      1  بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ الٓمٓ
  ⋮   │    ⋮       ⋮                            ⋮
 6230 │     113      5  وَمِن شَرِّ حَاسِدٍ إِذَا حَسَدَ
 6231 │     114      1  بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ قُلْ أَعُوذُ بِ…
 6232 │     114      2  مَلِكِ ٱلنَّاسِ
 6233 │     114      3  إِلَٰهِ ٱلنَّاسِ
 6234 │     114      4  مِن شَرِّ ٱلْوَسْوَاسِ ٱلْخَنَّاسِ
 6235 │     114      5  ٱلَّذِى يُوَسْوِسُ فِى صُدُورِ ٱلنَّاسِ
 6236 │     114      6  مِنَ ٱلْجِنَّةِ وَٱلنَّاسِ
                                         6221 rows omitted

Note that crpsdata and crpstbl have different types (and likewise tnzldata and tnzltbl), as shown below:

julia> typeof(crpsdata)
CorpusData
julia> typeof(crpstbl)
DataFrames.DataFrame

From here, any data manipulation is done using the DataFrames.jl API. For example, the following selects the features column of crpstbl:

julia> using DataFrames
julia> crpstbl[!, :features]
128219-element Vector{String}:
 "PREFIX|bi+"
 "STEM|POS:N|LEM:{som|ROOT:smw|M|GEN"
 "STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN"
 "PREFIX|Al+"
 "STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN"
 "PREFIX|Al+"
 "STEM|POS:ADJ|LEM:r~aHiym|ROOT:rHm|MS|GEN"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:Hamod|ROOT:Hmd|M|NOM"
 "PREFIX|l:P+"
 ⋮
 "STEM|POS:N|LEM:Sador|ROOT:Sdr|MP|GEN"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
 "STEM|POS:P|LEM:min"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:jin~ap|ROOT:jnn|F|GEN"
 "PREFIX|w:CONJ+"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
julia> # or equivalent to
       crpsdata.data[!, :features]
128219-element Vector{String}:
 "PREFIX|bi+"
 "STEM|POS:N|LEM:{som|ROOT:smw|M|GEN"
 "STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN"
 "PREFIX|Al+"
 "STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN"
 "PREFIX|Al+"
 "STEM|POS:ADJ|LEM:r~aHiym|ROOT:rHm|MS|GEN"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:Hamod|ROOT:Hmd|M|NOM"
 "PREFIX|l:P+"
 ⋮
 "STEM|POS:N|LEM:Sador|ROOT:Sdr|MP|GEN"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
 "STEM|POS:P|LEM:min"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:jin~ap|ROOT:jnn|F|GEN"
 "PREFIX|w:CONJ+"
 "PREFIX|Al+"
 "STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"

Note

You need to install DataFrames.jl to successfully run the code.

using Pkg
Pkg.add("DataFrames")

To keep only the tokens whose features start with PREFIX, use occursin from Julia's Base module:

julia> filter(t -> occursin(r"^PREFIX", t.features), crpstbl)
28670×7 DataFrame
   Row │ chapter  verse  word   part   form    tag     features
       │ Int64    Int64  Int64  Int64  String  String  String
───────┼──────────────────────────────────────────────────────────────
     1 │       1      1      1      1  bi      P       PREFIX|bi+
     2 │       1      1      3      1  {l      DET     PREFIX|Al+
     3 │       1      1      4      1  {l      DET     PREFIX|Al+
     4 │       1      2      1      1  {lo     DET     PREFIX|Al+
     5 │       1      2      2      1  li      P       PREFIX|l:P+
     6 │       1      2      4      1  {lo     DET     PREFIX|Al+
     7 │       1      3      1      1  {l      DET     PREFIX|Al+
     8 │       1      3      2      1  {l      DET     PREFIX|Al+
   ⋮   │    ⋮       ⋮      ⋮      ⋮      ⋮       ⋮           ⋮
 28664 │     114      3      2      1  {l      DET     PREFIX|Al+
 28665 │     114      4      3      1  {lo     DET     PREFIX|Al+
 28666 │     114      4      4      1  {lo     DET     PREFIX|Al+
 28667 │     114      5      5      1  {l      DET     PREFIX|Al+
 28668 │     114      6      2      1  {lo     DET     PREFIX|Al+
 28669 │     114      6      3      1  wa      CONJ    PREFIX|w:CONJ+
 28670 │     114      6      3      2  {l      DET     PREFIX|Al+
                                                    28655 rows omitted
julia> # or equivalent to
       filter(t -> occursin(r"^PREFIX", t.features), crpsdata.data)
28670×7 DataFrame
   Row │ chapter  verse  word   part   form    tag     features
       │ Int64    Int64  Int64  Int64  String  String  String
───────┼──────────────────────────────────────────────────────────────
     1 │       1      1      1      1  bi      P       PREFIX|bi+
     2 │       1      1      3      1  {l      DET     PREFIX|Al+
     3 │       1      1      4      1  {l      DET     PREFIX|Al+
     4 │       1      2      1      1  {lo     DET     PREFIX|Al+
     5 │       1      2      2      1  li      P       PREFIX|l:P+
     6 │       1      2      4      1  {lo     DET     PREFIX|Al+
     7 │       1      3      1      1  {l      DET     PREFIX|Al+
     8 │       1      3      2      1  {l      DET     PREFIX|Al+
   ⋮   │    ⋮       ⋮      ⋮      ⋮      ⋮       ⋮           ⋮
 28664 │     114      3      2      1  {l      DET     PREFIX|Al+
 28665 │     114      4      3      1  {lo     DET     PREFIX|Al+
 28666 │     114      4      4      1  {lo     DET     PREFIX|Al+
 28667 │     114      5      5      1  {l      DET     PREFIX|Al+
 28668 │     114      6      2      1  {lo     DET     PREFIX|Al+
 28669 │     114      6      3      1  wa      CONJ    PREFIX|w:CONJ+
 28670 │     114      6      3      2  {l      DET     PREFIX|Al+
                                                    28655 rows omitted
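
A regular expression on the whole string is enough to pick out prefixes, but the features column packs several attributes into one pipe-separated string. As a rough sketch in Base Julia, the hypothetical helper feature_value below (a name invented here, not part of QuranTree.jl) pulls a single KEY:value attribute such as ROOT or LEM out of a features string copied from the output above.

```julia
# Hypothetical helper for illustration; it is not a QuranTree.jl function.
# Returns the value of the first "KEY:value" attribute in a pipe-separated
# features string, or nothing when the key is absent.
function feature_value(features::AbstractString, key::AbstractString)
    for item in split(features, '|')
        parts = split(item, ':'; limit=2)
        if length(parts) == 2 && parts[1] == key
            return String(parts[2])
        end
    end
    return nothing
end

stem = "STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN"
prefix = "PREFIX|Al+"

println(feature_value(stem, "ROOT"))    # nws
println(feature_value(stem, "LEM"))     # n~aAs
println(feature_value(prefix, "ROOT"))  # nothing
```

A function like this could be passed to filter in the same way as the occursin predicate, for example to keep only the rows whose features carry a given root.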

The main point here is that any data manipulation on CorpusData and TanzilData is done through the DataFrames.jl API.
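
As a small illustration of that point, the sketch below builds a toy DataFrame by hand from the first seven rows of the corpus output shown earlier, so it runs without loading the corpus, and applies a few standard DataFrames.jl operations to it. The variable names are made up for the example; on real data the same calls would be applied to crpstbl.

```julia
using DataFrames

# Toy table copied from the first seven rows of the corpus output.
toy = DataFrame(
    chapter  = fill(1, 7),
    verse    = fill(1, 7),
    word     = [1, 1, 2, 3, 3, 4, 4],
    part     = [1, 2, 1, 1, 2, 1, 2],
    form     = ["bi", "somi", "{ll~ahi", "{l", "r~aHoma`ni", "{l", "r~aHiymi"],
    tag      = ["P", "N", "PN", "DET", "ADJ", "DET", "ADJ"],
    features = ["PREFIX|bi+",
                "STEM|POS:N|LEM:{som|ROOT:smw|M|GEN",
                "STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN",
                "PREFIX|Al+",
                "STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN",
                "PREFIX|Al+",
                "STEM|POS:ADJ|LEM:r~aHiym|ROOT:rHm|MS|GEN"],
)

# `!` returns the stored column itself; `:` would return a copy.
forms = toy[!, :form]

# Column-based filter: keep rows whose features start with "PREFIX".
prefixes = filter(:features => startswith("PREFIX"), toy)

# Count how many segments carry each tag.
counts = combine(groupby(toy, :tag), nrow => :count)

println(nrow(prefixes))
println(counts)
```

The column-based form of filter used here is an alternative to the row-wise anonymous function shown earlier; both are part of DataFrames.jl.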