Morphological Features
QuranTree.jl provides types for every morphological feature and part of speech in the Quranic Arabic Corpus.
Parsing
The features of each token are encoded as a raw String. To parse such a string into a morphological feature, use parse(QuranFeatures, x), where x is the raw String input. For example, the following parses the 2nd part of the 3rd word of the 1st verse of Chapter 1:
julia> using QuranTree
julia> using Yunir
julia> crps, tnzl = load(QuranData());
julia> crpsdata = table(crps);
julia> tnzldata = table(tnzl);
julia> crpsdata[1][1][3][2]
Chapter 1 ٱلْفَاتِحَة (The Opening)
Verse 1

1×5 DataFrame
 Row │ word   part   form        tag     features
     │ Int64  Int64  String      String  String
─────┼─────────────────────────────────────────────────────────────────────
   1 │     3      2  r~aHoma`ni  ADJ     STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:…
julia> token = crpsdata[1][1][3][2].data[!, :features]
1-element Vector{String}:
 "STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN"
julia> mfeat = parse(QuranFeatures, token[1])
Stem(:ADJ, ADJ, AbstractQuranFeature[Lemma("r~aHoma`n"), Root("rHm"), M, S, GEN])
julia> typeof(mfeat)
Stem
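To get a feel for what parse consumes, the raw string can be taken apart by hand. The sketch below uses only Base Julia and is not how QuranTree.jl parses internally; the variable names keyed and flags are illustrative assumptions. It separates the KEY:value fields from the bare flags of the raw string shown above.

```julia
# Raw feature string copied from the example above.
raw = "STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN"

keyed = Dict{String,String}()   # fields of the form KEY:value
flags = String[]                # bare fields such as STEM, MS, GEN

for field in split(raw, '|')
    if occursin(":", field)
        # Split on the first colon only, so the value may itself contain colons.
        key, value = split(field, ":"; limit=2)
        keyed[String(key)] = String(value)
    else
        push!(flags, String(field))
    end
end

println(keyed["POS"])   # ADJ
println(keyed["ROOT"])  # rHm
println(flags)          # ["STEM", "MS", "GEN"]
```

Note that the single raw flag MS appears as two separate features, M and S, in the parsed Stem above, which is one reason to rely on parse rather than on manual splitting.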
Extracting Detailed Description
To see a detailed description of the features, use the @desc macro.
julia> @desc mfeat
Stem
────
Adjective:
 ├ data: ADJ
 ├ desc: Adjective
 └ ar_label: صفة
Lemma:
 └ data: r~aHoma`n
Root:
 └ data: rHm
Masculine:
 ├ data: M
 ├ desc: Masculine
 └ ar_label: الجنس
Singular:
 ├ data: S
 ├ desc: Singular
 └ ar_label: العدد
Genetive:
 ├ data: GEN
 ├ desc: Genetive case
 └ ar_label: مجرور
Julia's dump function shows how to access the properties of the Stem object.
julia> dump(mfeat)
Stem
  data: Symbol ADJ
  pos: Adjective
    data: Symbol ADJ
    desc: String "Adjective"
    ar_label: String "صفة"
  feats: Array{AbstractQuranFeature}((5,))
    1: Lemma
      data: String "r~aHoma`n"
    2: Root
      data: String "rHm"
    3: Masculine
      data: Symbol M
      desc: String "Masculine"
      ar_label: String "الجنس"
    4: Singular
      data: Symbol S
      desc: String "Singular"
      ar_label: String "العدد"
    5: Genetive
      data: Symbol GEN
      desc: String "Genetive case"
      ar_label: String "مجرور"
julia> # access other feats of the token
       mfeat.feats
5-element Vector{AbstractQuranFeature}:
 Lemma("r~aHoma`n")
 Root("rHm")
 M
 S
 GEN
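The field names printed by dump (data, pos, feats, desc) can be used directly for property access. The following is a minimal sketch that relies only on those fields; the variable names are illustrative, and it assumes the Lemma type is exported, as its use with isfeat elsewhere on this page suggests.

```julia
using QuranTree

# Same raw string as the token parsed above.
mfeat = parse(QuranFeatures, "STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN")

# Fields of the part of speech, as listed by dump.
pos_tag  = mfeat.pos.data    # Symbol, :ADJ in the dump above
pos_desc = mfeat.pos.desc    # String, "Adjective" in the dump above

# Pick out features of one type from the feats vector.
lemmas = [f.data for f in mfeat.feats if f isa Lemma]

println(pos_tag, " / ", pos_desc)
println(lemmas)
```

For everyday use the accessor functions described later on this page are likely more convenient than reaching into feats by hand.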
Checking Parts of Speech
isfeat(token, pos) checks whether the parsed feature of token is a particular part of speech (pos). For example, the following checks whether mfeat above is, among other things, Masculine and Singular.
julia> isfeat(mfeat, Masculine)
true
julia> isfeat(mfeat, Feminine)
false
julia> isfeat(mfeat, Singular)
true
julia> isfeat(mfeat, Adjective) && isfeat(mfeat, Genetive)
true
Another example checks whether the token has both Root and Lemma features.
julia> isfeat(mfeat, Root) && isfeat(mfeat, Lemma)
true
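When several checks are combined, chaining && can get long. A small helper can fold the checks into one call. The sketch below is an assumption-laden illustration: hasall is a hypothetical name and not part of QuranTree.jl; only parse and isfeat come from the package.

```julia
using QuranTree

# Hypothetical helper (not part of QuranTree.jl): true only when isfeat
# succeeds for every listed type.
hasall(feat, types...) = all(T -> isfeat(feat, T), types)

mfeat = parse(QuranFeatures, "STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN")

# Each of these types was individually reported as true above,
# so the combined check is expected to be true as well.
println(hasall(mfeat, Adjective, Masculine, Singular, Genetive))

# Feminine was reported as false above, so this is expected to be false.
println(hasall(mfeat, Adjective, Feminine))
```

Because all short-circuits, the helper stops at the first type that is not present, just as a chain of && would.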
Lemma, Root and Special
The root, lemma and special functions extract the Root, Lemma and Special morphological features, respectively.
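Not every token carries all three features, so a guard with isfeat can precede the extraction. The sketch below is illustrative only: it assumes that isfeat accepts the Special type in the same way it accepts Root and Lemma, and the variable names are made up for the example. Both raw strings are taken from this page.

```julia
using QuranTree

raws = [
    "STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN",  # has Root, no Special
    "STEM|POS:NEG|LEM:laA|SP:<in~",                # has Special, no Root
]

for raw in raws
    feat = parse(QuranFeatures, raw)
    # Only call an accessor when the corresponding feature is present.
    r = isfeat(feat, Root)    ? root(feat)    : nothing
    s = isfeat(feat, Special) ? special(feat) : nothing
    println(lemma(feat), " => root: ", r, ", special: ", s)
end
```

Returning nothing for an absent feature is a design choice of this sketch, not documented behaviour of the package.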
julia> root(mfeat)
"rHm"
julia> lemma(mfeat)
"r~aHoma`n"
julia> arabic(root(mfeat))
"رحم"
julia> arabic(lemma(mfeat))
"رَّحْمَٰن"
The following example shows a token with a Special feature:
julia> token2 = crpsdata.data[!, :features][53]
"STEM|POS:NEG|LEM:laA|SP:<in~"
julia> mfeat2 = parse(QuranFeatures, token2)
Stem(:NEG, NEG, AbstractQuranFeature[Lemma("laA"), Special("<in~")])
julia> special(mfeat2)
"<in~"
julia> arabic(special(mfeat2))
"إِنّ"
Implied Verb Features
Some features of Quranic Arabic Verbs are implied. For example, the Voice feature of a Verb defaults to Active voice, the Mood feature defaults to Indicative mood, and the Verb form feature defaults to First form.
julia> token3 = crpsdata.data[!, :features][27]
"STEM|POS:V|IMPF|(X)|LEM:{sotaEiynu|ROOT:Ewn|1P"
token3 is a Verb whose raw string states the Verb form, (X), but no Mood or Voice feature. Parsing it automatically adds the default values of the missing features, as shown below:
julia> mfeat3 = parse(QuranFeatures, token3)
Stem(:V, V, AbstractQuranFeature[Lemma("{sotaEiynu"), Root("Ewn"), IMPF, X, 1, P, IND, ACT])
julia> @desc mfeat3
Stem
────
Verb:
 ├ data: V
 ├ desc: Verb
 └ ar_label: فعل
Lemma:
 └ data: {sotaEiynu
Root:
 └ data: Ewn
Imperfect:
 ├ data: IMPF
 ├ desc: Imperfect verb
 └ ar_label: فعل مضارع
VerbFormX:
 ├ data: X
 ├ desc: Tenth verb form
 └ ar_label: فعل
FirstPerson:
 ├ data: 1
 ├ desc: First person
 └ ar_label: الاسناد
Plural:
 ├ data: P
 ├ desc: Plural
 └ ar_label: العدد
Indicative:
 ├ data: IND
 ├ desc: Indicative mood (default)
 └ ar_label: مرفوع
Active:
 ├ data: ACT
 ├ desc: Active voice (default)
 └ ar_label: مبني للمعلوم
Another example where the Voice feature of the Verb is implied:
julia> token4 = crpsdata.data[!, :features][27]
"STEM|POS:V|IMPF|(X)|LEM:{sotaEiynu|ROOT:Ewn|1P"
julia> mfeat4 = parse(QuranFeatures, token4)
Stem(:V, V, AbstractQuranFeature[Lemma("{sotaEiynu"), Root("Ewn"), IMPF, X, 1, P, IND, ACT])
julia> @desc mfeat4
Stem
────
Verb:
 ├ data: V
 ├ desc: Verb
 └ ar_label: فعل
Lemma:
 └ data: {sotaEiynu
Root:
 └ data: Ewn
Imperfect:
 ├ data: IMPF
 ├ desc: Imperfect verb
 └ ar_label: فعل مضارع
VerbFormX:
 ├ data: X
 ├ desc: Tenth verb form
 └ ar_label: فعل
FirstPerson:
 ├ data: 1
 ├ desc: First person
 └ ar_label: الاسناد
Plural:
 ├ data: P
 ├ desc: Plural
 └ ar_label: العدد
Indicative:
 ├ data: IND
 ├ desc: Indicative mood (default)
 └ ar_label: مرفوع
Active:
 ├ data: ACT
 ├ desc: Active voice (default)
 └ ar_label: مبني للمعلوم
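One way to see which features were stated and which were filled in is to compare the raw fields with the parsed object. The sketch below is a hedged illustration: it assumes that Indicative, Active and VerbFormX are exported and work with isfeat like the types used earlier on this page; the variable stated is a name introduced for this example.

```julia
using QuranTree

raw  = "STEM|POS:V|IMPF|(X)|LEM:{sotaEiynu|ROOT:Ewn|1P"
feat = parse(QuranFeatures, raw)

# Fields literally present in the raw string.
stated = split(raw, '|')
println("IND" in stated)   # false: mood is not stated
println("ACT" in stated)   # false: voice is not stated
println("(X)" in stated)   # true: verb form is stated

# The parsed object nevertheless carries the defaults, as the
# Stem printed above lists IND and ACT among its features.
println(isfeat(feat, Indicative))
println(isfeat(feat, Active))
println(isfeat(feat, VerbFormX))
```

A check of this kind is useful when it matters whether a feature came from the corpus annotation or from a default.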
POS Abstract Types
The table below lists every part of speech with its corresponding type. Each type has a parent type, which is its supertype in the type hierarchy. This is useful for grouping. For example, instead of chaining || (or) to find all tokens that are FirstPerson, SecondPerson, or ThirdPerson, the parent type AbstractPerson can be used.
julia> # without using parent type
       function allpersons(row)
           rfeat = parse(QuranFeatures, row.features)
           is1st = isfeat(rfeat, FirstPerson)
           is2nd = isfeat(rfeat, SecondPerson)
           is3rd = isfeat(rfeat, ThirdPerson)
           return is1st || is2nd || is3rd
       end
allpersons (generic function with 1 method)
julia> tbl1 = filter(allpersons, crpsdata.data);
julia> tbl1[!, [:form, :features]]
44092×2 DataFrame
   Row │ form         features
       │ String       String
───────┼────────────────────────────────────────────────
     1 │ <iy~aAka     STEM|POS:PRON|LEM:<iy~aA|2MS
     2 │ naEobudu     STEM|POS:V|IMPF|LEM:Eabada|ROOT:…
     3 │ <iy~aAka     STEM|POS:PRON|LEM:<iy~aA|2MS
     4 │ nasotaEiynu  STEM|POS:V|IMPF|(X)|LEM:{sotaEiy…
     5 │ {hodi        STEM|POS:V|IMPV|LEM:hadaY|ROOT:h…
     6 │ naA          SUFFIX|PRON:1P
     7 │ >anoEamo     STEM|POS:V|PERF|(IV)|LEM:>anoEam…
     8 │ ta           SUFFIX|PRON:2MS
   ⋮   │      ⋮                      ⋮
 44086 │ >aEuw*u      STEM|POS:V|IMPF|LEM:Eu*o|ROOT:Ew…
 44087 │ xalaqa       STEM|POS:V|PERF|LEM:xalaqa|ROOT:…
 44088 │ waqaba       STEM|POS:V|PERF|LEM:waqaba|ROOT:…
 44089 │ Hasada       STEM|POS:V|PERF|LEM:Hasada|ROOT:…
 44090 │ qulo         STEM|POS:V|IMPV|LEM:qaAla|ROOT:q…
 44091 │ >aEuw*u      STEM|POS:V|IMPF|LEM:Eu*o|ROOT:Ew…
 44092 │ yuwasowisu   STEM|POS:V|IMPF|LEM:wasowasa|ROO…
                                    44077 rows omitted
julia> # using parent type
       tbl2 = filter(row -> isfeat(parse(QuranFeatures, row.features), AbstractPerson), crpsdata.data);
julia> tbl2[!, [:form, :features]]
44092×2 DataFrame
   Row │ form         features
       │ String       String
───────┼────────────────────────────────────────────────
     1 │ <iy~aAka     STEM|POS:PRON|LEM:<iy~aA|2MS
     2 │ naEobudu     STEM|POS:V|IMPF|LEM:Eabada|ROOT:…
     3 │ <iy~aAka     STEM|POS:PRON|LEM:<iy~aA|2MS
     4 │ nasotaEiynu  STEM|POS:V|IMPF|(X)|LEM:{sotaEiy…
     5 │ {hodi        STEM|POS:V|IMPV|LEM:hadaY|ROOT:h…
     6 │ naA          SUFFIX|PRON:1P
     7 │ >anoEamo     STEM|POS:V|PERF|(IV)|LEM:>anoEam…
     8 │ ta           SUFFIX|PRON:2MS
   ⋮   │      ⋮                      ⋮
 44086 │ >aEuw*u      STEM|POS:V|IMPF|LEM:Eu*o|ROOT:Ew…
 44087 │ xalaqa       STEM|POS:V|PERF|LEM:xalaqa|ROOT:…
 44088 │ waqaba       STEM|POS:V|PERF|LEM:waqaba|ROOT:…
 44089 │ Hasada       STEM|POS:V|PERF|LEM:Hasada|ROOT:…
 44090 │ qulo         STEM|POS:V|IMPV|LEM:qaAla|ROOT:q…
 44091 │ >aEuw*u      STEM|POS:V|IMPF|LEM:Eu*o|ROOT:Ew…
 44092 │ yuwasowisu   STEM|POS:V|IMPF|LEM:wasowasa|ROO…
                                    44077 rows omitted
julia> sum(tbl1[!, :features] .!== tbl2[!, :features])
0
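The reason a parent type can stand in for several checks is Julia's subtyping: a value of a concrete type is also an instance of each of its abstract supertypes. The self-contained sketch below mimics that pattern in Base Julia. All names in it (DemoFeature, DemoPerson, hasfeat, and so on) are hypothetical stand-ins and do not reflect how QuranTree.jl defines its types or implements isfeat.

```julia
# Hypothetical miniature hierarchy, for illustration only.
abstract type DemoFeature end
abstract type DemoPerson <: DemoFeature end
abstract type DemoNumber <: DemoFeature end

struct DemoFirst  <: DemoPerson end
struct DemoSecond <: DemoPerson end
struct DemoPlural <: DemoNumber end

# Stand-in for a feature check: true if any feature is an instance of T.
# T may be a concrete type or an abstract parent type.
hasfeat(feats, T::Type) = any(f -> f isa T, feats)

feats = DemoFeature[DemoFirst(), DemoPlural()]

println(hasfeat(feats, DemoFirst))    # true:  exact concrete type
println(hasfeat(feats, DemoPerson))   # true:  matched through the parent type
println(hasfeat(feats, DemoSecond))   # false: no second-person feature
println(hasfeat(feats, DemoNumber))   # true:  DemoPlural is a DemoNumber
```

A single check against the parent type therefore covers every current and future subtype, which is what makes the AbstractPerson filter above equivalent to the three-way ||.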
Part of Speech Types
Type | Parent Type | Tag | Description | Arabic Name |
Noun | AbstractNoun | Symbol("N") | Noun | اسم |
ProperNoun | AbstractNoun | Symbol("PN") | Proper noun | اسم علم |
Adjective | AbstractDerivedNominal | Symbol("ADJ") | Adjective | صفة |
ImperativeVerbalNoun | AbstractDerivedNominal | Symbol("IMPN") | Imperative verbal noun | اسم فعل أمر |
Personal | AbstractPronoun | Symbol("PRON") | Personal pronoun | ضمير |
Demonstrative | AbstractPronoun | Symbol("DEM") | Demonstrative pronoun | اسم اشارة |
Relative | AbstractPronoun | Symbol("REL") | Relative pronoun | اسم موصول |
Time | AbstractAdverb | Symbol("T") | Time adverb | ظرف زمان |
Location | AbstractAdverb | Symbol("LOC") | Location adverb | ظرف مكان |
Preposition | AbstractPreposition | Symbol("P") | Preposition | حرف جر |
EmphaticLam | AbstractPrefix | Symbol("EMPH") | Emphatic lam prefix | لام التوكيد |
ImperativeLam | AbstractPrefix | Symbol("IMPV") | Imperative lam prefix | لام الامر |
PurposeLam | AbstractPrefix | Symbol("PRP") | Purpose lam prefix | لام التعليل |
EmphaticNun | AbstractPrefix | Symbol("+n:EMPH") | Emphatic nun | نون التوكيد |
Coordinating | AbstractConjunction | Symbol("CONJ") | Coordinating conjunction | حرف عطف |
Subordinating | AbstractConjunction | Symbol("SUB") | Subordinating particle | حرف مصدري |
Accusative | AbstractParticle | Symbol("ACC") | Accusative particle | حرف نصب |
Amendment | AbstractParticle | Symbol("AMD") | Amendment particle | حرف استدراك |
Answer | AbstractParticle | Symbol("ANS") | Answer particle | حرف جواب |
Aversion | AbstractParticle | Symbol("AVR") | Aversion particle | حرف ردع |
Cause | AbstractParticle | Symbol("CAUS") | Particle of cause | حرف سببية |
Certainty | AbstractParticle | Symbol("CERT") | Particle of certainty | حرف تحقيق |
Circumstantial | AbstractParticle | Symbol("CIRC") | Circumstantial particle | حرف حال |
Comitative | AbstractParticle | Symbol("COM") | Comitative particle | واو المعية |
Conditional | AbstractParticle | Symbol("COND") | Conditional particle | حرف شرط |
Equalization | AbstractParticle | Symbol("EQ") | Equalization particle | حرف تسوية |
Exhortation | AbstractParticle | Symbol("EXH") | Exhortation particle | حرف تحضيض |
Explanation | AbstractParticle | Symbol("EXL") | Explanation particle | حرف تفصيل |
Exceptive | AbstractParticle | Symbol("EXP") | Exceptive particle | أداة استثناء |
Future | AbstractParticle | Symbol("FUT") | Future particle | حرف استقبال |
Inceptive | AbstractParticle | Symbol("INC") | Inceptive particle | حرف ابتداء |
Interpretation | AbstractParticle | Symbol("INT") | Particle of interpretation | حرف تفسير |
Interogative | AbstractParticle | Symbol("INTG") | Interrogative particle | حرف استفهام |
Negative | AbstractParticle | Symbol("NEG") | Negative particle | حرف نفي |
Preventive | AbstractParticle | Symbol("PREV") | Preventive particle | حرف كاف |
Prohibition | AbstractParticle | Symbol("PRO") | Prohibition particle | حرف نهي |
Resumption | AbstractParticle | Symbol("REM") | Resumption particle | |
Restriction | AbstractParticle | Symbol("RES") | Restriction particle | أداة حصر |
Retraction | AbstractParticle | Symbol("RET") | Retraction particle | حرف اضراب |
Result | AbstractParticle | Symbol("RSLT") | Result particle | حرف واقع في جواب الشرط |
Supplemental | AbstractParticle | Symbol("SUP") | Supplemental particle | حرف زائد |
Surprise | AbstractParticle | Symbol("SUR") | Surprise particle | حرف فجاءة |
Vocative | AbstractParticle | Symbol("VOC") | Vocative particle | حرف نداء |
DisconnectedLetters | AbstractDisLetters | Symbol("INL") | Quranic initials | حروف مقطعة |
FirstPerson | AbstractPerson | Symbol("1") | First person | الاسناد |
SecondPerson | AbstractPerson | Symbol("2") | Second person | الاسناد |
ThirdPerson | AbstractPerson | Symbol("3") | Third person | الاسناد |
Masculine | AbstractGender | Symbol("M") | Masculine | الجنس |
Feminine | AbstractGender | Symbol("F") | Feminine | الجنس |
Singular | AbstractNumber | Symbol("S") | Singular | العدد |
Dual | AbstractNumber | Symbol("D") | Dual | العدد |
Plural | AbstractNumber | Symbol("P") | Plural | العدد |
Verb | AbstractPartOfSpeech | Symbol("V") | Verb | فعل |
Perfect | AbstractAspect | Symbol("PERF") | Perfect verb | فعل ماض |
Imperfect | AbstractAspect | Symbol("IMPF") | Imperfect verb | فعل مضارع |
Imperative | AbstractAspect | Symbol("IMPV") | Imperative verb | فعل أمر |
Indicative | AbstractMood | Symbol("IND") | Indicative mood (default) | مرفوع |
Subjunctive | AbstractMood | Symbol("SUBJ") | Subjunctive mood | منصوب |
Jussive | AbstractMood | Symbol("JUS") | Jussive mood | مجزوم |
Active | AbstractVoice | Symbol("ACT") | Active voice (default) | مبني للمعلوم |
Passive | AbstractVoice | Symbol("PASS") | Passive voice | مبني للمجهول |
VerbFormI | AbstractForm | Symbol("I") | First verb form (default) | فعل |
VerbFormII | AbstractForm | Symbol("II") | Second verb form | فعل |
VerbFormIII | AbstractForm | Symbol("III") | Third verb form | فعل |
VerbFormIV | AbstractForm | Symbol("IV") | Fourth verb form | فعل |
VerbFormV | AbstractForm | Symbol("V") | Fifth verb form | فعل |
VerbFormVI | AbstractForm | Symbol("VI") | Sixth verb form | فعل |
VerbFormVII | AbstractForm | Symbol("VII") | Seventh verb form | فعل |
VerbFormVIII | AbstractForm | Symbol("VIII") | Eighth verb form | فعل |
VerbFormIX | AbstractForm | Symbol("IX") | Ninth verb form | فعل |
VerbFormX | AbstractForm | Symbol("X") | Tenth verb form | فعل |
VerbFormXI | AbstractForm | Symbol("XI") | Eleventh verb form | فعل |
VerbFormXII | AbstractForm | Symbol("XII") | Twelfth verb form | فعل |
ActiveParticle | AbstractDerivedNoun | Symbol("ACT PCPL") | Active participle | اسم فاعل |
PassiveParticle | AbstractDerivedNoun | Symbol("PASS PCPL") | Passive participle | اسم مفعول |
VerbalNoun | AbstractDerivedNoun | Symbol("VN") | Verbal noun | مصدر |
Definite | AbstractState | Symbol("DEF") | Definite state | معرفة |
Indefinite | AbstractState | Symbol("INDEF") | Indefinite state | نكرة |
Nominative | AbstractCase | Symbol("NOM") | Nominative case | مرفوع |
Genetive | AbstractCase | Symbol("GEN") | Genetive case | مجرور |