Orthographical Analysis
All Arabic characters, diacritics, and other characters used in Arabic texts such as the Qur'an are encoded as structs or types. These types have properties that can be used for orthographical analysis, namely the vocal and the numeral associated with each character.
Numerals
The numerals referred to here are the Abjad numerals.
julia> using Yunir

julia> ar_basmala = "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ";
To obtain the numerals, we first need to tokenize the text.
julia> arb_token = tokenize(ar_basmala)
4-element Vector{String}:
 "بِسْمِ"
 "ٱللَّهِ"
 "ٱلرَّحْمَٰنِ"
 "ٱلرَّحِيمِ"
Next, we parse each of these words as an Orthography.
julia> arb_parsed1 = parse(Orthography, arb_token[1])
Orthography(Type[Ba, Kasra, Seen, Sukun, Meem, Kasra])

julia> arb_parsed2 = parse.(Orthography, arb_token)
4-element Vector{Orthography}:
 Orthography(Type[Ba, Kasra, Seen, Sukun, Meem, Kasra])
 Orthography(Type[AlifHamzatWasl, Lam, Lam, Shadda, Fatha, Ha, Kasra])
 Orthography(Type[AlifHamzatWasl, Lam, Ra, Shadda, Fatha, HHa, Sukun, Meem, Fatha, AlifKhanjareeya, Noon, Kasra])
 Orthography(Type[AlifHamzatWasl, Lam, Ra, Shadda, Fatha, HHa, Kasra, Ya, Meem, Kasra])
Finally, we can compute the numerals of the parsed tokens as follows:
julia> numerals(arb_parsed2[1])
6-element Vector{Union{Nothing, Int64}}:
  2
   nothing
 60
   nothing
 40
   nothing

julia> numerals(arb_parsed2[2])
7-element Vector{Union{Nothing, Int64}}:
  1
 30
 30
   nothing
   nothing
  5
   nothing

julia> numerals(arb_parsed2[3])
12-element Vector{Union{Nothing, Int64}}:
   1
  30
 200
    nothing
    nothing
   8
    nothing
  40
    nothing
    nothing
  50
    nothing
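A common next step is to add up the Abjad values of each word. The sketch below does not call Yunir at all: it copies the three `numerals` outputs shown above into plain vectors and sums them while skipping the `nothing` entries. The helper name `abjad_total` is hypothetical and is not part of the package.

```julia
# Abjad values copied from the `numerals` outputs shown above.
# `nothing` marks entries (here, the diacritics) that carry no Abjad value.
word_numerals = Vector{Union{Nothing,Int}}[
    [2, nothing, 60, nothing, 40, nothing],
    [1, 30, 30, nothing, nothing, 5, nothing],
    [1, 30, 200, nothing, nothing, 8, nothing, 40, nothing, nothing, 50, nothing],
]

# Hypothetical helper (not a Yunir function): sum the values, ignoring `nothing`.
abjad_total(vals) = sum(filter(!isnothing, vals); init=0)

totals = abjad_total.(word_numerals)
println(totals)  # [102, 66, 329]
```

With a live session, the same helper could presumably be applied directly to `numerals(arb_parsed2[1])` and the other parsed words, since those calls return vectors of the same shape.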
We can also check whether each character is a Lunar or a Solar character. To do this, use the isfeat function (short for 'is feature', since characters here are also referred to as features).
julia> isfeat(arb_parsed2[1], AbstractLunar)
6-element BitVector:
 1
 0
 0
 0
 1
 0

julia> arb_parsed2[1][isfeat(arb_parsed2[1], AbstractLunar)]
2-element Vector{Type}:
 Ba
 Meem

julia> isfeat.(arb_parsed2, AbstractLunar)
4-element Vector{BitVector}:
 [1, 0, 0, 0, 1, 0]
 [1, 0, 0, 0, 0, 1, 0]
 [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
 [1, 0, 0, 0, 0, 1, 0, 1, 1, 0]

julia> isfeat.(arb_parsed2, AbstractSolar)
4-element Vector{BitVector}:
 [0, 0, 1, 0, 0, 0]
 [0, 1, 1, 0, 0, 0, 0]
 [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
 [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
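Because `isfeat` returns boolean masks, the masks can be counted and combined with ordinary Julia operations. The following sketch works on the masks for the first word, copied from the outputs above as plain `Bool` vectors, so it runs without Yunir. The `labels` vector is only an illustrative stand-in for the parsed characters.

```julia
# Character names of the first word, in the order shown in the parsed output.
labels = ["Ba", "Kasra", "Seen", "Sukun", "Meem", "Kasra"]

# Masks copied from the isfeat outputs for the first word.
lunar = Bool[1, 0, 0, 0, 1, 0]
solar = Bool[0, 0, 1, 0, 0, 0]

# Count how many positions are flagged in each mask.
n_lunar = count(lunar)
n_solar = count(solar)

# Positions that are neither lunar nor solar.
other = .!(lunar .| solar)

println("lunar: ", labels[lunar])
println("solar: ", labels[solar])
println("neither: ", labels[other])
```

For this word the positions flagged by neither mask are exactly the diacritics (Kasra, Sukun, Kasra).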
Vocals
Vocals refer to the categorization of characters by the vocal category each one mainly uses in pronunciation, such as :labial, :sibilant, or :guttural.
julia> vocals(arb_parsed2[1])
6-element Vector{Union{Nothing, Symbol}}:
 :labial
 nothing
 :sibilant
 nothing
 :labial
 nothing

julia> vocals(arb_parsed2[2])
7-element Vector{Union{Nothing, Symbol}}:
 :soft
 :liquid
 :liquid
 nothing
 nothing
 :guttural
 nothing

julia> vocals(arb_parsed2[3])
12-element Vector{Union{Nothing, Symbol}}:
 :soft
 :liquid
 :liquid
 nothing
 nothing
 :guttural
 nothing
 :labial
 nothing
 nothing
 :liquid
 nothing
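One simple way to use these categories is to tally how often each occurs in a word. The sketch below is plain Julia over a vector copied from the third `vocals` output above; it assumes nothing about Yunir beyond the shape of that output.

```julia
# Vocal categories of the third word, copied from the output above.
word_vocals = Union{Nothing,Symbol}[
    :soft, :liquid, :liquid, nothing, nothing, :guttural,
    nothing, :labial, nothing, nothing, :liquid, nothing,
]

# Tally each category, skipping entries with no vocal category.
counts = Dict{Symbol,Int}()
for v in word_vocals
    v === nothing && continue
    counts[v] = get(counts, v, 0) + 1
end

for (category, n) in counts
    println(category, " => ", n)
end
```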
Simple Encoding
Simple encoding is a worded, or spelled-out, transliteration of the Arabic text.
julia> parse(SimpleEncoding, ar_basmala)
"Ba+Kasra | Seen+Sukun | Meem+Kasra | <space> | AlifHamzatWasl | Lam | Lam+Shadda+Fatha | Ha+Kasra | <space> | AlifHamzatWasl | Lam | Ra+Shadda+Fatha | HHa+Sukun | Meem+Fatha+AlifKhanjareeya | Noon+Kasra | <space> | AlifHamzatWasl | Lam | Ra+Shadda+Fatha | HHa+Kasra | Ya | Meem+Kasra"
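The encoded string appears to follow a regular layout: units are separated by ` | `, a letter and its diacritics are joined with `+`, and `<space>` marks a word boundary. Assuming that layout holds in general (it is inferred only from the output above), the string can be split back into words and units with Base functions. The sketch uses the first two words of the output and does not call Yunir.

```julia
# First two words of the SimpleEncoding output shown above.
encoded = "Ba+Kasra | Seen+Sukun | Meem+Kasra | <space> | AlifHamzatWasl | Lam | Lam+Shadda+Fatha | Ha+Kasra"

# Each word is a list of units; each unit is a letter followed by its diacritics.
words = Vector{Vector{Vector{String}}}()
push!(words, Vector{Vector{String}}())

for unit in split(encoded, " | ")
    if unit == "<space>"
        # Word boundary: start collecting a new word.
        push!(words, Vector{Vector{String}}())
    else
        push!(words[end], String.(split(unit, "+")))
    end
end

println(length(words))  # 2
println(words[1])
```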