Orthographical Analysis

All Arabic characters, diacritics, and other characters used in Arabic texts such as the Qur'an are encoded as structs or types. These types have properties that can be used for orthographical analysis, namely the vocal and the numeral associated with each character.

Numerals

The numerals referred to here are the Abjad numerals.

julia> using Yunir
julia> ar_basmala = "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ";

To obtain the numerals, we first need to tokenize the text.

julia> arb_token = tokenize(ar_basmala)
4-element Vector{String}:
 "بِسْمِ"
 "ٱللَّهِ"
 "ٱلرَّحْمَٰنِ"
 "ٱلرَّحِيمِ"

Next, we parse each of these words as an Orthography.

julia> arb_parsed1 = parse(Orthography, arb_token[1])
Orthography(Type[Ba, Kasra, Seen, Sukun, Meem, Kasra])
julia> arb_parsed2 = parse.(Orthography, arb_token)
4-element Vector{Orthography}:
 Orthography(Type[Ba, Kasra, Seen, Sukun, Meem, Kasra])
 Orthography(Type[AlifHamzatWasl, Lam, Lam, Shadda, Fatha, Ha, Kasra])
 Orthography(Type[AlifHamzatWasl, Lam, Ra, Shadda, Fatha, HHa, Sukun, Meem, Fatha, AlifKhanjareeya, Noon, Kasra])
 Orthography(Type[AlifHamzatWasl, Lam, Ra, Shadda, Fatha, HHa, Kasra, Ya, Meem, Kasra])

Finally, we can compute the numerals of the parsed tokens as follows:

julia> numerals(arb_parsed2[1])
6-element Vector{Union{Nothing, Int64}}:
  2
   nothing
 60
   nothing
 40
   nothing
julia> numerals(arb_parsed2[2])
7-element Vector{Union{Nothing, Int64}}:
  1
 30
 30
   nothing
   nothing
  5
   nothing
julia> numerals(arb_parsed2[3])
12-element Vector{Union{Nothing, Int64}}:
   1
  30
 200
    nothing
    nothing
   8
    nothing
  40
    nothing
    nothing
  50
    nothing
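
A common follow-up is to add the Abjad values of a word into a single total. The sketch below is plain Julia and does not call Yunir: it copies the vector printed above for the first token and treats every `nothing` entry (the diacritics) as contributing no value. The helper name `abjad_total` is hypothetical and not part of the package.

```julia
# Values copied from the output of numerals(arb_parsed2[1]) shown above.
nums = Union{Nothing, Int64}[2, nothing, 60, nothing, 40, nothing]

# Hypothetical helper (not part of Yunir): replace each `nothing` with 0,
# then sum the resulting integers.
abjad_total(v) = sum(something.(v, 0))

total = abjad_total(nums)
println(total)
```

Under these assumptions the first token totals 2 + 60 + 40 = 102.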

We can also check whether each character is a Lunar or a Solar character. To do this, use isfeat (short for 'is feature', in the sense that characters here are also referred to as features).

julia> isfeat(arb_parsed2[1], AbstractLunar)
6-element BitVector:
 1
 0
 0
 0
 1
 0
julia> arb_parsed2[1][isfeat(arb_parsed2[1], AbstractLunar)]
2-element Vector{Type}:
 Ba
 Meem
julia> isfeat.(arb_parsed2, AbstractLunar)
4-element Vector{BitVector}:
 [1, 0, 0, 0, 1, 0]
 [1, 0, 0, 0, 0, 1, 0]
 [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
 [1, 0, 0, 0, 0, 1, 0, 1, 1, 0]
julia> isfeat.(arb_parsed2, AbstractSolar)
4-element Vector{BitVector}:
 [0, 0, 1, 0, 0, 0]
 [0, 1, 1, 0, 0, 0, 0]
 [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
 [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
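
Because the result is a BitVector, it can be counted and combined with the usual element-wise operators. The following is a minimal sketch in plain Julia, without Yunir, using the two masks printed above for the first token; the variable names are illustrative only.

```julia
# Masks copied from the isfeat output for the first token shown above.
lunar = BitVector(Bool[1, 0, 0, 0, 1, 0])
solar = BitVector(Bool[0, 0, 1, 0, 0, 0])

n_lunar = count(lunar)          # number of lunar letters
n_solar = count(solar)          # number of solar letters

# Positions flagged by neither mask; in this token these are the diacritics.
neither = .!(lunar .| solar)

println((n_lunar, n_solar, count(neither)))
```

For this token the sketch gives two lunar letters, one solar letter, and three positions that are neither.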

Vocals

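Vocals refer to the categorization of the characters based on the vocals each one mainly uses in pronunciation. For example, Ba is categorized as :labial and Seen as :sibilant, while diacritics have no category and return nothing.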

julia> vocals(arb_parsed2[1])
6-element Vector{Union{Nothing, Symbol}}:
 :labial
 nothing
 :sibilant
 nothing
 :labial
 nothing
julia> vocals(arb_parsed2[2])
7-element Vector{Union{Nothing, Symbol}}:
 :soft
 :liquid
 :liquid
 nothing
 nothing
 :guttural
 nothing
julia> vocals(arb_parsed2[3])
12-element Vector{Union{Nothing, Symbol}}:
 :soft
 :liquid
 :liquid
 nothing
 nothing
 :guttural
 nothing
 :labial
 nothing
 nothing
 :liquid
 nothing
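
One way to use this output is to tally how often each category occurs in a word. The sketch below is plain Julia and does not call Yunir: it copies the vector printed above for the second token and skips the `nothing` entries. The name `tally` is illustrative only.

```julia
# Values copied from the output of vocals(arb_parsed2[2]) shown above.
vs = Union{Nothing, Symbol}[:soft, :liquid, :liquid, nothing, nothing, :guttural, nothing]

# Count how often each category occurs, skipping entries without a category.
tally = Dict{Symbol, Int}()
for v in vs
    v === nothing && continue
    tally[v] = get(tally, v, 0) + 1
end

println(tally)
```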

Simple Encoding

Simple encoding is a spelled-out transliteration of the Arabic text, in which each character is written by name.

julia> parse(SimpleEncoding, ar_basmala)
"Ba+Kasra | Seen+Sukun | Meem+Kasra | <space> | AlifHamzatWasl | Lam | Lam+Shadda+Fatha | Ha+Kasra | <space> | AlifHamzatWasl | Lam | Ra+Shadda+Fatha | HHa+Sukun | Meem+Fatha+AlifKhanjareeya | Noon+Kasra | <space> | AlifHamzatWasl | Lam | Ra+Shadda+Fatha | HHa+Kasra | Ya | Meem+Kasra"
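
Judging from the output above, groups appear to be separated by " | " and the members of a group joined by "+", with the base character first and its diacritics after it. Assuming that layout holds, the encoded string can be split back into its parts with plain Julia. The sketch uses only the first word of the output and does not call Yunir.

```julia
# First word of the encoded string shown above.
encoded = "Ba+Kasra | Seen+Sukun | Meem+Kasra"

# Split into groups on " | ", then split each group on "+".
groups = [split(g, "+") for g in split(encoded, " | ")]

# Assumption: the first member of each group is the base character.
bases = [String(first(g)) for g in groups]

println(bases)
```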