Orthographical Analysis
All Arabic characters, diacritics, and other characters used in Arabic texts such as the Qur'an are encoded as structs or types. These types have properties that can be used for orthographical analysis: the vocal and the numeral associated with each character.
Numerals
The numerals referred to here are the Abjad numerals, the traditional system that assigns a numeric value to each Arabic letter.
julia> using Yunir
julia> ar_basmala = "بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ";
To get the numerals, we first need to tokenize the text.
julia> arb_token = tokenize(ar_basmala)
4-element Vector{String}: "بِسْمِ" "ٱللَّهِ" "ٱلرَّحْمَٰنِ" "ٱلرَّحِيمِ"
Next, we parse each of these words as Orthography.
julia> arb_parsed1 = parse(Orthography, arb_token[1])
Orthography(Type[Ba, Kasra, Seen, Sukun, Meem, Kasra])
julia> arb_parsed2 = parse.(Orthography, arb_token)
4-element Vector{Orthography}:
 Orthography(Type[Ba, Kasra, Seen, Sukun, Meem, Kasra])
 Orthography(Type[AlifHamzatWasl, Lam, Lam, Shadda, Fatha, Ha, Kasra])
 Orthography(Type[AlifHamzatWasl, Lam, Ra, Shadda, Fatha, HHa, Sukun, Meem, Fatha, AlifKhanjareeya, Noon, Kasra])
 Orthography(Type[AlifHamzatWasl, Lam, Ra, Shadda, Fatha, HHa, Kasra, Ya, Meem, Kasra])
Finally, we can compute the numerals of the parsed tokens as follows:
julia> numerals(arb_parsed2[1])
6-element Vector{Union{Nothing, Int64}}: 2 nothing 60 nothing 40 nothing
julia> numerals(arb_parsed2[2])
7-element Vector{Union{Nothing, Int64}}: 1 30 30 nothing nothing 5 nothing
julia> numerals(arb_parsed2[3])
12-element Vector{Union{Nothing, Int64}}: 1 30 200 nothing nothing 8 nothing 40 nothing nothing 50 nothing
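Entries for diacritics come back as nothing, so a word's total is the sum of the integer entries only. The following is a minimal sketch in plain Julia that does not call Yunir: the vector is copied by hand from the first result above, and abjad_total is a hypothetical helper name, not part of the package.

```julia
# Values copied from numerals(arb_parsed2[1]) shown above; `nothing` marks diacritics.
nums = Union{Nothing, Int}[2, nothing, 60, nothing, 40, nothing]

# Hypothetical helper (not part of Yunir): treat `nothing` as 0 and add up the rest.
abjad_total(values) = sum(something.(values, 0))

total = abjad_total(nums)
println(total)  # prints 102
```

Replacing nothing with 0 through something keeps the vector length intact, which may be convenient if the values are later aligned with the characters they came from.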
We can also check whether each character is a Lunar or Solar character. To do this, use isfeat (short for 'is feature', in the sense that characters here are also referred to as features).
julia> isfeat(arb_parsed2[1], AbstractLunar)
6-element BitVector: 1 0 0 0 1 0
julia> arb_parsed2[1][isfeat(arb_parsed2[1], AbstractLunar)]
2-element Vector{Type}: Ba Meem
julia> isfeat.(arb_parsed2, AbstractLunar)
4-element Vector{BitVector}:
 [1, 0, 0, 0, 1, 0]
 [1, 0, 0, 0, 0, 1, 0]
 [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
 [1, 0, 0, 0, 0, 1, 0, 1, 1, 0]
julia> isfeat.(arb_parsed2, AbstractSolar)
4-element Vector{BitVector}:
 [0, 0, 1, 0, 0, 0]
 [0, 1, 1, 0, 0, 0, 0]
 [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
 [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
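Because isfeat returns a boolean mask, masks can be combined with ordinary element-wise logic. The sketch below is plain Julia and does not call Yunir: the masks for the first word are copied by hand from the outputs above, plain strings stand in for the character types, and the variable names are illustrative only. It assumes that a position flagged as neither Lunar nor Solar is a diacritic, which holds for this word but is not confirmed here for every character.

```julia
# Character names and masks for the first word, copied from the outputs above.
chars = ["Ba", "Kasra", "Seen", "Sukun", "Meem", "Kasra"]
lunar = Bool[1, 0, 0, 0, 1, 0]
solar = Bool[0, 0, 1, 0, 0, 0]

# Positions flagged by neither mask (assumed here to be diacritics).
neither = .!(lunar .| solar)

println(chars[lunar])    # lunar letters
println(chars[solar])    # solar letters
println(chars[neither])  # remaining marks
println(count(lunar), " lunar, ", count(solar), " solar")
```

The same indexing pattern is what the earlier arb_parsed2[1][isfeat(arb_parsed2[1], AbstractLunar)] call relies on.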
Vocals
Vocals refer to the categorization of characters based on the vocal each one mainly uses in pronunciation, for example labial, sibilant, liquid, or guttural.
julia> vocals(arb_parsed2[1])
6-element Vector{Union{Nothing, Symbol}}: :labial nothing :sibilant nothing :labial nothing
julia> vocals(arb_parsed2[2])
7-element Vector{Union{Nothing, Symbol}}: :soft :liquid :liquid nothing nothing :guttural nothing
julia> vocals(arb_parsed2[3])
12-element Vector{Union{Nothing, Symbol}}: :soft :liquid :liquid nothing nothing :guttural nothing :labial nothing nothing :liquid nothing
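One way to summarize a word is to tally how often each category occurs. The following is a minimal sketch in plain Julia that does not call Yunir: the vector is copied by hand from the last result above, and tally is a hypothetical helper name, not part of the package.

```julia
# Categories copied from vocals(arb_parsed2[3]) shown above.
voc = Union{Nothing, Symbol}[:soft, :liquid, :liquid, nothing, nothing, :guttural,
                             nothing, :labial, nothing, nothing, :liquid, nothing]

# Hypothetical helper (not part of Yunir): count each category, skipping `nothing`.
function tally(categories)
    counts = Dict{Symbol, Int}()
    for c in categories
        c === nothing && continue
        counts[c] = get(counts, c, 0) + 1
    end
    return counts
end

counts = tally(voc)
println(counts)
```

For this word the tally has three :liquid entries and one each of :soft, :guttural, and :labial. A Dict does not guarantee print order, so the keys may appear in any order.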
Simple Encoding
Simple encoding is a worded, or spelled-out, transliteration of the Arabic text.
julia> parse(SimpleEncoding, ar_basmala)
"Ba+Kasra | Seen+Sukun | Meem+Kasra | <space> | AlifHamzatWasl | Lam | Lam+Shadda+Fatha | Ha+Kasra | <space> | AlifHamzatWasl | Lam | Ra+Shadda+Fatha | HHa+Sukun | Meem+Fatha+AlifKhanjareeya | Noon+Kasra | <space> | AlifHamzatWasl | Lam | Ra+Shadda+Fatha | HHa+Kasra | Ya | Meem+Kasra"
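In the output above, " | " separates units and "+" joins a letter to the marks that follow it. Under that assumption (inferred from the output, not from a documented format), the encoded string can be split back into parts with Base string functions. The sketch below is plain Julia and does not call Yunir; the string is the first word of the result, copied by hand.

```julia
# First word of the encoded basmala, copied from the output above.
encoded = "Ba+Kasra | Seen+Sukun | Meem+Kasra"

# Split into units on " | ", then split each unit into its letter and marks on "+".
units = split(encoded, " | ")
parts = [split(u, "+") for u in units]

# Print each letter followed by the marks attached to it.
for p in parts
    println(first(p), " => ", join(p[2:end], ", "))
end
```

Applied to the full string, the same split would also return the "<space>" units, which would need to be filtered out or used as word boundaries.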