Topic Modeling

Another application of Natural Language Processing is Topic Modeling, which aims to extract the topics discussed in a given document. In this section, we apply it to Chapter 18 (The Cave) of the Qur'an, using the TextAnalysis.jl library with Latent Dirichlet Allocation (LDA) as the model. To start, load the data as follows:

julia> using QuranTree
julia> using TextAnalysis
julia> using Yunir
julia> crps, tnzl = QuranData() |> load;
julia> crpsdata = table(crps)
Quranic Arabic Corpus (morphology)
(C) 2011 Kais Dukes

128219×7 DataFrame
    Row │ chapter  verse  word   part   form        tag     features           ⋯
        │ Int64    Int64  Int64  Int64  String      String  String             ⋯
────────┼───────────────────────────────────────────────────────────────────────
      1 │       1      1      1      1  bi          P       PREFIX|bi+         ⋯
      2 │       1      1      1      2  somi        N       STEM|POS:N|LEM:{so
      3 │       1      1      2      1  {ll~ahi     PN      STEM|POS:PN|LEM:{l
      4 │       1      1      3      1  {l          DET     PREFIX|Al+
      5 │       1      1      3      2  r~aHoma`ni  ADJ     STEM|POS:ADJ|LEM:r ⋯
      6 │       1      1      4      1  {l          DET     PREFIX|Al+
      7 │       1      1      4      2  r~aHiymi    ADJ     STEM|POS:ADJ|LEM:r
      8 │       1      2      1      1  {lo         DET     PREFIX|Al+
    ⋮   │    ⋮       ⋮      ⋮      ⋮        ⋮         ⋮             ⋮          ⋱
 128213 │     114      5      5      2  n~aAsi      N       STEM|POS:N|LEM:n~a ⋯
 128214 │     114      6      1      1  mina        P       STEM|POS:P|LEM:min
 128215 │     114      6      2      1  {lo         DET     PREFIX|Al+
 128216 │     114      6      2      2  jin~api     N       STEM|POS:N|LEM:jin
 128217 │     114      6      3      1  wa          CONJ    PREFIX|w:CONJ+     ⋯
 128218 │     114      6      3      2  {l          DET     PREFIX|Al+
 128219 │     114      6      3      3  n~aAsi      N       STEM|POS:N|LEM:n~a
                                              1 column and 128204 rows omitted
Note

You need to install Yunir.jl and TextAnalysis.jl to successfully run the code.

using Pkg
Pkg.add("Yunir")
Pkg.add("TextAnalysis")

Data Preprocessing

The first preprocessing step is the removal of all Disconnected Letters (like الٓمٓ and الٓمٓصٓ, among others), Prepositions, Particles, Conjunctions, Pronouns, and Adverbs. This is done as follows:

julia> function preprocess(s::String)
           feat = parse(QuranFeatures, s)
           disletters = isfeat(feat, AbstractDisLetters)
           prepositions = isfeat(feat, AbstractPreposition)
           particles = isfeat(feat, AbstractParticle)
           conjunctions = isfeat(feat, AbstractConjunction)
           pronouns = isfeat(feat, AbstractPronoun)
           adverbs = isfeat(feat, AbstractAdverb)
       
           return !disletters && !prepositions && !particles && !conjunctions && !pronouns && !adverbs
       end
preprocess (generic function with 1 method)
julia> crpstbl = filter(t -> preprocess(t.features), crpsdata[18].data)
827×7 DataFrame
 Row │ chapter  verse  word   part   form       tag     features               ⋯
     │ Int64    Int64  Int64  Int64  String     String  String                 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │      18      1      1      2  Hamodu     N       STEM|POS:N|LEM:Hamod|R ⋯
   2 │      18      1      2      2  l~ahi      PN      STEM|POS:PN|LEM:{ll~ah
   3 │      18      1      4      1  >anzala    V       STEM|POS:V|PERF|(IV)|L
   4 │      18      1      6      1  Eabodi     N       STEM|POS:N|LEM:Eabod|R
   5 │      18      1      9      1  yajoEal    V       STEM|POS:V|IMPF|LEM:ja ⋯
   6 │      18      1     11      1  EiwajaA    N       STEM|POS:N|LEM:Eiwaj|R
   7 │      18      2      2      1  l~i        PRP     PREFIX|l:PRP+
   8 │      18      2      2      2  yun*ira    V       STEM|POS:V|IMPF|(IV)|L
  ⋮  │    ⋮       ⋮      ⋮      ⋮        ⋮        ⋮                ⋮           ⋱
 821 │      18    110     14      1  yarojuwA@  V       STEM|POS:V|IMPF|LEM:ya ⋯
 822 │      18    110     16      1  rab~i      N       STEM|POS:N|LEM:rab~|RO
 823 │      18    110     17      2  lo         IMPV    PREFIX|l:IMPV+
 824 │      18    110     17      3  yaEomalo   V       STEM|POS:V|IMPF|LEM:Ea
 825 │      18    110     21      1  yu$oriko   V       STEM|POS:V|IMPF|(IV)|L ⋯
 826 │      18    110     22      2  EibaAdapi  N       STEM|POS:N|LEM:EibaAda
 827 │      18    110     23      1  rab~i      N       STEM|POS:N|LEM:rab~|RO
                                                  1 column and 812 rows omitted

Next, we create a copy of the above data, so that the original state is preserved, and do all further processing on the copy.

julia> crpsnew = deepcopy(crpstbl)
827×7 DataFrame
 Row │ chapter  verse  word   part   form       tag     features               ⋯
     │ Int64    Int64  Int64  Int64  String     String  String                 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │      18      1      1      2  Hamodu     N       STEM|POS:N|LEM:Hamod|R ⋯
   2 │      18      1      2      2  l~ahi      PN      STEM|POS:PN|LEM:{ll~ah
   3 │      18      1      4      1  >anzala    V       STEM|POS:V|PERF|(IV)|L
   4 │      18      1      6      1  Eabodi     N       STEM|POS:N|LEM:Eabod|R
   5 │      18      1      9      1  yajoEal    V       STEM|POS:V|IMPF|LEM:ja ⋯
   6 │      18      1     11      1  EiwajaA    N       STEM|POS:N|LEM:Eiwaj|R
   7 │      18      2      2      1  l~i        PRP     PREFIX|l:PRP+
   8 │      18      2      2      2  yun*ira    V       STEM|POS:V|IMPF|(IV)|L
  ⋮  │    ⋮       ⋮      ⋮      ⋮        ⋮        ⋮                     ⋮      ⋱
 821 │      18    110     14      1  yarojuwA@  V       STEM|POS:V|IMPF|LEM:ya ⋯
 822 │      18    110     16      1  rab~i      N       STEM|POS:N|LEM:rab~|RO
 823 │      18    110     17      2  lo         IMPV    PREFIX|l:IMPV+
 824 │      18    110     17      3  yaEomalo   V       STEM|POS:V|IMPF|LEM:Ea
 825 │      18    110     21      1  yu$oriko   V       STEM|POS:V|IMPF|(IV)|L ⋯
 826 │      18    110     22      2  EibaAdapi  N       STEM|POS:N|LEM:EibaAda
 827 │      18    110     23      1  rab~i      N       STEM|POS:N|LEM:rab~|RO
                                                   1 column and 812 rows omitted
julia> feats = crpsnew[!, :features]
827-element Vector{String}:
 "STEM|POS:N|LEM:Hamod|ROOT:Hmd|M|NOM"
 "STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN"
 "STEM|POS:V|PERF|(IV)|LEM:>anzala|ROOT:nzl|3MS"
 "STEM|POS:N|LEM:Eabod|ROOT:Ebd|M|GEN"
 "STEM|POS:V|IMPF|LEM:jaEala|ROOT:jEl|3MS|MOOD:JUS"
 "STEM|POS:N|LEM:Eiwaj|ROOT:Ewj|M|NOM"
 "PREFIX|l:PRP+"
 "STEM|POS:V|IMPF|(IV)|LEM:>an*ara|ROOT:n*r|3MS|MOOD:SUBJ"
 "STEM|POS:N|LEM:l~adun|ROOT:ldn|GEN"
 "STEM|POS:V|IMPF|(II)|LEM:bu\$~ira|ROOT:b\$r|3MS|MOOD:SUBJ"
 ⋮
 "STEM|POS:ADJ|LEM:wa`Hid|ROOT:wHd|MS|INDEF|NOM"
 "STEM|POS:V|PERF|LEM:kaAna|ROOT:kwn|SP:kaAn|3MS"
 "STEM|POS:V|IMPF|LEM:yarojuwA@|ROOT:rjw|3MS"
 "STEM|POS:N|LEM:rab~|ROOT:rbb|M|GEN"
 "PREFIX|l:IMPV+"
 "STEM|POS:V|IMPF|LEM:Eamila|ROOT:Eml|3MS|MOOD:JUS"
 "STEM|POS:V|IMPF|(IV)|LEM:>a\$oraka|ROOT:\$rk|3MS|MOOD:JUS"
 "STEM|POS:N|LEM:EibaAdat|ROOT:Ebd|F|GEN"
 "STEM|POS:N|LEM:rab~|ROOT:rbb|M|GEN"
julia> feats = parse.(QuranFeatures, feats)
827-element Vector{AbstractQuranFeature}:
 Stem(:N, N, AbstractQuranFeature[Lemma("Hamod"), Root("Hmd"), M, NOM])
 Stem(:PN, PN, AbstractQuranFeature[Lemma("{ll~ah"), Root("Alh"), GEN])
 Stem(:V, V, AbstractQuranFeature[Lemma(">anzala"), Root("nzl"), PERF, IV, 3, M, S, IND, ACT])
 Stem(:N, N, AbstractQuranFeature[Lemma("Eabod"), Root("Ebd"), M, GEN])
 Stem(:V, V, AbstractQuranFeature[Lemma("jaEala"), Root("jEl"), JUS, IMPF, 3, M, S, ACT, I])
 Stem(:N, N, AbstractQuranFeature[Lemma("Eiwaj"), Root("Ewj"), M, NOM])
 Prefix(Symbol("l:PRP+"), PRP)
 Stem(:V, V, AbstractQuranFeature[Lemma(">an*ara"), Root("n*r"), SUBJ, IMPF, IV, 3, M, S, ACT])
 Stem(:N, N, AbstractQuranFeature[Lemma("l~adun"), Root("ldn"), GEN])
 Stem(:V, V, AbstractQuranFeature[Lemma("bu\$~ira"), Root("b\$r"), SUBJ, IMPF, II, 3, M, S, ACT])
 ⋮
 Stem(:ADJ, ADJ, AbstractQuranFeature[Lemma("wa`Hid"), Root("wHd"), M, S, INDEF, NOM])
 Stem(:V, V, AbstractQuranFeature[Lemma("kaAna"), Root("kwn"), Special("kaAn"), PERF, 3, M, S, IND, ACT, I])
 Stem(:V, V, AbstractQuranFeature[Lemma("yarojuwA@"), Root("rjw"), IMPF, 3, M, S, IND, ACT, I])
 Stem(:N, N, AbstractQuranFeature[Lemma("rab~"), Root("rbb"), M, GEN])
 Prefix(Symbol("l:IMPV+"), IMPV)
 Stem(:V, V, AbstractQuranFeature[Lemma("Eamila"), Root("Eml"), JUS, IMPF, 3, M, S, ACT, I])
 Stem(:V, V, AbstractQuranFeature[Lemma(">a\$oraka"), Root("\$rk"), JUS, IMPF, IV, 3, M, S, ACT])
 Stem(:N, N, AbstractQuranFeature[Lemma("EibaAdat"), Root("Ebd"), F, GEN])
 Stem(:N, N, AbstractQuranFeature[Lemma("rab~"), Root("rbb"), M, GEN])

Lemmatization

Using the parsed features above, we convert the form of each token into its lemma. This is useful for addressing inflections.

julia> lemmas = lemma.(feats)
827-element Vector{Union{Missing, String}}:
 "Hamod"
 "{ll~ah"
 ">anzala"
 "Eabod"
 "jaEala"
 "Eiwaj"
 missing
 ">an*ara"
 "l~adun"
 "bu\$~ira"
 ⋮
 "wa`Hid"
 "kaAna"
 "yarojuwA@"
 "rab~"
 missing
 "Eamila"
 ">a\$oraka"
 "EibaAdat"
 "rab~"
julia> forms1 = crpsnew[!, :form]
827-element Vector{String}:
 "Hamodu"
 "l~ahi"
 ">anzala"
 "Eabodi"
 "yajoEal"
 "EiwajaA"
 "l~i"
 "yun*ira"
 "l~aduno"
 "yuba\$~ira"
 ⋮
 "wa`HidN"
 "kaAna"
 "yarojuwA@"
 "rab~i"
 "lo"
 "yaEomalo"
 "yu\$oriko"
 "EibaAdapi"
 "rab~i"
julia> forms1[.!ismissing.(lemmas)] = lemmas[.!ismissing.(lemmas)]
795-element Vector{Union{Missing, String}}:
 "Hamod"
 "{ll~ah"
 ">anzala"
 "Eabod"
 "jaEala"
 "Eiwaj"
 ">an*ara"
 "l~adun"
 "bu\$~ira"
 "Eamila"
 ⋮
 "<ila`h"
 "wa`Hid"
 "kaAna"
 "yarojuwA@"
 "rab~"
 "Eamila"
 ">a\$oraka"
 "EibaAdat"
 "rab~"
Tips

We can also use the Root features instead, which is done by simply replacing lemma.(feats) with root.(feats).
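For example, a root-based variant of the lemma replacement above might look like the following sketch (untested; it assumes `root.(feats)` likewise returns `missing` for tokens without a root, such as prefixes):

```julia
roots = root.(feats)              # Root feature of each token, or missing
forms2 = copy(crpsnew[!, :form])  # start again from the original forms
forms2[.!ismissing.(roots)] = roots[.!ismissing.(roots)]
```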

We now put the new forms back into the corpus:

julia> crpsnew[!, :form] = forms1
827-element Vector{String}:
 "Hamod"
 "{ll~ah"
 ">anzala"
 "Eabod"
 "jaEala"
 "Eiwaj"
 "l~i"
 ">an*ara"
 "l~adun"
 "bu\$~ira"
 ⋮
 "wa`Hid"
 "kaAna"
 "yarojuwA@"
 "rab~"
 "lo"
 "Eamila"
 ">a\$oraka"
 "EibaAdat"
 "rab~"
julia> crpsnew = CorpusData(crpsnew)
Quranic Arabic Corpus (morphology)
(C) 2011 Kais Dukes

827×7 DataFrame
 Row │ chapter  verse  word   part   form       tag     features               ⋯
     │ Int64    Int64  Int64  Int64  String     String  String                 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │      18      1      1      2  Hamod      N       STEM|POS:N|LEM:Hamod|R ⋯
   2 │      18      1      2      2  {ll~ah     PN      STEM|POS:PN|LEM:{ll~ah
   3 │      18      1      4      1  >anzala    V       STEM|POS:V|PERF|(IV)|L
   4 │      18      1      6      1  Eabod      N       STEM|POS:N|LEM:Eabod|R
   5 │      18      1      9      1  jaEala     V       STEM|POS:V|IMPF|LEM:ja ⋯
   6 │      18      1     11      1  Eiwaj      N       STEM|POS:N|LEM:Eiwaj|R
   7 │      18      2      2      1  l~i        PRP     PREFIX|l:PRP+
   8 │      18      2      2      2  >an*ara    V       STEM|POS:V|IMPF|(IV)|L
  ⋮  │    ⋮       ⋮      ⋮      ⋮        ⋮        ⋮                ⋮           ⋱
 821 │      18    110     14      1  yarojuwA@  V       STEM|POS:V|IMPF|LEM:ya ⋯
 822 │      18    110     16      1  rab~       N       STEM|POS:N|LEM:rab~|RO
 823 │      18    110     17      2  lo         IMPV    PREFIX|l:IMPV+
 824 │      18    110     17      3  Eamila     V       STEM|POS:V|IMPF|LEM:Ea
 825 │      18    110     21      1  >a$oraka   V       STEM|POS:V|IMPF|(IV)|L ⋯
 826 │      18    110     22      2  EibaAdat   N       STEM|POS:N|LEM:EibaAda
 827 │      18    110     23      1  rab~       N       STEM|POS:N|LEM:rab~|RO
                                                  1 column and 812 rows omitted

Tokenization

We want to summarize the Qur'an at the verse level, so the tokens will be the verses of the corpus. We further clean these verses by dediacritization and normalization of the characters:

julia> lem_vrs = verses(crpsnew)
109-element Vector{String}:
 "Hamod {ll~ah >anzala Eabod jaEala Eiwaj"
 "l~i>an*ara l~adun bu\$~ira Eamila"
 ">an*ara qaAla {t~axa*a {ll~ah"
 "Eilom A^baA' kabura xaraja >afowa`h qaAla"
 "ba`xiE >avar 'aAmana Hadiyv"
 "jaEala >aroD libalawo >aHosan"
 "lajaAEil"
 "Hasiba kahof r~aqiym kaAna 'aAyap"
 ">awaY fitoyap kahof qaAla A^taY l~adun yuhay~i}o >amor"
 "Daraba >u*unN kahof"
 ⋮
 "Hasiba kafara {t~axa*a Eabod duwn >aEotadato jahan~am ka`firuwn"
 "qaAla nab~a>a >axosariyn"
 "Dal~a saEoy Hayaw`p d~unoyaA Hasiba >aHosana"
 "kafara 'aAyap rab~ liqaA^' HabiTa Eamal >aqaAma qiya`map"
 "jazaA^' jahan~am kafara {t~axa*a 'aAyap rasuwl"
 "'aAmana Eamila S~a`liHa`t kaAna jan~ap firodawos"
 "bagaY`"
 "qaAla kaAna baHor kalima`t rab~ lanafida baHor nafida kalima`t rab~ jaA^'a mivol"
 "qaAla ba\$ar mivol >awoHaY`^ <ila`h <ila`h wa`Hid kaAna yarojuwA@ rab~ loEamila >a\$oraka EibaAdat rab~"
julia> vrs = normalize.(dediac.(lem_vrs))
109-element Vector{String}:
 "Hmd Allh Anzl Ebd jEl Ewj"
 "lAn*r ldn b\$r Eml"
 "An*r qAl Atx* Allh"
 "Elm AbA' kbr xrj Afwh qAl"
 "bxE Avr 'Amn Hdyv"
 "jEl ArD lblw AHsn"
 "ljAEl"
 "Hsb khf rqym kAn 'Ayh"
 "Awy ftyh khf qAl Aty ldn yhyy Amr"
 "Drb A*n khf"
 ⋮
 "Hsb kfr Atx* Ebd dwn AEtdt jhnm kfrwn"
 "qAl nbA Axsryn"
 "Dl sEy Hywh dnyA Hsb AHsn"
 "kfr 'Ayh rb lqA' HbT Eml AqAm qymh"
 "jzA' jhnm kfr Atx* 'Ayh rswl"
 "'Amn Eml SlHt kAn jnh frdws"
 "bgy"
 "qAl kAn bHr klmt rb lnfd bHr nfd klmt rb jA' mvl"
 "qAl b\$r mvl AwHy Alh Alh wHd kAn yrjwA@ rb lEml A\$rk EbAdt rb"

Creating a TextAnalysis Corpus

To make use of TextAnalysis.jl's APIs, we need to encode the processed Quranic corpus as a TextAnalysis.jl Corpus. In this case, we create a StringDocument for each verse.

julia> crps1 = Corpus(StringDocument.(vrs))
A Corpus with 109 documents:
 * 109 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

We then update the lexicon and inverse index for efficient indexing of the corpus.

julia> update_lexicon!(crps1)
julia> update_inverse_index!(crps1)
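As a quick illustration of what these updates provide (a sketch based on TextAnalysis.jl's documented behavior, using the normalized token "rb" from the verses above): `lexicon` returns the token-to-count mapping, and indexing the corpus with a token uses the inverse index to find the documents containing it.

```julia
lexicon(crps1)  # Dict mapping each token to its corpus-wide count
crps1["rb"]     # indices of the verses containing the token "rb"
```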

Next, we create a Document-Term Matrix, whose rows correspond to the verses and whose columns are the words describing them.

julia> m1 = DocumentTermMatrix(crps1)
A 109 X 360 DocumentTermMatrix
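To see what a document-term matrix encodes, here is a minimal, self-contained sketch (plain Julia, independent of TextAnalysis.jl) that builds one for a hypothetical two-document corpus:

```julia
docs = [["cave", "light"], ["cave", "dark", "cave"]]
terms = sort(unique(vcat(docs...)))  # vocabulary: ["cave", "dark", "light"]
dtm_toy = [count(==(t), d) for d in docs, t in terms]
# 2×3 matrix: rows are documents, columns are term counts
# [1 0 1; 2 1 0]
```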

Latent Dirichlet Allocation

Finally, run LDA as follows:

julia> k = 3       # number of topics
3
julia> iter = 1000 # number of Gibbs sampling iterations
1000
julia> alpha = 0.1 # hyperparameter
0.1
julia> beta = 0.1  # hyperparameter
0.1
julia> ϕ, θ = lda(m1, k, iter, alpha, beta)
(⠚⠛⠓⠛⠚⠛⠛⠛⠛⠛⠛⠛⠛⠋⠛⠛⠛⠛⠚⠛⠛⠛⠚⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠓⠓⠚⠛⠛⠚⠊, [0.3333333333333333 0.0 … 1.0 0.7333333333333333; 0.5 1.0 … 0.0 0.26666666666666666; 0.16666666666666666 0.0 … 0.0 0.0])
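Here ϕ and θ are the two distributions LDA estimates: ϕ is the k × V topic-term matrix (one row per topic, one column per term, shown above in Julia's compact Braille rendering of a sparse matrix), and θ is the k × D topic-document matrix (one column per verse, giving its topic mixture). A quick way to confirm the orientation, assuming the sizes from this run:

```julia
size(ϕ)  # (3, 360) — 3 topics × 360 terms
size(θ)  # (3, 109) — 3 topics × 109 verses
```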

Extract the top terms for each topic:

julia> ntopics = 10
10
julia> cluster_topics = Matrix(undef, ntopics, k);
julia> for i = 1:k
           topics_idcs = sortperm(ϕ[i, :], rev=true)
           cluster_topics[:, i] = arabic.(m1.terms[topics_idcs][1:ntopics])
       end
julia> cluster_topics
10×3 Matrix{Any}:
 "ء"    "ذ"     "قال"
 "رب"   "ء"     "كان"
 "قال"  "قال"   "استطاع"
 "جعل"  "وجد"   "رب"
 "كان"  "الله"  "اراد"
 "ل"    "اتخ"   "اتبع"
 "امن"  "ر"     "لبث"
 "ارض"  "ا"     "علم"
 "اتي"  "رب"    "امر"
 "شي"   "شا"    "كلب"

Tabulating this properly gives the following:

Pkg.add("DataFrames")
Pkg.add("Latexify")
using DataFrames: DataFrame
using Latexify

mdtable(DataFrame(
    topic1 = cluster_topics[:, 1],
    topic2 = cluster_topics[:, 2],
    topic3 = cluster_topics[:, 3]
    ), latex=false)
| topic1 | topic2 | topic3 |
| ------ | ------ | ------- |
| ء      | ذ      | قال     |
| رب     | ء      | كان     |
| قال    | قال    | استطاع  |
| جعل    | وجد    | رب      |
| كان    | الله   | اراد    |
| ل      | اتخ    | اتبع    |
| امن    | ر      | لبث     |
| ارض    | ا      | علم     |
| اتي    | رب     | امر     |
| شي     | شا     | كلب     |

As you may have noticed, the results are not good, and this is mainly due to the data processing. Readers are encouraged to improve this for their use case. This section simply demonstrated how TextAnalysis.jl's LDA can be used for Topic Modeling of the QuranTree.jl corpus.

Finally, the following extracts the most likely topic for each verse:

julia> vrs_topics = []
Any[]
julia> for i = 1:dtm(m1).m
           push!(vrs_topics, sortperm(θ[:, i], rev=true)[1])
       end
julia> vrs_topics
109-element Vector{Any}:
 2
 2
 2
 2
 1
 1
 2
 2
 2
 2
 ⋮
 2
 3
 1
 2
 2
 1
 3
 1
 1
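Since only the largest entry of each column is needed, the loop above can be written more directly with `argmax`; this sketch is equivalent because `sortperm(v, rev=true)[1]` is exactly the index of the maximum of `v`:

```julia
# topic with the highest probability for each verse (column of θ)
vrs_topics = [argmax(θ[:, i]) for i in 1:size(θ, 2)]
```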