Text Summarization
This section demonstrates how to use TextAnalysis.jl (Julia's leading NLP library) with QuranTree.jl to summarize the Qur'an, specifically Chapter 18 (The Cave), which most Muslims are familiar with since it is the chapter recommended to be read every Friday. The summarization algorithm is TextRank, an application of the PageRank algorithm to text datasets.
julia> using QuranTree
julia> using TextAnalysis
julia> using Yunir
julia> crps, tnzl = QuranData() |> load;
julia> crpsdata = table(crps)
Quranic Arabic Corpus (morphology)
(C) 2011 Kais Dukes

128219×7 DataFrame
    Row │ chapter  verse  word   part   form        tag     features            ⋯
        │ Int64    Int64  Int64  Int64  String      String  String              ⋯
────────┼───────────────────────────────────────────────────────────────────────
      1 │       1      1      1      1  bi          P       PREFIX|bi+          ⋯
      2 │       1      1      1      2  somi        N       STEM|POS:N|LEM:{so
      3 │       1      1      2      1  {ll~ahi     PN      STEM|POS:PN|LEM:{l
      4 │       1      1      3      1  {l          DET     PREFIX|Al+
      5 │       1      1      3      2  r~aHoma`ni  ADJ     STEM|POS:ADJ|LEM:r  ⋯
      6 │       1      1      4      1  {l          DET     PREFIX|Al+
      7 │       1      1      4      2  r~aHiymi    ADJ     STEM|POS:ADJ|LEM:r
      8 │       1      2      1      1  {lo         DET     PREFIX|Al+
   ⋮    │    ⋮       ⋮      ⋮      ⋮        ⋮         ⋮             ⋮           ⋱
 128213 │     114      5      5      2  n~aAsi      N       STEM|POS:N|LEM:n~a  ⋯
 128214 │     114      6      1      1  mina        P       STEM|POS:P|LEM:min
 128215 │     114      6      2      1  {lo         DET     PREFIX|Al+
 128216 │     114      6      2      2  jin~api     N       STEM|POS:N|LEM:jin
 128217 │     114      6      3      1  wa          CONJ    PREFIX|w:CONJ+      ⋯
 128218 │     114      6      3      2  {l          DET     PREFIX|Al+
 128219 │     114      6      3      3  n~aAsi      N       STEM|POS:N|LEM:n~a
                                                1 column and 128204 rows omitted
You need to install Yunir.jl and TextAnalysis.jl to run the code successfully:
using Pkg
Pkg.add("Yunir")
Pkg.add("TextAnalysis")
Data Preprocessing
The first preprocessing step removes all Disconnected Letters (like الٓمٓ and الٓمٓصٓ, among others), Prepositions, Particles, Conjunctions, Pronouns, and Adverbs. This is done as follows:
julia> function preprocess(s::String)
           feat = parse(QuranFeatures, s)
           disletters = isfeat(feat, AbstractDisLetters)
           prepositions = isfeat(feat, AbstractPreposition)
           particles = isfeat(feat, AbstractParticle)
           conjunctions = isfeat(feat, AbstractConjunction)
           pronouns = isfeat(feat, AbstractPronoun)
           adverbs = isfeat(feat, AbstractAdverb)

           return !disletters && !prepositions && !particles && !conjunctions && !pronouns && !adverbs
       end
preprocess (generic function with 1 method)
julia> crpstbl = filter(t -> preprocess(t.features), crpsdata[18].data)
827×7 DataFrame
 Row │ chapter  verse  word   part   form       tag     features                ⋯
     │ Int64    Int64  Int64  Int64  String     String  String                  ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │      18      1      1      2  Hamodu     N       STEM|POS:N|LEM:Hamod|R  ⋯
   2 │      18      1      2      2  l~ahi      PN      STEM|POS:PN|LEM:{ll~ah
   3 │      18      1      4      1  >anzala    V       STEM|POS:V|PERF|(IV)|L
   4 │      18      1      6      1  Eabodi     N       STEM|POS:N|LEM:Eabod|R
   5 │      18      1      9      1  yajoEal    V       STEM|POS:V|IMPF|LEM:ja  ⋯
   6 │      18      1     11      1  EiwajaA    N       STEM|POS:N|LEM:Eiwaj|R
   7 │      18      2      2      1  l~i        PRP     PREFIX|l:PRP+
   8 │      18      2      2      2  yun*ira    V       STEM|POS:V|IMPF|(IV)|L
  ⋮  │    ⋮       ⋮      ⋮      ⋮        ⋮        ⋮               ⋮             ⋱
 821 │      18    110     14      1  yarojuwA@  V       STEM|POS:V|IMPF|LEM:ya  ⋯
 822 │      18    110     16      1  rab~i      N       STEM|POS:N|LEM:rab~|RO
 823 │      18    110     17      2  lo         IMPV    PREFIX|l:IMPV+
 824 │      18    110     17      3  yaEomalo   V       STEM|POS:V|IMPF|LEM:Ea
 825 │      18    110     21      1  yu$oriko   V       STEM|POS:V|IMPF|(IV)|L  ⋯
 826 │      18    110     22      2  EibaAdapi  N       STEM|POS:N|LEM:EibaAda
 827 │      18    110     23      1  rab~i      N       STEM|POS:N|LEM:rab~|RO
                                                   1 column and 812 rows omitted
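To make the idea of filtering by part of speech concrete, here is a minimal sketch that does not call QuranTree.jl at all. It assumes the tag can be read from a `POS:` field in a `|`-separated feature string like those in the `features` column. The names `stop_tags`, `pos_tag`, and `keep` are hypothetical helpers for illustration, and the stop list is an assumed subset of tags. Unlike `isfeat`, this sketch cannot classify prefix features, so it simply keeps any row that has no `POS:` field.

```julia
# Tags treated as stop categories in this sketch (an assumed, partial list).
stop_tags = Set(["P", "CONJ", "PRON", "INL"])

# Return the value of the `POS:` field, or `nothing` if there is none.
function pos_tag(features::AbstractString)
    for field in split(features, '|')
        startswith(field, "POS:") && return String(field[5:end])
    end
    return nothing
end

# Keep a row unless its POS tag is in the stop list.
function keep(features::AbstractString)
    tag = pos_tag(features)
    return tag === nothing || !(tag in stop_tags)
end

rows = [
    "STEM|POS:N|LEM:Hamod|ROOT:Hmd|M|NOM",
    "STEM|POS:P|LEM:min",
    "STEM|POS:V|PERF|(IV)|LEM:>anzala|ROOT:nzl|3MS",
    "PREFIX|l:PRP+",
]
kept = filter(keep, rows)
```

In this toy input only the preposition row is dropped. The real `preprocess` function works on parsed features rather than raw strings, so its results can differ from this string-matching approximation.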
Next, we create a copy of the above data, so that the original state is preserved, and use the copy for further processing.
julia> crpsnew = deepcopy(crpstbl)
827×7 DataFrame
 Row │ chapter  verse  word   part   form       tag     features                ⋯
     │ Int64    Int64  Int64  Int64  String     String  String                  ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │      18      1      1      2  Hamodu     N       STEM|POS:N|LEM:Hamod|R  ⋯
   2 │      18      1      2      2  l~ahi      PN      STEM|POS:PN|LEM:{ll~ah
   3 │      18      1      4      1  >anzala    V       STEM|POS:V|PERF|(IV)|L
   4 │      18      1      6      1  Eabodi     N       STEM|POS:N|LEM:Eabod|R
   5 │      18      1      9      1  yajoEal    V       STEM|POS:V|IMPF|LEM:ja  ⋯
   6 │      18      1     11      1  EiwajaA    N       STEM|POS:N|LEM:Eiwaj|R
   7 │      18      2      2      1  l~i        PRP     PREFIX|l:PRP+
   8 │      18      2      2      2  yun*ira    V       STEM|POS:V|IMPF|(IV)|L
  ⋮  │    ⋮       ⋮      ⋮      ⋮        ⋮        ⋮               ⋮             ⋱
 821 │      18    110     14      1  yarojuwA@  V       STEM|POS:V|IMPF|LEM:ya  ⋯
 822 │      18    110     16      1  rab~i      N       STEM|POS:N|LEM:rab~|RO
 823 │      18    110     17      2  lo         IMPV    PREFIX|l:IMPV+
 824 │      18    110     17      3  yaEomalo   V       STEM|POS:V|IMPF|LEM:Ea
 825 │      18    110     21      1  yu$oriko   V       STEM|POS:V|IMPF|(IV)|L  ⋯
 826 │      18    110     22      2  EibaAdapi  N       STEM|POS:N|LEM:EibaAda
 827 │      18    110     23      1  rab~i      N       STEM|POS:N|LEM:rab~|RO
                                                   1 column and 812 rows omitted
julia> feats = crpsnew[!, :features]
827-element Vector{String}:
 "STEM|POS:N|LEM:Hamod|ROOT:Hmd|M|NOM"
 "STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN"
 "STEM|POS:V|PERF|(IV)|LEM:>anzala|ROOT:nzl|3MS"
 "STEM|POS:N|LEM:Eabod|ROOT:Ebd|M|GEN"
 "STEM|POS:V|IMPF|LEM:jaEala|ROOT:jEl|3MS|MOOD:JUS"
 "STEM|POS:N|LEM:Eiwaj|ROOT:Ewj|M|NOM"
 "PREFIX|l:PRP+"
 "STEM|POS:V|IMPF|(IV)|LEM:>an*ara|ROOT:n*r|3MS|MOOD:SUBJ"
 "STEM|POS:N|LEM:l~adun|ROOT:ldn|GEN"
 "STEM|POS:V|IMPF|(II)|LEM:bu\$~ira|ROOT:b\$r|3MS|MOOD:SUBJ"
 ⋮
 "STEM|POS:ADJ|LEM:wa`Hid|ROOT:wHd|MS|INDEF|NOM"
 "STEM|POS:V|PERF|LEM:kaAna|ROOT:kwn|SP:kaAn|3MS"
 "STEM|POS:V|IMPF|LEM:yarojuwA@|ROOT:rjw|3MS"
 "STEM|POS:N|LEM:rab~|ROOT:rbb|M|GEN"
 "PREFIX|l:IMPV+"
 "STEM|POS:V|IMPF|LEM:Eamila|ROOT:Eml|3MS|MOOD:JUS"
 "STEM|POS:V|IMPF|(IV)|LEM:>a\$oraka|ROOT:\$rk|3MS|MOOD:JUS"
 "STEM|POS:N|LEM:EibaAdat|ROOT:Ebd|F|GEN"
 "STEM|POS:N|LEM:rab~|ROOT:rbb|M|GEN"
julia> feats = parse.(QuranFeatures, feats)
827-element Vector{AbstractQuranFeature}:
 Stem(:N, N, AbstractQuranFeature[Lemma("Hamod"), Root("Hmd"), M, NOM])
 Stem(:PN, PN, AbstractQuranFeature[Lemma("{ll~ah"), Root("Alh"), GEN])
 Stem(:V, V, AbstractQuranFeature[Lemma(">anzala"), Root("nzl"), PERF, IV, 3, M, S, IND, ACT])
 Stem(:N, N, AbstractQuranFeature[Lemma("Eabod"), Root("Ebd"), M, GEN])
 Stem(:V, V, AbstractQuranFeature[Lemma("jaEala"), Root("jEl"), JUS, IMPF, 3, M, S, ACT, I])
 Stem(:N, N, AbstractQuranFeature[Lemma("Eiwaj"), Root("Ewj"), M, NOM])
 Prefix(Symbol("l:PRP+"), PRP)
 Stem(:V, V, AbstractQuranFeature[Lemma(">an*ara"), Root("n*r"), SUBJ, IMPF, IV, 3, M, S, ACT])
 Stem(:N, N, AbstractQuranFeature[Lemma("l~adun"), Root("ldn"), GEN])
 Stem(:V, V, AbstractQuranFeature[Lemma("bu\$~ira"), Root("b\$r"), SUBJ, IMPF, II, 3, M, S, ACT])
 ⋮
 Stem(:ADJ, ADJ, AbstractQuranFeature[Lemma("wa`Hid"), Root("wHd"), M, S, INDEF, NOM])
 Stem(:V, V, AbstractQuranFeature[Lemma("kaAna"), Root("kwn"), Special("kaAn"), PERF, 3, M, S, IND, ACT, I])
 Stem(:V, V, AbstractQuranFeature[Lemma("yarojuwA@"), Root("rjw"), IMPF, 3, M, S, IND, ACT, I])
 Stem(:N, N, AbstractQuranFeature[Lemma("rab~"), Root("rbb"), M, GEN])
 Prefix(Symbol("l:IMPV+"), IMPV)
 Stem(:V, V, AbstractQuranFeature[Lemma("Eamila"), Root("Eml"), JUS, IMPF, 3, M, S, ACT, I])
 Stem(:V, V, AbstractQuranFeature[Lemma(">a\$oraka"), Root("\$rk"), JUS, IMPF, IV, 3, M, S, ACT])
 Stem(:N, N, AbstractQuranFeature[Lemma("EibaAdat"), Root("Ebd"), F, GEN])
 Stem(:N, N, AbstractQuranFeature[Lemma("rab~"), Root("rbb"), M, GEN])
Lemmatization
Using the parsed features above, we then convert the form of each token into its lemma. This helps address variation due to inflection.
julia> lemmas = lemma.(feats)
827-element Vector{Union{Missing, String}}:
 "Hamod"
 "{ll~ah"
 ">anzala"
 "Eabod"
 "jaEala"
 "Eiwaj"
 missing
 ">an*ara"
 "l~adun"
 "bu\$~ira"
 ⋮
 "wa`Hid"
 "kaAna"
 "yarojuwA@"
 "rab~"
 missing
 "Eamila"
 ">a\$oraka"
 "EibaAdat"
 "rab~"
julia> forms1 = crpsnew[!, :form]
827-element Vector{String}:
 "Hamodu"
 "l~ahi"
 ">anzala"
 "Eabodi"
 "yajoEal"
 "EiwajaA"
 "l~i"
 "yun*ira"
 "l~aduno"
 "yuba\$~ira"
 ⋮
 "wa`HidN"
 "kaAna"
 "yarojuwA@"
 "rab~i"
 "lo"
 "yaEomalo"
 "yu\$oriko"
 "EibaAdapi"
 "rab~i"
julia> forms1[.!ismissing.(lemmas)] = lemmas[.!ismissing.(lemmas)]
795-element Vector{Union{Missing, String}}:
 "Hamod"
 "{ll~ah"
 ">anzala"
 "Eabod"
 "jaEala"
 "Eiwaj"
 ">an*ara"
 "l~adun"
 "bu\$~ira"
 "Eamila"
 ⋮
 "<ila`h"
 "wa`Hid"
 "kaAna"
 "yarojuwA@"
 "rab~"
 "Eamila"
 ">a\$oraka"
 "EibaAdat"
 "rab~"
We can also use the Root features instead, which is done by simply replacing lemma.(feats) with root.(feats).
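The masked assignment used above can be seen in isolation with a toy example. This is only a sketch on made-up three-element vectors; `has_lemma` and `merged` are names introduced here for illustration. It shows that positions whose lemma is `missing` keep their original form.

```julia
forms  = ["Hamodu", "l~i", "yajoEal"]
lemmas = Union{Missing, String}["Hamod", missing, "jaEala"]

# Non-mutating alternative: take the lemma where present, else the form.
merged = coalesce.(lemmas, forms)

# Boolean mask of positions that have a lemma.
has_lemma = .!ismissing.(lemmas)

# Overwrite only those positions; tokens without a lemma keep their form.
forms[has_lemma] = lemmas[has_lemma]
```

Both approaches give the same vector here. The masked assignment mutates `forms` in place, while `coalesce.` builds a new one.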
We now put the new forms back into the corpus:
julia> crpsnew[!, :form] = forms1
827-element Vector{String}:
 "Hamod"
 "{ll~ah"
 ">anzala"
 "Eabod"
 "jaEala"
 "Eiwaj"
 "l~i"
 ">an*ara"
 "l~adun"
 "bu\$~ira"
 ⋮
 "wa`Hid"
 "kaAna"
 "yarojuwA@"
 "rab~"
 "lo"
 "Eamila"
 ">a\$oraka"
 "EibaAdat"
 "rab~"
julia> crpsnew = CorpusData(crpsnew)
Quranic Arabic Corpus (morphology)
(C) 2011 Kais Dukes

827×7 DataFrame
 Row │ chapter  verse  word   part   form       tag     features                ⋯
     │ Int64    Int64  Int64  Int64  String     String  String                  ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │      18      1      1      2  Hamod      N       STEM|POS:N|LEM:Hamod|R  ⋯
   2 │      18      1      2      2  {ll~ah     PN      STEM|POS:PN|LEM:{ll~ah
   3 │      18      1      4      1  >anzala    V       STEM|POS:V|PERF|(IV)|L
   4 │      18      1      6      1  Eabod      N       STEM|POS:N|LEM:Eabod|R
   5 │      18      1      9      1  jaEala     V       STEM|POS:V|IMPF|LEM:ja  ⋯
   6 │      18      1     11      1  Eiwaj      N       STEM|POS:N|LEM:Eiwaj|R
   7 │      18      2      2      1  l~i        PRP     PREFIX|l:PRP+
   8 │      18      2      2      2  >an*ara    V       STEM|POS:V|IMPF|(IV)|L
  ⋮  │    ⋮       ⋮      ⋮      ⋮        ⋮        ⋮               ⋮             ⋱
 821 │      18    110     14      1  yarojuwA@  V       STEM|POS:V|IMPF|LEM:ya  ⋯
 822 │      18    110     16      1  rab~       N       STEM|POS:N|LEM:rab~|RO
 823 │      18    110     17      2  lo         IMPV    PREFIX|l:IMPV+
 824 │      18    110     17      3  Eamila     V       STEM|POS:V|IMPF|LEM:Ea
 825 │      18    110     21      1  >a$oraka   V       STEM|POS:V|IMPF|(IV)|L  ⋯
 826 │      18    110     22      2  EibaAdat   N       STEM|POS:N|LEM:EibaAda
 827 │      18    110     23      1  rab~       N       STEM|POS:N|LEM:rab~|RO
                                                   1 column and 812 rows omitted
Tokenization
We want to summarize the Qur'an at the verse level, so the tokens are the verses of the corpus. We further clean these verses by dediacritizing and normalizing the characters:
julia> lem_vrs = verses(crpsnew)
109-element Vector{String}:
 "Hamod {ll~ah >anzala Eabod jaEala Eiwaj"
 "l~i>an*ara l~adun bu\$~ira Eamila"
 ">an*ara qaAla {t~axa*a {ll~ah"
 "Eilom A^baA' kabura xaraja >afowa`h qaAla"
 "ba`xiE >avar 'aAmana Hadiyv"
 "jaEala >aroD libalawo >aHosan"
 "lajaAEil"
 "Hasiba kahof r~aqiym kaAna 'aAyap"
 ">awaY fitoyap kahof qaAla A^taY l~adun yuhay~i}o >amor"
 "Daraba >u*unN kahof"
 ⋮
 "Hasiba kafara {t~axa*a Eabod duwn >aEotadato jahan~am ka`firuwn"
 "qaAla nab~a>a >axosariyn"
 "Dal~a saEoy Hayaw`p d~unoyaA Hasiba >aHosana"
 "kafara 'aAyap rab~ liqaA^' HabiTa Eamal >aqaAma qiya`map"
 "jazaA^' jahan~am kafara {t~axa*a 'aAyap rasuwl"
 "'aAmana Eamila S~a`liHa`t kaAna jan~ap firodawos"
 "bagaY`"
 "qaAla kaAna baHor kalima`t rab~ lanafida baHor nafida kalima`t rab~ jaA^'a mivol"
 "qaAla ba\$ar mivol >awoHaY`^ <ila`h <ila`h wa`Hid kaAna yarojuwA@ rab~ loEamila >a\$oraka EibaAdat rab~"
julia> vrs = normalize.(dediac.(lem_vrs))
109-element Vector{String}:
 "Hmd Allh Anzl Ebd jEl Ewj"
 "lAn*r ldn b\$r Eml"
 "An*r qAl Atx* Allh"
 "Elm AbA' kbr xrj Afwh qAl"
 "bxE Avr 'Amn Hdyv"
 "jEl ArD lblw AHsn"
 "ljAEl"
 "Hsb khf rqym kAn 'Ayh"
 "Awy ftyh khf qAl Aty ldn yhyy Amr"
 "Drb A*n khf"
 ⋮
 "Hsb kfr Atx* Ebd dwn AEtdt jhnm kfrwn"
 "qAl nbA Axsryn"
 "Dl sEy Hywh dnyA Hsb AHsn"
 "kfr 'Ayh rb lqA' HbT Eml AqAm qymh"
 "jzA' jhnm kfr Atx* 'Ayh rswl"
 "'Amn Eml SlHt kAn jnh frdws"
 "bgy"
 "qAl kAn bHr klmt rb lnfd bHr nfd klmt rb jA' mvl"
 "qAl b\$r mvl AwHy Alh Alh wHd kAn yrjwA@ rb lEml A\$rk EbAdt rb"
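As a rough illustration of what these two cleaning steps do to Buckwalter-encoded text, the following sketch strips a set of diacritic characters and then folds a few letter variants onto a base letter. It is not Yunir.jl's implementation: the `diacritics` set, the `variants` mapping, and the helper names are assumptions chosen so that the toy inputs reproduce some of the outputs shown above.

```julia
# Characters treated as diacritics in this simplified sketch (assumed set).
diacritics = Set(['a', 'u', 'i', 'o', '~', 'F', 'N', 'K', '`', '^'])

# Letter variants folded onto a base letter in this sketch (assumed mapping).
variants = Dict('{' => 'A', '>' => 'A', '<' => 'A', '|' => 'A',
                'p' => 'h', 'Y' => 'y', '}' => 'y')

# Drop every character that is in the diacritic set.
strip_diacritics(s) = filter(c -> !(c in diacritics), s)

# Replace each variant character by its base letter; leave others as-is.
fold_variants(s) = map(c -> get(variants, c, c), s)

clean(s) = fold_variants(strip_diacritics(s))

cleaned = clean.(["Hamod", "{ll~ah", ">anzala", "Hayaw`p"])
```

Collapsing spelling variants this way reduces the vocabulary, so verses that use the same word in slightly different spellings share a column in the feature matrix built later.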
Creating a TextAnalysis Corpus
To make use of TextAnalysis.jl's APIs, we need to encode the processed Quranic corpus as a TextAnalysis.jl Corpus. In this case, we create a StringDocument for each verse.
julia> crps1 = Corpus(StringDocument.(vrs))
A Corpus with 109 documents:
 * 109 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
We then update the lexicon and inverse index for efficient indexing of the corpus.
julia> update_lexicon!(crps1)
julia> update_inverse_index!(crps1)
Next, we create a Document Term Matrix, which will have rows of verses and columns of words describing the verses.
julia> m1 = DocumentTermMatrix(crps1)
A 109 X 360 DocumentTermMatrix
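To show what a document-term matrix contains, here is a minimal sketch built with Base Julia only, on three made-up documents. It is not how TextAnalysis.jl constructs its `DocumentTermMatrix`; the names `docs`, `vocab`, `col`, and `counts` are introduced here for illustration.

```julia
docs = ["Hmd Allh Anzl", "Allh qAl", "qAl qAl Hmd"]

# Sorted vocabulary over all whitespace-separated tokens.
vocab = sort(unique(vcat(split.(docs)...)))

# Column index of each term.
col = Dict(term => j for (j, term) in enumerate(vocab))

# counts[i, j] = number of times term j occurs in document i.
counts = zeros(Int, length(docs), length(vocab))
for (i, doc) in enumerate(docs), term in split(doc)
    counts[i, col[term]] += 1
end
```

Each row describes one document and each column one term, which matches the 109 verses by 360 terms layout reported above.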
TF-IDF
Finally, we compute the corresponding TF-IDF, which will serve as the feature matrix.
julia> tfidf = tf_idf(m1)
109×360 SparseArrays.SparseMatrixCSC{Float64, Int64} with 836 stored entries:
⡀⣇⡀⢖⢐⣹⢑⠀⣵⡢⡙⠈⢺⠀⡈⢔⣀⡠⣀⠃⢲⣀⠵⡇⡀⡀⢠⠀⣧⣀⣡⡂⢀⠀⣀⣀⠐⠀⣄⠠
⠮⠷⢯⠔⠡⡫⠰⠦⡔⠫⠭⠦⠮⡤⣅⠉⢜⠠⠹⢤⡔⣿⠹⢅⠕⠍⠈⣄⡏⠁⢹⠨⠑⡤⣋⡕⢋⢥⠄⠁
⢑⡟⠐⢇⠄⢳⣺⢁⣅⣒⠠⠧⠐⢒⢤⡐⡾⠔⢠⡬⣖⢠⠀⠀⡳⢠⣄⡲⡇⡂⡽⠂⠸⡔⠀⢦⠶⠂⠄⢰
⠴⡇⠡⠈⠈⢅⠂⡆⠖⠁⠶⠂⠀⠄⠊⠛⠋⠘⢉⠃⠂⢛⠁⢒⠈⠁⠨⢀⡏⢁⢋⠠⢂⠀⠀⠹⢀⠨⢁⡁
⡝⠇⢈⠐⠃⠸⠙⠋⡏⠠⠨⠠⠠⠄⡠⢹⢑⠁⠆⡂⡓⠂⠁⠁⠀⠚⠁⠈⠇⢎⠰⠘⡈⠘⠂⢪⠂⠑⡘⠂
⠀⣏⠂⠄⠈⡠⠌⠃⢇⠤⡢⠲⠱⠼⢑⡀⠔⠔⢈⠱⡗⡂⢀⢠⠀⠉⡀⢸⡅⠩⢩⠄⠅⠀⠐⢈⠀⠁⢐⠀
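For intuition about what the weights mean, the sketch below computes TF-IDF by hand on a toy count matrix, using the textbook form: term frequency times log(N / document frequency). This is an assumption for illustration only. TextAnalysis.jl's `tf_idf` may scale or smooth the weights differently, so the numbers are not expected to match its output.

```julia
# Toy count matrix: 3 documents (rows) by 4 terms (columns).
counts = [1 1 1 0;
          1 0 0 1;
          0 0 1 2]

n_docs = size(counts, 1)

# Term frequency: counts divided by each document's length.
tf = counts ./ sum(counts, dims=2)

# Document frequency: number of documents containing each term.
df = vec(sum(counts .> 0, dims=1))

# Inverse document frequency, textbook form log(N / df).
idf = log.(n_docs ./ df)

# Scale every column of tf by the idf of its term.
weights = tf .* idf'
```

Under this formula, a term that occurs in only one document (column 2 here) gets the largest idf, and a term that occurs in every document would get a weight of zero.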
Summarizing the Qur'an
Multiplying the TF-IDF matrix by its transpose gives a square matrix whose elements describe the linkage, or similarity, between verses.
julia> sim_mat = tfidf * tfidf'
109×109 SparseArrays.SparseMatrixCSC{Float64, Int64} with 5199 stored entries:
⠻⣦⢡⠖⣴⢻⡆⢷⡇⡗⡖⠱⠆⢿⣴⢻⢰⡷⣼⣶⡗⠗⢶⠷⢾⡶⢶⣶⠆⠰⡴⢶⠁⠸⠿⡇⡞⢶⣶⡶
⢡⠖⢱⣶⠾⣶⡆⡴⣶⣷⢇⡱⠄⢴⠶⢚⢐⣷⢻⢲⣃⢅⢴⠤⠴⠦⡴⢶⡖⢰⢧⡴⠁⠨⠭⠇⣒⠜⡒⠶
⣴⣛⢺⣧⣿⣿⡇⣿⣧⡷⡧⠰⠺⣿⣿⣾⢟⡿⣛⣻⣿⠢⢾⡔⢺⡓⢞⣻⠆⠿⣷⢿⡆⢰⠷⡇⡷⢺⣟⡿
⢬⣍⢈⡭⣭⣭⡕⣭⡍⡭⡭⢨⣭⣭⣭⢬⣍⣭⣩⣩⣭⡭⣭⣨⣭⣭⣭⣭⡅⣭⣩⣭⠀⢨⣽⡅⡭⣭⣍⣭
⢭⠭⢼⣿⢭⡿⡇⡭⡵⢏⢍⢨⠭⠭⠭⣭⡭⢭⣭⣭⣭⢍⢭⢬⡭⡭⡭⢭⣅⣭⢭⣭⡄⡨⠭⡅⣭⢭⣭⣭
⢜⡉⢍⡱⢉⡋⡃⣋⡃⣑⣱⣾⡉⣉⣏⣸⠉⣝⣩⣙⠟⡕⣙⣈⣉⣉⣋⣙⣃⢙⣙⣋⠌⢨⣭⡅⡛⡉⢻⣉
⣬⣅⢀⣅⣾⣦⡇⣿⡇⡇⡇⢨⡿⣯⣾⣤⢆⣿⣹⣢⡶⡅⣯⢨⣭⣯⣽⣭⡅⢾⣨⣯⠀⢨⣯⡇⡠⣵⣄⣿
⣴⣛⣸⢃⣻⣿⡃⣟⡇⣧⣋⣹⠚⣿⣛⣼⣛⣻⣟⣟⣻⢂⢻⠑⢺⡓⣚⣛⣦⢻⣓⣛⠔⢰⡛⡇⣧⢚⣻⣿
⢴⡶⢴⣴⣿⡵⡇⣽⡇⣏⣇⢤⣬⣵⣿⣸⡵⣯⣶⣶⣯⣄⣴⢤⣴⣦⣴⣶⣄⣭⢦⣵⡄⢠⣥⡇⣆⣼⣾⣯
⢲⣿⢻⣒⣿⣸⡇⣺⡇⣿⣇⢺⠳⣺⣿⢽⢸⣿⣿⣿⣗⢂⣾⡒⢺⡗⣺⣿⡯⢸⣗⣺⠂⠘⢒⡇⣇⢺⣿⡿
⢽⠍⠍⢜⠻⡛⡇⡿⡇⢟⢟⠥⠜⠯⠻⢚⠋⢿⠹⢙⣟⣽⢼⠢⠼⠥⡯⣽⡇⢻⢹⡯⡇⢸⠯⠇⣛⠜⡋⡿
⢼⡗⠐⡗⢚⠷⡃⣻⡃⣗⡓⢸⡋⣛⢟⠒⠐⣟⢺⠻⠲⡓⣟⣽⣻⣛⣛⣿⡓⢘⢿⣛⡄⣸⣻⡃⠲⡓⠒⣛
⢺⡷⠰⡇⢾⠲⡇⣿⡇⡯⡇⢸⡧⣿⢾⠲⠰⣿⢾⠶⠖⡇⣿⢺⣿⣿⣿⣿⡇⢸⢾⣿⠎⢼⣿⡇⡇⡷⠶⣿
⢸⣷⢰⣏⣾⣱⡇⣿⡇⣏⣏⢸⡗⣿⣾⢸⢰⣿⣾⣾⣏⣯⣿⣼⣿⣿⣿⣿⡏⣸⣾⣿⡄⢸⣿⡇⣏⣷⣷⣿
⢈⡁⢘⣉⣬⡅⡅⣭⡅⣽⣍⢘⣡⣍⣬⣛⡄⣽⣋⣋⣭⣉⣙⢈⣉⣉⣋⣩⣟⣽⢈⣍⡀⢈⣍⡇⣋⣩⣙⣿
⢰⣏⢉⡷⣽⣟⡇⣾⡇⣷⡷⢸⡦⣾⣽⢸⢌⣷⣹⣹⡷⡶⣿⢳⣾⣷⣾⣿⡆⢴⣱⣾⠂⢺⣿⡇⡷⣮⣏⣷
⣁⡀⡁⡀⢈⣉⡀⣀⡀⡩⡂⣁⡀⣀⢐⣁⠀⣉⣈⠀⣉⣉⣀⣩⣊⣅⣀⣉⡀⢈⣨⣀⡕⢍⣉⡁⢈⡀⠀⣀
⠿⠧⠧⠇⠽⠧⠗⠿⠇⠧⠇⠿⠯⠿⠿⠬⠥⠿⠼⠴⠯⠇⠿⠺⠿⠿⠿⠿⠧⠽⠿⠿⠇⠸⢿⢗⠆⠯⠤⠿
⢺⣍⣘⠜⣹⣋⡇⣯⡇⣟⡟⠨⢄⣮⣩⢛⣈⣽⣩⣙⣛⠜⢼⠢⢭⡭⢯⣽⡏⣸⡹⣯⠂⠰⡬⡅⣛⢜⣋⣭
⢸⡿⢸⡌⣿⡽⡇⣽⡇⣿⡟⢲⣤⣽⣿⣾⡾⣿⣿⡿⣯⡬⣼⢠⣼⣧⣽⣿⣷⣼⢯⣽⠀⢠⣤⡇⡏⣼⢿⣷
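The meaning of each entry in such a product can be checked on a small dense example. The sketch below uses a made-up 3×4 feature matrix `X` (not the real TF-IDF values) and shows that entry (i, j) of `X * X'` is the dot product of rows i and j.

```julia
using LinearAlgebra

# Toy feature matrix: 3 documents (rows) by 4 terms (columns).
X = [0.2 0.5 0.0 0.0;
     0.0 0.5 0.3 0.0;
     0.0 0.0 0.0 0.7]

# S[i, j] is the dot product of the feature rows of documents i and j.
S = X * X'
```

Documents 1 and 2 share the second term, so `S[1, 2]` is positive, while documents 1 and 3 share no term and `S[1, 3]` is zero. The product is not length-normalized, so longer feature rows tend to produce larger entries.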
At this point, we can now write the code for the PageRank algorithm:
julia> using LinearAlgebra
WARNING: using LinearAlgebra.normalize in module Main conflicts with an existing identifier.
julia> function pagerank(A; Niter=20, damping=.15)
           Nmax = size(A, 1)
           r = rand(1, Nmax);            # Generate a random starting rank.
           r = r ./ norm(r, 1);          # Normalize so the scores sum to 1.
           a = (1 - damping) ./ Nmax;    # Uniform term added to every score.

           for i=1:Niter
               s = r * A
               rmul!(s, damping)         # `damping` scales the link term, so the
                                         # default 0.15 weights the uniform term by 0.85.
               r = s .+ (a * sum(r, dims=2));   # Compute PageRank.
           end

           r = r ./ norm(r, 1);

           return r
       end
pagerank (generic function with 1 method)
We apply this function to the similarity matrix above (sim_mat) and extract the PageRank scores for all verses. These scores serve as weights: a higher score suggests that the verse has many connections to other verses in the corpus, and is therefore more representative of it.
julia> p = pagerank(sim_mat)
1×109 Matrix{Float64}:
 0.00258995  0.00258329  0.00269773  …  0.00630215  0.00256489  0.00239992
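For comparison, here is a sketch of a more conventional PageRank variant on a tiny graph. It is not the function defined above: it row-normalizes the weight matrix so that each row sums to 1, starts from a uniform vector so the result is reproducible, and uses a hypothetical keyword `d` with the customary value 0.85 as the weight of the link term. It assumes every row of `W` has a positive sum.

```julia
# Power iteration on a row-normalized weight matrix W.
# Assumes every row of W has a positive sum.
function pagerank_normalized(W; niter=200, d=0.85)
    n = size(W, 1)
    P = W ./ sum(W, dims=2)      # each row of P sums to 1
    r = fill(1 / n, 1, n)        # deterministic uniform start
    for _ in 1:niter
        # Link term weighted by d, uniform term weighted by 1 - d.
        r = d .* (r * P) .+ (1 - d) / n
    end
    return vec(r)
end

# Star graph: node 1 is linked to nodes 2 and 3.
W = [0.0 1.0 1.0;
     1.0 0.0 0.0;
     1.0 0.0 0.0]

scores = pagerank_normalized(W)
```

Because `P` is row-stochastic, the scores keep summing to 1 at every iteration, and the hub node 1 receives the highest score. Applying a variant like this to `sim_mat` would generally produce a different ranking from the one shown in this section.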
We now sort these scores in descending order and keep the indices of the 10 highest-scoring verses:
julia> idx = sortperm(vec(p), rev=true)[1:10]
10-element Vector{Int64}:
  84
  88
  91
  65
  69
   7
  27
 107
  90
  67
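The selection step can be tried on a small made-up score vector. The sketch below shows the same `sortperm` pattern with `k = 2`, alongside Base's `partialsortperm`, which orders only the leading `k` positions. The names `top_full` and `top_partial` are introduced here for illustration.

```julia
scores = [0.10, 0.40, 0.20, 0.30]
k = 2

# Full sort of the indices, then keep the first k.
top_full = sortperm(scores, rev=true)[1:k]

# Partial sort: only the k leading positions are ordered.
top_partial = collect(partialsortperm(scores, 1:k, rev=true))
```

Both give the positions of the two largest scores. For 109 verses the difference in cost is negligible, so the full sort used above is a reasonable choice.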
Finally, the following 10 verses best summarize the corpus (Chapter 18) according to TextRank:
julia> verse_nos = verses(CorpusData(crpstbl), number=true, start_end=false)
1-element Vector{Tuple{Vector{Int64}, Vector{Int64}}}:
 ([18], [1, 2, 4, 5, 6, 7, 8, 9, 10, 11 … 101, 102, 103, 104, 105, 106, 107, 108, 109, 110])
julia> verse_out = String[];
julia> chapter = Int64[];
julia> verse = Int64[];
julia> for v in verse_nos
           verse_out = vcat(verse_out, verses(crpsdata[v[1]][v[2]]))
           chapter = vcat(chapter, repeat(v[1], inner=length(v[2])))
           verse = vcat(verse, v[2])
       end
julia> using DataFrames
julia> tbl = DataFrame(
           chapter=chapter[idx],
           verse=verse[idx],
           verse_text=arabic.(verse_out[idx])
       );
julia> tbl
10×3 DataFrame
 Row │ chapter  verse  verse_text
     │ Int64    Int64  String
─────┼───────────────────────────────────────────────────
   1 │      18     85  فَأَتْبَعَ سَبَبًا
   2 │      18     89  ثُمَّ أَتْبَعَ سَبَبًا
   3 │      18     92  ثُمَّ أَتْبَعَ سَبَبًا
   4 │      18     66  قَالَ لَهُۥ مُوسَىٰ هَلْ أَتَّبِعُكَ عَلَىٰٓ أَن تُعَلِّ…
   5 │      18     70  قَالَ فَإِنِ ٱتَّبَعْتَنِى فَلَا تَسْـَٔلْنِى عَن شَىْ…
   6 │      18      8  وَإِنَّا لَجَٰعِلُونَ مَا عَلَيْهَا صَعِيدًا جُرُزًا
   7 │      18     28  وَٱصْبِرْ نَفْسَكَ مَعَ ٱلَّذِينَ يَدْعُونَ رَبَّهُم بِ…
   8 │      18    108  خَٰلِدِينَ فِيهَا لَا يَبْغُونَ عَنْهَا حِوَلًا
   9 │      18     91  كَذَٰلِكَ وَقَدْ أَحَطْنَا بِمَا لَدَيْهِ خُبْرًا
  10 │      18     68  وَكَيْفَ تَصْبِرُ عَلَىٰ مَا لَمْ تُحِطْ بِهِۦ خُبْرًا
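Note that position 84 in `idx` appears as verse 85 in this table. The ranked positions refer to rows of the filtered data, and the list of surviving verse numbers printed earlier skips verse 3, so positions and verse numbers are offset from that point on. A toy sketch of this lookup, with made-up values and hypothetical names `remaining` and `ranked_positions`:

```julia
# Verse numbers that survived preprocessing (verse 3 is absent).
remaining = [1, 2, 4, 5, 6]

# Positions returned by ranking refer to rows of the filtered data.
ranked_positions = [3, 1]

# Index into `remaining` to recover the actual verse numbers.
verse_numbers = remaining[ranked_positions]
```

This is why the code above indexes `verse` and `verse_out` with `idx` instead of using `idx` directly as verse numbers.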
The following is the above output formatted as a Markdown table using Latexify.jl:
Pkg.add("Latexify")
using Latexify
mdtable(DataFrame(tbl), latex=false)
chapter | verse | verse_text |
---|---|---|
18 | 85 | فَأَتْبَعَ سَبَبًا |
18 | 89 | ثُمَّ أَتْبَعَ سَبَبًا |
18 | 92 | ثُمَّ أَتْبَعَ سَبَبًا |
18 | 66 | قَالَ لَهُۥ مُوسَىٰ هَلْ أَتَّبِعُكَ عَلَىٰٓ أَن تُعَلِّمَنِ مِمَّا عُلِّمْتَ رُشْدًا |
18 | 70 | قَالَ فَإِنِ ٱتَّبَعْتَنِى فَلَا تَسْـَٔلْنِى عَن شَىْءٍ حَتَّىٰٓ أُحْدِثَ لَكَ مِنْهُ ذِكْرًا |
18 | 8 | وَإِنَّا لَجَٰعِلُونَ مَا عَلَيْهَا صَعِيدًا جُرُزًا |
18 | 28 | وَٱصْبِرْ نَفْسَكَ مَعَ ٱلَّذِينَ يَدْعُونَ رَبَّهُم بِٱلْغَدَوٰةِ وَٱلْعَشِىِّ يُرِيدُونَ وَجْهَهُۥ وَلَا تَعْدُ عَيْنَاكَ عَنْهُمْ تُرِيدُ زِينَةَ ٱلْحَيَوٰةِ ٱلدُّنْيَا وَلَا تُطِعْ مَنْ أَغْفَلْنَا قَلْبَهُۥ عَن ذِكْرِنَا وَٱتَّبَعَ هَوَىٰهُ وَكَانَ أَمْرُهُۥ فُرُطًا |
18 | 108 | خَٰلِدِينَ فِيهَا لَا يَبْغُونَ عَنْهَا حِوَلًا |
18 | 91 | كَذَٰلِكَ وَقَدْ أَحَطْنَا بِمَا لَدَيْهِ خُبْرًا |
18 | 68 | وَكَيْفَ تَصْبِرُ عَلَىٰ مَا لَمْ تُحِطْ بِهِۦ خُبْرًا |
The following are the translations of the above verses:
Chapter | Verse | English Translation |
---|---|---|
18 | 85 | So he travelled a course, |
18 | 89 | Then he travelled a ˹different˺ course |
18 | 92 | Then he travelled a ˹third˺ course |
18 | 66 | Moses said to him, “May I follow you, provided that you teach me some of the right guidance you have been taught?” |
18 | 70 | He responded, “Then if you follow me, do not question me about anything until I ˹myself˺ clarify it for you.” |
18 | 8 | And We will certainly reduce whatever is on it to barren ground. |
18 | 28 | And patiently stick with those who call upon their Lord morning and evening, seeking His pleasure. Do not let your eyes look beyond them, desiring the luxuries of this worldly life. And do not obey those whose hearts We have made heedless of Our remembrance, who follow ˹only˺ their desires and whose state is ˹total˺ loss. |
18 | 108 | where they will be forever, never desiring anywhere else. |
18 | 91 | So it was. And We truly had full knowledge of him. |
18 | 68 | And how can you be patient with what is beyond your ˹realm of˺ knowledge?” |