wordtm package
Subpackages
Submodules
wordtm.meta module
- wordtm.meta.addin(func)[source]
Adds additional features (showing timing information and source code) to a function at runtime. Two keyword parameters are injected into the function ‘func’: ‘timing’ and ‘code’. ‘timing’ is a flag indicating whether the execution time of the function is shown, default to False. ‘code’ determines whether the source code of the function is shown and/or the function is invoked: ‘0’ executes the function without showing its source code, ‘1’ shows the source code after execution, and ‘2’ shows the source code without execution; default to 0.
- Parameters:
func (function) – The target function for inserting the additional features (timing information and code display), default to None
- Returns:
The wrapper function
- Return type:
function
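The runtime injection described above can be sketched as a plain decorator. This is an illustrative approximation only (the names ‘addin_sketch’ and ‘square’ are invented for the example), not the actual wordtm implementation:

```python
import functools
import inspect
import time

def addin_sketch(func):
    """Illustrative wrapper adding 'timing' and 'code' keyword parameters."""
    @functools.wraps(func)
    def wrapper(*args, timing=False, code=0, **kwargs):
        result = None
        if code < 2:  # 0 or 1: execute the function
            start = time.perf_counter()
            result = func(*args, **kwargs)
            if timing:
                print(f"{func.__name__} took {time.perf_counter() - start:.6f}s")
        if code > 0:  # 1 or 2: show the source code
            print(inspect.getsource(func))
        return result
    return wrapper

@addin_sketch
def square(x):
    return x * x
```

With this sketch, `square(3)` runs normally, `square(3, timing=True)` also prints the elapsed time, and `square(3, code=2)` prints the source without executing the body.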
- wordtm.meta.addin_all(modname='wordtm')[source]
Applies ‘addin’ function to all functions of all sub-modules of a module at runtime.
- Parameters:
modname (str, optional) – The target module whose functions are to have the additional features inserted, default to ‘wordtm’
- wordtm.meta.addin_all_functions(submod)[source]
Applies ‘addin’ function to all functions of a module at runtime.
- Parameters:
submod (module) – The target sub-module whose functions are to have the additional features inserted, default to None
- wordtm.meta.get_module_info(detailed=False)[source]
Gets the information of the module ‘wordtm’.
- Parameters:
detailed (bool, optional) – The flag indicating whether only function signature or detailed source code is shown, default to False
- Returns:
The information of the module ‘wordtm’
- Return type:
str
wordtm.pivot module
- wordtm.pivot.stat(df, chi=False, *, timing=False, code=0)[source]
Returns a pivot table from the DataFrame ‘df’ storing the input Scripture, with columns ‘book’, ‘book_no’, ‘chapter’, ‘verse’, ‘text’, ‘testament’, ‘category’, ‘cat’, and ‘cat_no’.
- Parameters:
df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
chi (bool, optional) – If the value is True, the input text is assumed to be Chinese; otherwise, it is English, default to False
- Returns:
The pivot table of the input Scripture grouped by category (‘cat_no’)
- Return type:
pandas.DataFrame
wordtm.quot module
- wordtm.quot.extract_quotation(text, quot_marks, *, timing=False, code=0)[source]
Returns the text within a pair of quotation marks.
- Parameters:
text (str) – The target text to be extracted, default to None
quot_marks (list) – A pair of quotation marks, [‘“’, ‘”’] for English text or [‘『’, ‘』’] for Chinese text, default to None
- Returns:
The text within a pair of quotation marks, if any, otherwise, an empty string
- Return type:
str
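A minimal pure-Python sketch of this behaviour (the helper name is hypothetical, and the real function may treat nested or repeated marks differently):

```python
def extract_quotation_sketch(text, quot_marks):
    """Return the text between the first pair of quotation marks, or ''."""
    open_m, close_m = quot_marks
    start = text.find(open_m)
    if start == -1:
        return ""
    end = text.find(close_m, start + len(open_m))
    if end == -1:
        return ""
    return text[start + len(open_m):end]
```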
- wordtm.quot.match_text(target, sent_tokens, lang, threshold, n=5, *, timing=False, code=0)[source]
Returns a list of tuples, each holding the cosine similarity measure between an OT verse and the target verse, together with the index of that OT verse in the DataFrame storing the prescribed OT Scripture.
- Parameters:
target (str) – The target verse to be matched, default to None
sent_tokens (str) – The collection of sentence tokens against which the target verse is matched, default to None
lang (str) – If the value is ‘chi’, the processed language is assumed to be Chinese; otherwise, it is English, default to None
threshold (float) – The threshold value of the cosine similarity measure between the target verse and an OT verse, where the cosine similarity measure of a matched OT verse and the target verse should be greater than this value, default to None
n (int, optional) – The upper bound of the number of matched verses, default to 5
- Returns:
The list of tuples of the cosine similarity measure and the index of the OT verse
- Return type:
list
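The cosine-similarity ranking described above can be illustrated with a self-contained sketch over term-frequency vectors; the actual match_text presumably uses its own tokenization and vectorization, so the names and the whitespace split below are assumptions of the example:

```python
import math
from collections import Counter

def cosine(a_tokens, b_tokens):
    """Cosine similarity between two token lists via term-frequency vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def match_text_sketch(target, candidates, threshold, n=5):
    """Return up to n (score, index) tuples whose score exceeds the threshold,
    sorted by descending similarity."""
    scores = [(cosine(target.split(), c.split()), i)
              for i, c in enumerate(candidates)]
    return sorted((s for s in scores if s[0] > threshold), reverse=True)[:n]
```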
- wordtm.quot.match_verse(i, ot_list, otdf, df, book, chap, verse, lang, threshold, *, timing=False, code=0)[source]
Returns whether the target NT verse (book, chap, verse) can match a particular verse in the list of OT verses (ot_list), and prints the matched OT verses.
- Parameters:
i (int) – The number of matched verses so far, default to None
ot_list (list) – The list of OT verses (str) to be matched, default to None
otdf (pandas.DataFrame) – The DataFrame storing the prescribed OT verses to be matched, default to None
df (pandas.DataFrame) – The DataFrame storing the collection of the target NT verses to be matched, default to None
book (str) – The Bible book short name (3 characters) of the target NT verse to be matched, default to None
chap (int) – The chapter number of the target NT verse to be matched, default to None
verse (int) – The verse number of the target NT verse to be matched, default to None
lang (str) – If the value is ‘chi’, the processed language is assumed to be Chinese; otherwise, it is English, default to None
threshold (float) – The threshold value of the cosine similarity measure between the target verse and an OT verse, where that measure for a successful match should be greater than this value, default to None
- Returns:
True if the target verse matched an OT verse, False otherwise
- Return type:
bool
- wordtm.quot.show_quot(target, source='ot', lang='en', threshold=0.5, *, timing=False, code=0)[source]
Shows a collection of matched OT verses, if any, based on the prescribed collection of target NT verses and the threshold value.
- Parameters:
target (pandas.DataFrame) – The collection of target NT verses to be matched, default to None
source (str, optional) – The string representing the collection of all or subset of OT verses to be matched, default to ‘ot’
lang (str, optional) – If the value is ‘en’, the processed language is assumed to be English; otherwise, it is Chinese, default to ‘en’
threshold (float, optional) – The threshold value of the cosine similarity measure between the target verse and an OT verse, where that measure for a successful match should be greater than this value, default to 0.5
- Returns:
The list of tuples of the cosine similarity measure and the index of the OT verse
- Return type:
list
- wordtm.quot.tokenize(sentence, *, timing=False, code=0)[source]
Returns a list of tokens from a Chinese sentence.
- Parameters:
sentence (str) – The target text to be tokenized, default to None
- Returns:
The generator object storing the list of tokens extracted from the sentence
- Return type:
generator
wordtm.ta module
- wordtm.ta.get_sent_scores(sentences, diction, sent_len, *, timing=False, code=0) dict [source]
Returns a dictionary of sentences with their scores computed from their words.
- Parameters:
sentences (list) – The list of sentences for computing their scores, default to None
diction (collections.Counter object) – The dictionary storing the collection of tokenized words with their frequencies
sent_len (int) – The upper bound of the number of sentences to be processed, default to None
- Returns:
The dictionary of sentences with their scores
- Return type:
dict
- wordtm.ta.get_sentences(df, lang, *, timing=False, code=0)[source]
Returns the list of sentences tokenized from the collection of documents (df).
- Parameters:
df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
lang (str) – If the value is ‘chi’, the processed language is assumed to be Chinese; otherwise, it is English, default to None
- Returns:
The list of sentences tokenized from the collection of documents
- Return type:
list
- wordtm.ta.get_summary(sentences, sent_weight, threshold, sent_len, *, timing=False, code=0)[source]
Returns the summary of the collection of sentences.
- Parameters:
sentences (list) – The list of target sentences for summarization, default to None
sent_weight (collections.Counter object) – The dictionary of sentences with their scores computed from their words
threshold (float) – The minimum value of sentence weight for extracting that sentence as part of the final summary, default to None
sent_len (int) – The upper bound of the number of sentences to be processed, default to None
- Returns:
The summary of the collection of sentences
- Return type:
str
- wordtm.ta.summary(df, lang='en', weight=1.5, sent_len=8, *, timing=False, code=0)[source]
Returns the summary of the collection of sentences stored in a DataFrame (df).
- Parameters:
df (pandas.DataFrame) – The collection of target sentences for summarization, default to None
lang (str, optional) – The language, either English (‘en’) or Chinese (‘chi’) of the target text to be processed, default to ‘en’
weight (float, optional) – The factor to be multiplied to the threshold, which determines the sentences as the summary, default to 1.5
sent_len (int, optional) – The upper bound of the number of sentences to be processed, default to 8
- Returns:
The summary of the collection of sentences
- Return type:
str
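The frequency-based extractive pipeline behind get_sent_scores, get_summary, and summary can be condensed into a self-contained sketch. The helper name, the whitespace tokenization, and the mean-score threshold are assumptions of the example, not the actual wordtm code:

```python
from collections import Counter

def summarize_sketch(sentences, weight=1.5):
    """Score each sentence by the average frequency of its words, then keep
    sentences scoring above weight * mean score, in their original order."""
    diction = Counter(w for s in sentences for w in s.lower().split())
    scores = {s: sum(diction[w] for w in s.lower().split()) / len(s.split())
              for s in sentences if s.split()}
    threshold = weight * (sum(scores.values()) / len(scores))
    return " ".join(s for s in sentences if scores.get(s, 0) > threshold)
```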
wordtm.tm module
- class wordtm.tm.BTM(textfile, chi=False, num_topics=15, embed=True)[source]
Bases:
object
The BTM object for BERTopic modeling.
- Variables:
num_topics (int) – The number of topics to be built from the modeling, default to 15
textfile (str) – The filename of the text file to be processed
chi (bool) – The flag indicating whether the processed text is in Chinese or not, True stands for Traditional Chinese or False for English
num_topics – The number of topics set for the topic model
docs (pandas.DataFrame or list) – The collection of the original documents to be processed
pro_docs (list) – The collection of documents, in form of list of lists of words after text preprocessing
dictionary (gensim.corpora.Dictionary) – The dictionary of word ids with their tokenized words from preprocessed documents (‘pro_docs’)
corpus (list) – The list of documents, where each document is a list of tuples (word id, word frequency in the particular document)
model (bertopic.BERTopic) – The BERTopic model object
embed (bool) – The flag indicating whether the BERTopic model is trained with the BERT pretrained model
bmodel (transformers.BertModel) – The BERT pretrained model
bt_vectorizer (sklearn.feature_extraction.text.CountVectorizer) – The vectorizer extracted from the BERTopic model for model evaluation
bt_analyzer (functools.partial) – The analyzer extracted from the BERTopic model for model evaluation
cleaned_docs (list) – The list of documents (string) built by grouping the original documents by the topics created from the BERTopic model
- fit_chi()[source]
Builds the BERTopic model for Chinese text with the created corpus and dictionary.
- preprocess()[source]
Processes the original English documents (wordtm.tm.BTM.docs) by invoking wordtm.tm.process_text, and builds a dictionary and a corpus from the preprocessed documents for the BERTopic model.
- class wordtm.tm.LDA(textfile, chi=False, num_topics=15)[source]
Bases:
object
The LDA object for Latent Dirichlet Allocation (LDA) modeling.
- Variables:
num_topics (int) – The number of topics to be built from the modeling, default to 15.
textfile (str) – The filename of the text file to be processed
chi (bool) – The flag indicating whether the processed text is in Chinese or not, True stands for Traditional Chinese or False for English
num_topics – The number of topics set for the topic model
docs (pandas.DataFrame or list) – The collection of the original documents to be processed
pro_docs (list) – The collection of documents, in form of list of lists of words after text preprocessing
dictionary (gensim.corpora.Dictionary) – The dictionary of word ids with their tokenized words from preprocessed documents (‘pro_docs’)
corpus (list) – The list of documents, where each document is a list of tuples (word id, word frequency in the particular document)
model (gensim.models.LdaModel) – The LDA model object
vis_data (pyLDAvis.PreparedData) – The LDA model’s prepared data for visualization
- evaluate()[source]
Computes and outputs the coherence score, perplexity, topic diversity, and topic size distribution.
- preprocess()[source]
Processes the original English documents (wordtm.tm.LDA.docs) by invoking wordtm.tm.process_text, and builds a dictionary and a corpus from the preprocessed documents for the LDA model.
- class wordtm.tm.NMF(textfile, chi=False, num_topics=15)[source]
Bases:
object
The NMF object for Non-negative Matrix Factorization (NMF) modeling.
- Variables:
num_topics (int) – The number of topics to be built from the modeling, default to 15.
textfile (str) – The filename of the text file to be processed
chi (bool) – The flag indicating whether the processed text is in Chinese or not, True stands for Traditional Chinese or False for English
num_topics – The number of topics set for the topic model
docs (pandas.DataFrame or list) – The collection of the original documents to be processed
pro_docs (list) – The collection of documents, in form of list of lists of words after text preprocessing
dictionary (gensim.corpora.Dictionary) – The dictionary of word ids with their tokenized words from preprocessed documents (‘pro_docs’)
corpus (list) – The list of documents, where each document is a list of tuples (word id, word frequency in the particular document)
model (gensim.models.Nmf) – The NMF model object
- evaluate()[source]
Computes and outputs the coherence score, topic diversity, and topic size distribution.
- preprocess()[source]
Processes the original English documents (wordtm.tm.NMF.docs) by invoking wordtm.tm.process_text, and builds a dictionary and a corpus from the preprocessed documents for the NMF model.
- wordtm.tm.btm_process(doc_file, cat=0, chi=False, group=True, eval=False, *, timing=False, code=0)[source]
Pipelines the BERTopic modeling.
- Parameters:
doc_file (str) – The filename of the prescribed text file to be loaded, default to None
cat (int or str, optional) – The category indicating a subset of the Scripture to be loaded, where 0 stands for the whole Bible, 1 for OT, 2 for NT, or one of the ten categories [‘tor’, ‘oth’, ‘ket’, ‘map’, ‘mip’, ‘gos’, ‘nth’, ‘pau’, ‘epi’, ‘apo’] (See the package’s internal file ‘data/book_cat.csv’), default to 0
chi (bool, optional) – The flag indicating whether the text is processed as Chinese (True) or English (False), default to False
group (bool, optional) – The flag indicating whether the loaded text is grouped by chapter, default to True
eval (bool, optional) – The flag indicating whether the model evaluation results will be shown, default to False
- Returns:
The pipelined BTM
- Return type:
wordtm.tm.BTM object
- wordtm.tm.lda_process(doc_file, cat=0, chi=False, group=True, eval=False, *, timing=False, code=0)[source]
Pipelines the LDA modeling.
- Parameters:
doc_file (str) – The filename of the prescribed text file to be loaded, default to None
cat (int or str, optional) – The category indicating a subset of the Scripture to be loaded, where 0 stands for the whole Bible, 1 for OT, 2 for NT, or one of the ten categories [‘tor’, ‘oth’, ‘ket’, ‘map’, ‘mip’, ‘gos’, ‘nth’, ‘pau’, ‘epi’, ‘apo’] (See the package’s internal file ‘data/book_cat.csv’), default to 0
chi (bool, optional) – The flag indicating whether the text is processed as Chinese (True) or English (False), default to False
group (bool, optional) – The flag indicating whether the loaded text is grouped by chapter, default to True
eval (bool, optional) – The flag indicating whether the model evaluation results will be shown, default to False
- Returns:
The pipelined LDA
- Return type:
wordtm.tm.LDA object
- wordtm.tm.load_bible(textfile, cat=0, group=True, *, timing=False, code=0)[source]
Loads and returns the Bible Scripture from the prescribed internal file (‘textfile’).
- Parameters:
textfile (str) – The package’s internal Bible text from which the text is loaded, either World English Bible (‘web.csv’) or Chinese Union Version (Traditional) (‘cuv.csv’), default to None
cat (int or str, optional) – The category indicating a subset of the Scripture to be loaded, where 0 stands for the whole Bible, 1 for OT, 2 for NT, or one of the ten categories [‘tor’, ‘oth’, ‘ket’, ‘map’, ‘mip’, ‘gos’, ‘nth’, ‘pau’, ‘epi’, ‘apo’] (See the package’s internal file ‘data/book_cat.csv’), default to 0
group (bool, optional) – The flag indicating whether the loaded text is grouped by chapter, default to True
- Returns:
The collection of Scripture loaded
- Return type:
pandas.DataFrame
- wordtm.tm.load_text(textfile, *, timing=False, code=0)[source]
Loads and returns the list of documents from the prescribed file (‘textfile’).
- Parameters:
textfile (str) – The prescribed text file from which the text is loaded, default to None
- Returns:
The list of documents loaded
- Return type:
list
- wordtm.tm.nmf_process(doc_file, cat=0, chi=False, group=True, eval=False, *, timing=False, code=0)[source]
Pipelines the NMF modeling.
- Parameters:
doc_file (str) – The filename of the prescribed text file to be loaded, default to None
cat (int or str, optional) – The category indicating a subset of the Scripture to be loaded, where 0 stands for the whole Bible, 1 for OT, 2 for NT, or one of the ten categories [‘tor’, ‘oth’, ‘ket’, ‘map’, ‘mip’, ‘gos’, ‘nth’, ‘pau’, ‘epi’, ‘apo’] (See the package’s internal file ‘data/book_cat.csv’), default to 0
chi (bool, optional) – The flag indicating whether the text is processed as Chinese (True) or English (False), default to False
group (bool, optional) – The flag indicating whether the loaded text is grouped by chapter, default to True
eval (bool, optional) – The flag indicating whether the model evaluation results will be shown, default to False
- Returns:
The pipelined NMF
- Return type:
wordtm.tm.NMF object
- wordtm.tm.process_text(doc, *, timing=False, code=0)[source]
Processes the English text through tokenization, converting to lower case, removing all digits, stemming, and removing punctuation and stopwords.
- Parameters:
doc (str) – The prescribed text, in form of a string, to be processed, default to None
- Returns:
The list of the processed strings
- Return type:
list
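The preprocessing steps listed above can be sketched with only the standard library. The tiny stopword set and the naive suffix-stripping below stand in for whatever stopword list and stemmer wordtm actually uses; they are assumptions for illustration:

```python
import re
import string

# Tiny stopword set for illustration only; the real function presumably
# uses a full stopword list (e.g. NLTK's).
STOPWORDS = {"the", "a", "an", "and", "of", "is", "in", "to"}

def process_text_sketch(doc):
    """Lower-case, strip digits and punctuation, drop stopwords, and apply
    a crude plural-stripping 'stem' (illustration only)."""
    doc = re.sub(r"\d+", "", doc.lower())
    doc = doc.translate(str.maketrans("", "", string.punctuation))
    tokens = [w for w in doc.split() if w not in STOPWORDS]
    return [w[:-1] if w.endswith("s") else w for w in tokens]  # naive stemming
```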
wordtm.util module
- wordtm.util.add_chi_vocab(*, timing=False, code=0)[source]
Loads the Chinese Bible vocabulary from the internal file ‘bible_vocab.txt’, and adds to the Jieba word list for future tokenization
- wordtm.util.chi_sent_terms(text, *, timing=False, code=0)[source]
Returns the list of Chinese words tokenized from the input text.
- Parameters:
text (str) – The input Chinese text to be tokenized, default to None
- Returns:
The list of Chinese words
- Return type:
list
- wordtm.util.chi_stops(*, timing=False, code=0)[source]
Loads the common Chinese (Traditional) vocabulary to Jieba for future tokenization, and the Chinese stopwords for future wordcloud plotting.
- Returns:
The list of stopwords for wordcloud plotting
- Return type:
list
- wordtm.util.clean_text(df, *, timing=False, code=0)[source]
Cleans the text from the Scripture stored in the DataFrame ‘df’, by removing all digits, replacing newline by a space, removing English stopwords, converting all characters to lower case, and removing all characters except alphanumeric and whitespace.
- Parameters:
df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
- Returns:
The cleaned text in a DataFrame
- Return type:
pandas.DataFrame
- wordtm.util.extract(df, testament=-1, category='', book=0, chapter=0, verse=0, *, timing=False, code=0)[source]
Extracts a subset of the Scripture stored in a DataFrame by testament, category, or book/chapter/verse.
- Parameters:
df (pandas.DataFrame) – The collection of the Bible Scripture with columns ‘book’, ‘book_no’, ‘chapter’, ‘verse’, ‘text’, ‘testament’, ‘category’, ‘cat’, and ‘cat_no’, default to None
testament (int, optional) – The prescribed testament to be extracted, -1 stands for no prescription, 0 for OT, or 1 for NT, default to -1
category (str, optional) – The prescribed category to be extracted, and it should be either a full category name or a short name with 3 lower-case letters from a list of 10 categories, default to ‘’
book (str or int, optional) – The prescribed Bible book to be extracted, and it should be either a 3-letter short book name or a book number from 1 to 66, default to 0
chapter (int or tuple, optional) – The prescribed chapter or a tuple indicating the range of chapters of a Bible book to be extracted, default to 0
verse (int or tuple, optional) – The prescribed verse or a tuple indicating the range of verses from a chapter of a Bible book to be extracted, default to 0
- Returns:
The subset of the input Scripture, if any, otherwise, the message ‘No scripture is extracted!’
- Return type:
pandas.DataFrame or str
- wordtm.util.extract2(df, filter='', *, timing=False, code=0)[source]
Extracts the Bible Scripture through a specific filter string by invoking the function ‘util.extract’.
- Parameters:
df (pandas.DataFrame) – The collection of the Bible Scripture, default to None
filter (str, optional) – The prescribed filter string with the format ‘<book> <chapter>:<verse>[-<verse2>]’ for extracting a range of verses in the Scripture, default to ‘’
- Returns:
The prescribed range of verses from the input Scripture, or the whole Scripture if the filter string is empty
- Return type:
pandas.DataFrame
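The filter-string format ‘&lt;book&gt; &lt;chapter&gt;:&lt;verse&gt;[-&lt;verse2&gt;]’ can be parsed as sketched below; the helper is hypothetical and only illustrates the format, not the actual parsing code:

```python
import re

def parse_filter(filter_str):
    """Parse '<book> <chapter>:<verse>[-<verse2>]' into its components.
    Returns (book, chapter, verse_start, verse_end), or None if no match."""
    m = re.fullmatch(r"(\w+)\s+(\d+):(\d+)(?:-(\d+))?", filter_str.strip())
    if not m:
        return None
    book, chap, v1, v2 = m.groups()
    return book, int(chap), int(v1), int(v2) if v2 else int(v1)
```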
- wordtm.util.get_diction(docs, *, timing=False, code=0)[source]
Determines which is the target language, English or Chinese, in order to build a dictionary of words with their frequencies.
- Parameters:
docs (pandas.DataFrame or list) – The collection of documents, default to None
- Returns:
The dictionary of words with their frequencies
- Return type:
dict
- wordtm.util.get_diction_chi(docs, *, timing=False, code=0)[source]
Tokenizes the collection of Chinese documents and builds a dictionary of words with their frequencies.
- Parameters:
docs (pandas.DataFrame or list) – The collection of text, default to None
- Returns:
The dictionary of words with their frequencies
- Return type:
dict
- wordtm.util.get_diction_en(docs, *, timing=False, code=0)[source]
Tokenizes the collection of English documents and builds a dictionary of words with their frequencies.
- Parameters:
docs (pandas.DataFrame or list) – The collection of text, default to None
- Returns:
The dictionary of words with their frequencies
- Return type:
dict
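Building such a word-frequency dictionary can be sketched with collections.Counter (the whitespace tokenization is an assumption of the example; the real function presumably applies fuller preprocessing first):

```python
from collections import Counter

def get_diction_sketch(docs):
    """Build a word-frequency dictionary from a list of documents."""
    return Counter(w for doc in docs for w in doc.lower().split())
```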
- wordtm.util.get_list(df, column='book', *, timing=False, code=0)[source]
Extracts and returns the prescribed column from the Scripture stored in the DataFrame ‘df’.
- Parameters:
df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
column (str, optional) – The column by which the Scripture is grouped, default to ‘book’
- Returns:
The grouped Scripture
- Return type:
pandas.DataFrame
- wordtm.util.get_sent_terms(text, *, timing=False, code=0)[source]
Determines how to tokenize the input text, based on the global language setting, either English (‘en’) or Traditional Chinese (‘chi’).
- Parameters:
text (str) – The input text to be tokenized, default to None
- Returns:
The list of tokenized words
- Return type:
list
- wordtm.util.get_text(df, *, timing=False, code=0)[source]
Extracts and returns the text from the Scripture stored in the DataFrame ‘df’ after joining the list of text into a string and removing all the ideographic spaces (U+3000) from the text.
- Parameters:
df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
- Returns:
The extracted text
- Return type:
str
- wordtm.util.get_text_list(df, *, timing=False, code=0)[source]
Extracts and returns the list of text from the Scripture stored in the DataFrame ‘df’ after removing all the ideographic spaces (U+3000) from the text.
- Parameters:
df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
- Returns:
The extracted text
- Return type:
list
- wordtm.util.group_text(df, column='chapter', *, timing=False, code=0)[source]
Groups the Bible Scripture in the DataFrame ‘df’ by the prescribed column, and ‘df’ should include columns ‘book’, ‘book_no’, ‘chapter’, ‘verse’, ‘text’, ‘testament’, ‘category’, ‘cat’, and ‘cat_no’.
- Parameters:
df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None
column (str, optional) – The column by which the Scripture is grouped, default to ‘chapter’
- Returns:
The grouped Scripture
- Return type:
pandas.DataFrame
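The grouping operation can be illustrated without pandas: this sketch (names are hypothetical) groups plain verse records by (book, column value) and joins their texts:

```python
from collections import defaultdict

def group_text_sketch(verses, column="chapter"):
    """Concatenate verse texts grouped by (book, column value)."""
    grouped = defaultdict(list)
    for v in verses:
        grouped[(v["book"], v[column])].append(v["text"])
    return {key: " ".join(texts) for key, texts in grouped.items()}
```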
- wordtm.util.is_chi(*, timing=False, code=0)[source]
Checks whether the Chinese language flag is set.
- Returns:
True if the Chinese language flag (chi_flag) is set, False otherwise
- Return type:
bool
- wordtm.util.load_text(filepath, nr=0, info=False, *, timing=False, code=0)[source]
Loads and returns the text from the prescribed file path (‘filepath’).
- Parameters:
filepath (str) – The prescribed filepath from which the text is loaded, default to None
nr (int, optional) – The number of rows of text to be loaded; 0 represents all rows, default to 0
info (bool, optional) – The flag whether the dataset information is shown, default to False
- Returns:
The collection of text with the prescribed number of rows loaded
- Return type:
pandas.DataFrame
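Row-limited loading can be sketched with the csv module. This hypothetical variant reads from a file object rather than a file path, purely to keep the illustration self-contained:

```python
import csv
import io

def load_text_sketch(fileobj, nr=0):
    """Read rows from a CSV file object; nr=0 loads all rows."""
    rows = list(csv.DictReader(fileobj))
    return rows if nr == 0 else rows[:nr]
```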
- wordtm.util.load_word(ver='web.csv', nr=0, info=False, *, timing=False, code=0)[source]
Loads and returns the text from the prescribed internal file (‘ver’).
- Parameters:
ver (str, optional) – The package’s internal Bible text from which the text is loaded, either World English Bible (‘web.csv’) or Chinese Union Version (Traditional)(‘cuv.csv’), default to ‘web.csv’
nr (int, optional) – The number of rows of Scripture to be loaded; 0 represents all rows, default to 0
info (bool, optional) – The flag whether the dataset information is shown, default to False
- Returns:
The collection of Scripture with the prescribed number of rows loaded
- Return type:
pandas.DataFrame
wordtm.version module
wordtm.viz module
- wordtm.viz.chi_wordcloud(docs, image='heart.jpg', *, timing=False, code=0)[source]
Prepares and shows a Chinese wordcloud.
- Parameters:
docs (pandas.DataFrame) – The collection of Chinese documents for preparing a wordcloud, default to None
image (str, optional) – The filename of the image as the mask of the wordcloud, default to ‘heart.jpg’
- wordtm.viz.plot_cloud(wordcloud, *, timing=False, code=0)[source]
Plots the prepared ‘wordcloud’.
- Parameters:
wordcloud (WordCloud object) – The WordCloud object for plotting, default to None
- wordtm.viz.show_wordcloud(docs, image='heart.jpg', *, timing=False, code=0)[source]
Prepares and shows a wordcloud.
- Parameters:
docs (pandas.DataFrame) – The collection of documents for preparing a wordcloud, default to None
image (str, optional) – The filename of the image as the mask of the wordcloud, default to ‘heart.jpg’