wordtm package

Subpackages

Submodules

wordtm.meta module

wordtm.meta.addin(func)[source]

Adds two optional features, timing information and source-code display, to a function at runtime, by inserting two keyword parameters, ‘timing’ and ‘code’, into the function ‘func’. ‘timing’ (default False) is a flag indicating whether the execution time of the function is shown. ‘code’ (default 0) determines whether the source code of ‘func’ is shown and/or the function is invoked: 0 executes the function without showing its source code, 1 shows the source code after execution, and 2 shows the source code without executing the function.

Parameters:

func (function) – The target function into which the additional features (timing information and code display) are inserted, default to None

Returns:

The wrapper function

Return type:

function
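
As a rough illustration of this pattern (a stand-alone sketch, not the package’s actual implementation), a decorator can inject the two keyword parameters at wrap time:

```python
import functools
import inspect
import time

def addin(func):
    """Sketch of a decorator adding 'timing' and 'code' options at runtime."""
    @functools.wraps(func)
    def wrapper(*args, timing=False, code=0, **kwargs):
        if code == 2:                     # show source only, skip execution
            print(inspect.getsource(func))
            return None
        start = time.time()
        result = func(*args, **kwargs)
        if timing:                        # report elapsed time when requested
            print(f"Elapsed: {time.time() - start:.4f}s")
        if code == 1:                     # show source after execution
            print(inspect.getsource(func))
        return result
    return wrapper

@addin
def square(x):
    return x * x
```

With this sketch, square(4) runs normally, square(4, timing=True) also prints the elapsed time, and square(4, code=2) prints the source without executing the function.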

wordtm.meta.addin_all(modname='wordtm')[source]

Applies ‘addin’ function to all functions of all sub-modules of a module at runtime.

Parameters:

modname (str, optional) – The target module into whose functions the additional features are inserted, default to ‘wordtm’

wordtm.meta.addin_all_functions(submod)[source]

Applies ‘addin’ function to all functions of a module at runtime.

Parameters:

submod (module) – The target sub-module into whose functions the additional features are inserted, default to None

wordtm.meta.get_module_info(detailed=False)[source]

Gets the information of the module ‘wordtm’.

Parameters:

detailed (bool, optional) – The flag indicating whether detailed source code (True) or only the function signature (False) is shown, default to False

Returns:

The information of the module ‘wordtm’

Return type:

str
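
The kind of report described above can be gathered with the standard ‘inspect’ module; the following is a hedged stand-alone sketch (applied to ‘json’ as an example module, not to wordtm itself):

```python
import inspect
import json  # any module serves as an example here

def module_info(mod, detailed=False):
    """Sketch: list each function's signature, or its full source when detailed."""
    parts = []
    for name, fn in inspect.getmembers(mod, inspect.isfunction):
        parts.append(inspect.getsource(fn) if detailed
                     else f"{name}{inspect.signature(fn)}")
    return "\n".join(parts)

print(module_info(json))  # one "name(signature)" entry per line
```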

wordtm.pivot module

wordtm.pivot.stat(df, chi=False, *, timing=False, code=0)[source]

Returns a pivot table from the DataFrame ‘df’ storing the input Scripture, with columns ‘book’, ‘book_no’, ‘chapter’, ‘verse’, ‘text’, ‘testament’, ‘category’, ‘cat’, and ‘cat_no’.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None

  • chi (bool, optional) – If the value is True, the input text is assumed to be Chinese; otherwise, it is English, default to False

Returns:

The pivot table of the input Scripture grouped by category (‘cat_no’)

Return type:

pandas.DataFrame
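
One plausible shape for such a pivot table can be sketched with a toy DataFrame using the column names the documentation describes (this is an illustration of the idea, not the package’s code):

```python
import pandas as pd

# Toy Scripture-like frame with the documented column names
df = pd.DataFrame({
    "book":   ["Gen", "Gen", "Mat"],
    "cat_no": [1, 1, 6],
    "cat":    ["tor", "tor", "gos"],
    "verse":  [1, 2, 1],
})

# Count verses per category -- one plausible aggregation for the pivot
pivot = pd.pivot_table(df, index="cat_no", values="verse", aggfunc="count")
print(pivot)
```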

wordtm.quot module

wordtm.quot.extract_quotation(text, quot_marks, *, timing=False, code=0)[source]

Returns the text within a pair of quotation marks.

Parameters:
  • text (str) – The target text to be extracted, default to None

  • quot_marks (list) – A pair of quotation marks, [‘”’, ‘”’] for English text or [’『’, ‘』’] for Chinese text, default to None

Returns:

The text within a pair of quotation marks, if any, otherwise, an empty string

Return type:

str
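
The behaviour described, returning the text between the first pair of marks or an empty string, can be sketched in a few lines (illustrative, not the package’s implementation):

```python
def extract_quotation(text, quot_marks):
    """Sketch: return text inside the first pair of quotation marks, else ''."""
    open_q, close_q = quot_marks
    start = text.find(open_q)
    if start == -1:
        return ""
    end = text.find(close_q, start + len(open_q))
    return "" if end == -1 else text[start + len(open_q):end]

print(extract_quotation("He said, “hello” to all", ["“", "”"]))
```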

wordtm.quot.match_text(target, sent_tokens, lang, threshold, n=5, *, timing=False, code=0)[source]

Returns a list of tuples, each holding the cosine similarity measure between an OT verse and the target verse, and the index of that OT verse in the DataFrame storing the prescribed OT Scripture.

Parameters:
  • target (str) – The target verse to be matched, default to None

  • sent_tokens (str) – The collection of sentence tokens to be matched against the target verse, default to None

  • lang (str) – If the value is ‘chi’, the processed language is assumed to be Chinese; otherwise, it is English, default to None

  • threshold (float) – The threshold value of the cosine similarity measure between the target verse and an OT verse, where the cosine similarity measure of a matched OT verse and the target verse should be greater than this value, default to None

  • n (int, optional) – The upper bound of the number of matched verses, default to 5

Returns:

The list of tuples of the cosine similarity measure and the index of the OT verse

Return type:

list
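
The cosine similarity measure over word counts that this function relies on can be sketched in pure Python (the package may well use TF-IDF vectors instead; this is only an illustration of the measure and the thresholded, best-first selection):

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors (sketch)."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def match_text(target, sentences, threshold, n=5):
    """Return up to n (similarity, index) tuples above threshold, best first."""
    pairs = [(cosine(target.split(), s.split()), i)
             for i, s in enumerate(sentences)]
    return sorted((p for p in pairs if p[0] > threshold), reverse=True)[:n]
```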

wordtm.quot.match_verse(i, ot_list, otdf, df, book, chap, verse, lang, threshold, *, timing=False, code=0)[source]

Returns whether the target NT verse (book, chap, verse) matches a particular verse in the list of OT verses (ot_list), and prints the matched OT verses.

Parameters:
  • i (int) – The number of matched verses so far, default to None

  • ot_list (list) – The list of OT verses (str) to be matched, default to None

  • otdf (pandas.DataFrame) – The DataFrame storing the prescribed OT verses to be matched, default to None

  • df (pandas.DataFrame) – The DataFrame storing the collection of the target NT verses to be matched, default to None

  • book (str) – The Bible book short name (3 characters) of the target NT verse to be matched, default to None

  • chap (int) – The chapter number of the target NT verse to be matched, default to None

  • verse (int) – The verse number of the target NT verse to be matched, default to None

  • lang (str) – If the value is ‘chi’, the processed language is assumed to be Chinese; otherwise, it is English, default to None

  • threshold (float) – The threshold value of the cosine similarity measure between the target verse and an OT verse, where that measure for a successful match should be greater than this value, default to None

Returns:

True if the target verse matched an OT verse, False otherwise

Return type:

bool

wordtm.quot.show_quot(target, source='ot', lang='en', threshold=0.5, *, timing=False, code=0)[source]

Shows a collection of matched OT verses, if any, based on the prescribed collection of target NT verses and the threshold value.

Parameters:
  • target (pandas.DataFrame) – The collection of target NT verses to be matched, default to None

  • source (str, optional) – The string representing the collection of all or subset of OT verses to be matched, default to ‘ot’

  • lang (str, optional) – If the value is ‘en’, the processed language is assumed to be English; otherwise, it is Chinese, default to ‘en’

  • threshold (float, optional) – The threshold value of the cosine similarity measure between the target verse and an OT verse, where that measure for a successful match should be greater than this value, default to 0.5

Returns:

The list of tuples of the cosine similarity measure and the index of the OT verse

Return type:

list

wordtm.quot.tokenize(sentence, *, timing=False, code=0)[source]

Returns a list of tokens from a Chinese sentence.

Parameters:

sentence (str) – The target text to be tokenized, default to None

Returns:

The generator object storing the list of tokens extracted from the sentence

Return type:

generator

wordtm.ta module

wordtm.ta.get_sent_scores(sentences, diction, sent_len, *, timing=False, code=0) dict[source]

Returns the dictionary of sentences with their scores computed from their words.

Parameters:
  • sentences (list) – The list of sentences for computing their scores, default to None

  • diction (collections.Counter object) – The dictionary storing the collection of tokenized words with their frequencies

  • sent_len (int) – The upper bound of the number of sentences to be processed, default to None

Returns:

The dictionary of sentences with their scores

Return type:

dict

wordtm.ta.get_sentences(df, lang, *, timing=False, code=0)[source]

Returns the list of sentences tokenized from the collection of documents (df).

Parameters:
  • df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None

  • lang (str) – If the value is ‘chi’, the processed language is assumed to be Chinese; otherwise, it is English, default to None

Returns:

The list of sentences tokenized from the collection of documents

Return type:

list

wordtm.ta.get_summary(sentences, sent_weight, threshold, sent_len, *, timing=False, code=0)[source]

Returns the summary of the collection of sentences.

Parameters:
  • sentences (list) – The list of target sentences for summarization, default to None

  • sent_weight (collections.Counter object) – The dictionary of sentences with their scores computed from their words

  • threshold (float) – The minimum value of sentence weight for extracting that sentence as part of the final summary, default to None

  • sent_len (int) – The upper bound of the number of sentences to be processed, default to None

Returns:

The summary of the collection of sentences

Return type:

str

wordtm.ta.summary(df, lang='en', weight=1.5, sent_len=8, *, timing=False, code=0)[source]

Returns the summary of the collection of sentences stored in a DataFrame (df).

Parameters:
  • df (pandas.DataFrame) – The collection of target sentences for summarization, default to None

  • lang (str, optional) – The language, either English (‘en’) or Chinese (‘chi’) of the target text to be processed, default to ‘en’

  • weight (float, optional) – The factor to be multiplied to the threshold, which determines the sentences as the summary, default to 1.5

  • sent_len (int, optional) – The upper bound of the number of sentences to be processed, default to 8

Returns:

The summary of the collection of sentences

Return type:

str
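
The pipeline these functions describe, score each sentence by the frequency of its words and keep those scoring above a weighted average, can be sketched as follows (names and details are illustrative, not the package’s code):

```python
from collections import Counter

def summarize(sentences, weight=1.5):
    """Sketch: keep sentences scoring above weight * average word-frequency score."""
    # Word frequencies over the whole collection
    freq = Counter(w.lower() for s in sentences for w in s.split())
    # Score each sentence by the average frequency of its words
    score = {s: sum(freq[w.lower()] for w in s.split()) / len(s.split())
             for s in sentences}
    threshold = weight * sum(score.values()) / len(score)
    return " ".join(s for s in sentences if score[s] > threshold)
```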

wordtm.tm module

class wordtm.tm.BTM(textfile, chi=False, num_topics=15, embed=True)[source]

Bases: object

The BTM object for BERTopic modeling.

Variables:
  • num_topics (int) – The number of topics to be built from the modeling, default to 15

  • textfile (str) – The filename of the text file to be processed

  • chi (bool) – The flag indicating whether the processed text is in Chinese or not, True stands for Traditional Chinese or False for English

  • num_topics – The number of topics set for the topic model

  • docs (pandas.DataFrame or list) – The collection of the original documents to be processed

  • pro_docs (list) – The collection of documents, in the form of a list of lists of words after text preprocessing

  • dictionary (gensim.corpora.Dictionary) – The dictionary of word ids with their tokenized words from preprocessed documents (‘pro_docs’)

  • corpus (list) – The list of documents, where each document is a list of tuples (word id, word frequency in the particular document)

  • model (bertopic.BERTopic) – The BERTopic model object

  • embed (bool) – The flag indicating whether the BERTopic model is trained with the BERT pretrained model

  • bmodel (transformers.BertModel) – The BERT pretrained model

  • bt_vectorizer (sklearn.feature_extraction.text.CountVectorizer) – The vectorizer extracted from the BERTopic model for model evaluation

  • bt_analyzer (functools.partial) – The analyzer extracted from the BERTopic model for model evaluation

  • cleaned_docs (list) – The list of documents (string) built by grouping the original documents by the topics created from the BERTopic model

evaluate()[source]

Computes and outputs the coherence score.

fit()[source]

Builds the BERTopic model for English text with the created corpus and dictionary.

fit_chi()[source]

Builds the BERTopic model for Chinese text with the created corpus and dictionary.

pre_evaluate()[source]

Prepares the original documents per built topic for model evaluation.

preprocess()[source]

Processes the original English documents (wordtm.tm.BTM.docs) by invoking wordtm.tm.process_text, and builds a dictionary and a corpus from the preprocessed documents for the BERTopic model.

preprocess_chi()[source]

Processes the original Chinese documents (wordtm.tm.BTM.docs) by invoking wordtm.tm.process_text, and builds a dictionary and a corpus from the preprocessed documents for the BERTopic model.

show_topics()[source]

Shows the topics with their keywords from the built BERTopic model.

viz()[source]

Visualizes the built BERTopic model through the Intertopic Distance Map, Topic Word Score Charts, and Topic Similarity Matrix.

class wordtm.tm.LDA(textfile, chi=False, num_topics=15)[source]

Bases: object

The LDA object for Latent Dirichlet Allocation (LDA) modeling.

Variables:
  • num_topics (int) – The number of topics to be built from the modeling, default to 15

  • textfile (str) – The filename of the text file to be processed

  • chi (bool) – The flag indicating whether the processed text is in Chinese or not, True stands for Traditional Chinese or False for English

  • num_topics – The number of topics set for the topic model

  • docs (pandas.DataFrame or list) – The collection of the original documents to be processed

  • pro_docs (list) – The collection of documents, in the form of a list of lists of words after text preprocessing

  • dictionary (gensim.corpora.Dictionary) – The dictionary of word ids with their tokenized words from preprocessed documents (‘pro_docs’)

  • corpus (list) – The list of documents, where each document is a list of tuples (word id, word frequency in the particular document)

  • model (gensim.models.LdaModel) – The LDA model object

  • vis_data (pyLDAvis.PreparedData) – The LDA model’s prepared data for visualization

evaluate()[source]

Computes and outputs the coherence score, perplexity, topic diversity, and topic size distribution.

fit()[source]

Builds the LDA model with the created corpus and dictionary.

preprocess()[source]

Processes the original English documents (wordtm.tm.LDA.docs) by invoking wordtm.tm.process_text, and builds a dictionary and a corpus from the preprocessed documents for the LDA model.

preprocess_chi()[source]

Processes the original Chinese documents (wordtm.tm.LDA.docs) by invoking wordtm.tm.process_text, and builds a dictionary and a corpus from the preprocessed documents for the LDA model.

show_topics()[source]

Shows the topics with their keywords from the built LDA model.

viz()[source]

Shows the Intertopic Distance Map for the built LDA model.
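
The ‘dictionary’ and ‘corpus’ attributes described above follow gensim’s bag-of-words convention; a pure-Python sketch of that representation (gensim itself is not used here):

```python
from collections import Counter

pro_docs = [["light", "day", "night"], ["day", "light", "light"]]

# word -> id mapping, analogous to gensim.corpora.Dictionary's token2id
token2id = {w: i for i, w in enumerate(sorted({w for d in pro_docs for w in d}))}

# each document as (word id, frequency) tuples, analogous to doc2bow
corpus = [sorted((token2id[w], c) for w, c in Counter(d).items())
          for d in pro_docs]
print(corpus)
```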

class wordtm.tm.NMF(textfile, chi=False, num_topics=15)[source]

Bases: object

The NMF object for Non-negative Matrix Factorization (NMF) modeling.

Variables:
  • num_topics (int) – The number of topics to be built from the modeling, default to 15

  • textfile (str) – The filename of the text file to be processed

  • chi (bool) – The flag indicating whether the processed text is in Chinese or not, True stands for Traditional Chinese or False for English

  • num_topics – The number of topics set for the topic model

  • docs (pandas.DataFrame or list) – The collection of the original documents to be processed

  • pro_docs (list) – The collection of documents, in the form of a list of lists of words after text preprocessing

  • dictionary (gensim.corpora.Dictionary) – The dictionary of word ids with their tokenized words from preprocessed documents (‘pro_docs’)

  • corpus (list) – The list of documents, where each document is a list of tuples (word id, word frequency in the particular document)

  • model (gensim.models.Nmf) – The NMF model object

evaluate()[source]

Computes and outputs the coherence score, topic diversity, and topic size distribution.

fit()[source]

Builds the NMF model with the created corpus and dictionary.

preprocess()[source]

Processes the original English documents (wordtm.tm.NMF.docs) by invoking wordtm.tm.process_text, and builds a dictionary and a corpus from the preprocessed documents for the NMF model.

preprocess_chi()[source]

Processes the original Chinese documents (wordtm.tm.NMF.docs) by invoking wordtm.tm.process_text, and builds a dictionary and a corpus from the preprocessed documents for the NMF model.

show_topics_words()[source]

Shows the topics with their keywords from the built NMF model.

wordtm.tm.btm_process(doc_file, cat=0, chi=False, group=True, eval=False, *, timing=False, code=0)[source]

Pipelines the BERTopic modeling.

Parameters:
  • doc_file (str) – The filename of the prescribed text file to be loaded, default to None

  • cat (int or str, optional) – The category indicating a subset of the Scripture to be loaded, where 0 stands for the whole Bible, 1 for OT, 2 for NT, or one of the ten categories [‘tor’, ‘oth’, ‘ket’, ‘map’, ‘mip’, ‘gos’, ‘nth’, ‘pau’, ‘epi’, ‘apo’] (See the package’s internal file ‘data/book_cat.csv’), default to 0

  • chi (bool, optional) – The flag indicating whether the text is processed as Chinese (True) or English (False), default to False

  • group (bool, optional) – The flag indicating whether the loaded text is grouped by chapter, default to True

  • eval (bool, optional) – The flag indicating whether the model evaluation results will be shown, default to False

Returns:

The pipelined BTM

Return type:

wordtm.tm.BTM object

wordtm.tm.lda_process(doc_file, cat=0, chi=False, group=True, eval=False, *, timing=False, code=0)[source]

Pipelines the LDA modeling.

Parameters:
  • doc_file (str) – The filename of the prescribed text file to be loaded, default to None

  • cat (int or str, optional) – The category indicating a subset of the Scripture to be loaded, where 0 stands for the whole Bible, 1 for OT, 2 for NT, or one of the ten categories [‘tor’, ‘oth’, ‘ket’, ‘map’, ‘mip’, ‘gos’, ‘nth’, ‘pau’, ‘epi’, ‘apo’] (See the package’s internal file ‘data/book_cat.csv’), default to 0

  • chi (bool, optional) – The flag indicating whether the text is processed as Chinese (True) or English (False), default to False

  • group (bool, optional) – The flag indicating whether the loaded text is grouped by chapter, default to True

  • eval (bool, optional) – The flag indicating whether the model evaluation results will be shown, default to False

Returns:

The pipelined LDA

Return type:

wordtm.tm.LDA object

wordtm.tm.load_bible(textfile, cat=0, group=True, *, timing=False, code=0)[source]

Loads and returns the Bible Scripture from the prescribed internal file (‘textfile’).

Parameters:
  • textfile (str) – The package’s internal Bible text from which the text is loaded, either World English Bible (‘web.csv’) or Chinese Union Version (Traditional) (‘cuv.csv’), default to None

  • cat (int or str, optional) – The category indicating a subset of the Scripture to be loaded, where 0 stands for the whole Bible, 1 for OT, 2 for NT, or one of the ten categories [‘tor’, ‘oth’, ‘ket’, ‘map’, ‘mip’, ‘gos’, ‘nth’, ‘pau’, ‘epi’, ‘apo’] (See the package’s internal file ‘data/book_cat.csv’), default to 0

  • group (bool, optional) – The flag indicating whether the loaded text is grouped by chapter, default to True

Returns:

The collection of Scripture loaded

Return type:

pandas.DataFrame

wordtm.tm.load_text(textfile, *, timing=False, code=0)[source]

Loads and returns the list of documents from the prescribed file (‘textfile’).

Parameters:

textfile (str) – The prescribed text file from which the text is loaded, default to None

Returns:

The list of documents loaded

Return type:

list

wordtm.tm.nmf_process(doc_file, cat=0, chi=False, group=True, eval=False, *, timing=False, code=0)[source]

Pipelines the NMF modeling.

Parameters:
  • doc_file (str) – The filename of the prescribed text file to be loaded, default to None

  • cat (int or str, optional) – The category indicating a subset of the Scripture to be loaded, where 0 stands for the whole Bible, 1 for OT, 2 for NT, or one of the ten categories [‘tor’, ‘oth’, ‘ket’, ‘map’, ‘mip’, ‘gos’, ‘nth’, ‘pau’, ‘epi’, ‘apo’] (See the package’s internal file ‘data/book_cat.csv’), default to 0

  • chi (bool, optional) – The flag indicating whether the text is processed as Chinese (True) or English (False), default to False

  • group (bool, optional) – The flag indicating whether the loaded text is grouped by chapter, default to True

  • eval (bool, optional) – The flag indicating whether the model evaluation results will be shown, default to False

Returns:

The pipelined NMF

Return type:

wordtm.tm.NMF object

wordtm.tm.process_text(doc, *, timing=False, code=0)[source]

Processes the English text through tokenization, conversion to lower case, removal of all digits, stemming, and removal of punctuation and stopwords.

Parameters:

doc (str) – The prescribed text, in form of a string, to be processed, default to None

Returns:

The list of the processed strings

Return type:

list
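
A hedged sketch of the preprocessing steps just listed (stemming is omitted, and a tiny stopword set stands in for NLTK’s; the package’s own implementation may differ):

```python
import string

STOPWORDS = {"in", "the", "and", "was"}   # tiny stand-in for NLTK's stopword list

def process_text(doc):
    """Sketch: tokenize, lowercase, strip digits/punctuation, drop stopwords."""
    table = str.maketrans("", "", string.punctuation + string.digits)
    tokens = (w.translate(table) for w in doc.lower().split())
    return [w for w in tokens if w and w not in STOPWORDS]

print(process_text("In the beginning, God created 2 heavens!"))
```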

wordtm.util module

wordtm.util.add_chi_vocab(*, timing=False, code=0)[source]

Loads the Chinese Bible vocabulary from the internal file ‘bible_vocab.txt’, and adds it to the Jieba word list for future tokenization.

wordtm.util.chi_sent_terms(text, *, timing=False, code=0)[source]

Returns the list of Chinese words tokenized from the input text.

Parameters:

text (str) – The input Chinese text to be tokenized, default to None

Returns:

The list of Chinese words

Return type:

list

wordtm.util.chi_stops(*, timing=False, code=0)[source]

Loads the common Chinese (Traditional) vocabulary to Jieba for future tokenization, and the Chinese stopwords for future wordcloud plotting.

Returns:

The list of stopwords for wordcloud plotting

Return type:

list

wordtm.util.clean_text(df, *, timing=False, code=0)[source]

Cleans the text from the Scripture stored in the DataFrame ‘df’, by removing all digits, replacing newline by a space, removing English stopwords, converting all characters to lower case, and removing all characters except alphanumeric and whitespace.

Parameters:

df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None

Returns:

The cleaned text in a DataFrame

Return type:

pandas.DataFrame

wordtm.util.extract(df, testament=-1, category='', book=0, chapter=0, verse=0, *, timing=False, code=0)[source]

Extracts a subset of the Scripture stored in a DataFrame by testament, category, or book/chapter/verse.

Parameters:
  • df (pandas.DataFrame) – The collection of the Bible Scripture with columns ‘book’, ‘book_no’, ‘chapter’, ‘verse’, ‘text’, ‘testament’, ‘category’, ‘cat’, and ‘cat_no’, default to None

  • testament (int, optional) – The prescribed testament to be extracted, -1 stands for no prescription, 0 for OT, or 1 for NT, default to -1

  • category (str, optional) – The prescribed category to be extracted, and it should be either a full category name or a short name with 3 lower-case letters from a list of 10 categories, default to ‘’

  • book (str or int, optional) – The prescribed Bible book to be extracted, either a 3-letter short book name or a book number from 1 to 66, default to 0

  • chapter (int or tuple, optional) – The prescribed chapter or a tuple indicating the range of chapters of a Bible book to be extracted, default to 0

  • verse (int or tuple, optional) – The prescribed verse or a tuple indicating the range of verses from a chapter of a Bible book to be extracted, default to 0

Returns:

The subset of the input Scripture, if any, otherwise, the message ‘No scripture is extracted!’

Return type:

pandas.DataFrame or str

wordtm.util.extract2(df, filter='', *, timing=False, code=0)[source]

Extracts the Bible Scripture through a specific filter string by invoking the function ‘util.extract’.

Parameters:
  • df (pandas.DataFrame) – The collection of the Bible Scripture, default to None

  • filter (str, optional) – The prescribed filter string with the format ‘<book> <chapter>:<verse>[-<verse2>]’ for extracting a range of verses in the Scripture, default to ‘’

Returns:

The prescribed range of verses from the input Scripture, or the whole Scripture if the filter string is empty

Return type:

pandas.DataFrame
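
The filter format ‘&lt;book&gt; &lt;chapter&gt;:&lt;verse&gt;[-&lt;verse2&gt;]’ can be parsed with a regular expression along these lines (an illustrative sketch, not the package’s code):

```python
import re

def parse_filter(filter_str):
    """Sketch: parse 'Gen 1:1-3' into (book, chapter, verse, verse2)."""
    m = re.fullmatch(r"(\w+)\s+(\d+):(\d+)(?:-(\d+))?", filter_str.strip())
    if not m:
        return None
    book, chap, v1, v2 = m.groups()
    # A single verse is treated as a one-verse range
    return book, int(chap), int(v1), int(v2 or v1)

print(parse_filter("Gen 1:1-3"))
```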

wordtm.util.get_diction(docs, *, timing=False, code=0)[source]

Determines the target language, English or Chinese, and builds a dictionary of words with their frequencies accordingly.

Parameters:

docs (pandas.DataFrame or list) – The collection of documents, default to None

Returns:

The dictionary of words with their frequencies

Return type:

dict

wordtm.util.get_diction_chi(docs, *, timing=False, code=0)[source]

Tokenizes the collection of Chinese documents and builds a dictionary of words with their frequencies.

Parameters:

docs (pandas.DataFrame or list) – The collection of text, default to None

Returns:

The dictionary of words with their frequencies

Return type:

dict

wordtm.util.get_diction_en(docs, *, timing=False, code=0)[source]

Tokenizes the collection of English documents and builds a dictionary of words with their frequencies.

Parameters:

docs (pandas.DataFrame or list) – The collection of text, default to None

Returns:

The dictionary of words with their frequencies

Return type:

dict

wordtm.util.get_list(df, column='book', *, timing=False, code=0)[source]

Extracts and returns the prescribed column from the Scripture stored in the DataFrame ‘df’.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None

  • column (str, optional) – The column by which the Scripture is grouped, default to ‘book’

Returns:

The grouped Scripture

Return type:

pandas.DataFrame

wordtm.util.get_sent_terms(text, *, timing=False, code=0)[source]

Determines how to tokenize the input text, based on the global language setting, either English (‘en’) or Traditional Chinese (‘chi’).

Parameters:

text (str) – The input text to be tokenized, default to None

Returns:

The list of tokenized words

Return type:

list

wordtm.util.get_text(df, *, timing=False, code=0)[source]

Extracts and returns the text from the Scripture stored in the DataFrame ‘df’ after joining the list of text into a string and removing all the ideographic spaces (‘　’) from the text.

Parameters:

df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None

Returns:

The extracted text

Return type:

str

wordtm.util.get_text_list(df, *, timing=False, code=0)[source]

Extracts and returns the list of text from the Scripture stored in the DataFrame ‘df’ after removing all the ideographic spaces (‘　’) from the text.

Parameters:

df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None

Returns:

The extracted text

Return type:

list

wordtm.util.group_text(df, column='chapter', *, timing=False, code=0)[source]

Groups the Bible Scripture in the DataFrame ‘df’ by the prescribed column, and ‘df’ should include columns ‘book’, ‘book_no’, ‘chapter’, ‘verse’, ‘text’, ‘testament’, ‘category’, ‘cat’, and ‘cat_no’.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame storing the Scripture, default to None

  • column (str, optional) – The column by which the Scripture is grouped, default to ‘chapter’

Returns:

The grouped Scripture

Return type:

pandas.DataFrame
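
One plausible grouping behaviour, joining verse text within each chapter, can be sketched with pandas groupby (an illustration with a toy frame, not the package’s code):

```python
import pandas as pd

df = pd.DataFrame({
    "book":    ["Gen", "Gen", "Gen"],
    "chapter": [1, 1, 2],
    "text":    ["In the beginning", "God created", "Thus the heavens"],
})

# Join verse text within each (book, chapter) group into one string per chapter
grouped = df.groupby(["book", "chapter"])["text"].apply(" ".join).reset_index()
print(grouped)
```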

wordtm.util.is_chi(*, timing=False, code=0)[source]

Checks whether the Chinese language flag is set.

Returns:

True if the Chinese language flag (chi_flag) is set, False otherwise

Return type:

bool

wordtm.util.load_text(filepath, nr=0, info=False, *, timing=False, code=0)[source]

Loads and returns the text from the prescribed file path (‘filepath’).

Parameters:
  • filepath (str) – The prescribed filepath from which the text is loaded, default to None

  • nr (int, optional) – The number of rows of text to be loaded; 0 represents all rows, default to 0

  • info (bool, optional) – The flag whether the dataset information is shown, default to False

Returns:

The collection of text with the prescribed number of rows loaded

Return type:

pandas.DataFrame

wordtm.util.load_word(ver='web.csv', nr=0, info=False, *, timing=False, code=0)[source]

Loads and returns the text from the prescribed internal file (‘ver’).

Parameters:
  • ver (str, optional) – The package’s internal Bible text from which the text is loaded, either World English Bible (‘web.csv’) or Chinese Union Version (Traditional)(‘cuv.csv’), default to ‘web.csv’

  • nr (int, optional) – The number of rows of Scripture to be loaded; 0 represents all rows, default to 0

  • info (bool, optional) – The flag whether the dataset information is shown, default to False

Returns:

The collection of Scripture with the prescribed number of rows loaded

Return type:

pandas.DataFrame

wordtm.util.set_lang(lang='en', *, timing=False, code=0)[source]

Sets the language (‘lang’) for subsequent text processing.

Parameters:

lang (str, optional) – The prescribed language for text processing, where ‘en’ stands for English or ‘chi’ for Traditional Chinese, default to ‘en’

wordtm.version module

wordtm.viz module

wordtm.viz.chi_wordcloud(docs, image='heart.jpg', *, timing=False, code=0)[source]

Prepares and shows a Chinese wordcloud.

Parameters:
  • docs (pandas.DataFrame) – The collection of Chinese documents for preparing a wordcloud, default to None

  • image (str, optional) – The filename of the image as the mask of the wordcloud, default to ‘heart.jpg’

wordtm.viz.plot_cloud(wordcloud, *, timing=False, code=0)[source]

Plots the prepared ‘wordcloud’.

Parameters:

wordcloud (WordCloud object) – The WordCloud object for plotting, default to None

wordtm.viz.show_wordcloud(docs, image='heart.jpg', *, timing=False, code=0)[source]

Prepares and shows a wordcloud.

Parameters:
  • docs (pandas.DataFrame) – The collection of documents for preparing a wordcloud, default to None

  • image (str, optional) – The filename of the image as the mask of the wordcloud, default to ‘heart.jpg’

Module contents