PreProcessing
Classes for data preparation with nltk, sklearn, and more.
A collection of methods to simplify your code.
The class Ntk (Natural Language Tool Kit) contains the Natural Language Processing tool kit.

| Method | Description |
| --- | --- |
| get_tokens_cleaned | Tokenizes the doc, filters out punctuation, numbers, stop words, and words with length <= min_length, lowercases everything, and lemmatizes if is_lemma == True |
| get_doc_cleaned | Filters punctuation, numbers, stop words, and words with length <= min_length out of the doc, lowercases everything, and lemmatizes if is_lemma == True |
| get_stats_vocab | Gets statistics of a vocabulary |
| get_words_top | Gets the top words for each target |
| get_vocabs_cleaned | Cleans the vocabs from the words common among targets |
| get_ngrams | Gets ngrams from a doc or tokens |
| get_ngrams_features | Gets ngrams features from a doc or tokens |
| get_features | Gets features from a doc |
| get_features_from_docs | Gets features from docs |
| create_tuples | Creates tuples with each sample and its target |
| create_vocab_from_docs | Creates a vocabulary from a list of docs |
| create_vocab_from_tuples | Creates a vocabulary from a list of tuples |
| create_features_from_docs | Creates features from docs |
| create_features_from_tuples | Creates features from tuples |
| create_ngrams_features_from_docs | Creates ngrams features from docs |
| create_ngrams_features_from_tuples | Creates ngrams features from tuples |
| create_words_map | Creates the map of words |
| create_words_cloud | Creates the cloud of words |
| vectorize_docs | Vectorizes docs |
Detailed list
- class smltk.preprocessing.Ntk(params=[])
The class Ntk contains the Natural Language Processing tool kit.
- Arguments: params (dict) with the keys below
- language (str):
default is english
- lemmatizer (obj):
obj with method lemmatize() like WordNetLemmatizer()
- min_length (int):
default is 1, so all words will be used
- stop_words (list[str]):
default is stopwords.words()
- tag_map (dict):
default contains J, V, R
Here's an example:

```python
>>> from smltk.preprocessing import Ntk
>>> doc = 'Good case, Excellent value.'
>>> ntk = Ntk()
>>> get_doc_cleaned = ntk.get_doc_cleaned(doc)
>>> print(get_doc_cleaned)
good case excellent value
```
- add_doc_to_vocab(doc, vocab, is_lemma=True)
Adds the tokens of that doc to the vocabulary and updates it
- Arguments:
    - doc (str):
      text
    - vocab (collections.Counter):
      dictionary of tokens with their counts
- is_lemma (bool):
default is True
- Returns:
list of tokens of that doc, and the updated vocab
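A minimal usage sketch (the doc is invented sample text; an empty collections.Counter serves as a fresh vocabulary):

```python
from collections import Counter
from smltk.preprocessing import Ntk

ntk = Ntk()
vocab = Counter()
# returns the doc's cleaned tokens and updates vocab with their counts
tokens = ntk.add_doc_to_vocab('Good case, Excellent value.', vocab)
```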
- create_features_from_docs(docs, target, is_lemma=True, words_top={}, degree=0)
Creates features from docs
- Arguments:
- docs (list[str]):
list of texts
- target (str):
  target name of the docs
- is_lemma (bool):
  default is True
- words_top (dict):
  dictionary of the top words
- degree (int):
degree of ngrams, default is 0
- Returns:
list of tuples with features and relative target
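A sketch of building features for one class of documents (the docs and the target name 'pos' are invented sample data):

```python
from smltk.preprocessing import Ntk

ntk = Ntk()
docs = ['Good case, Excellent value.', 'Excellent protection.']
# each doc becomes a (features, 'pos') tuple
features = ntk.create_features_from_docs(docs, 'pos')
```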
- create_features_from_tuples(tuples, is_lemma=True, words_top={}, degree=0)
Creates features from tuples
- Arguments:
- tuples (list[tuples]):
list of tuples with sample and its target
- is_lemma (bool):
default is True
- words_top (dict):
dictionary of the top words
- degree (int):
degree of ngrams, default is 0
- Returns:
list of tuples with features and relative target
- create_ngrams_features_from_docs(docs, target, is_lemma=True, degree=2)
Creates ngrams features from docs
- Arguments:
- docs (list[str]):
list of texts
- target (str):
target name of the docs
- is_lemma (bool):
default is True
- degree (int):
degree of ngrams, default is 2
- Returns:
list of tuples with features and relative target
- create_ngrams_features_from_tuples(tuples, is_lemma=True, degree=2)
Creates ngrams features from tuples
- Arguments:
- tuples (list[tuples]):
list of tuples with sample and its target
- is_lemma (bool):
default is True
- degree (int):
degree of ngrams, default is 2
- Returns:
list of tuples with features and relative target
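A sketch combining create_tuples (documented below) with bigram features (the sample doc and label are invented):

```python
from smltk.preprocessing import Ntk

ntk = Ntk()
tuples = ntk.create_tuples(['Good case, Excellent value.'], ['pos'])
# each sample becomes a (features, target) tuple, features built on bigrams
features = ntk.create_ngrams_features_from_tuples(tuples, degree=2)
```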
- create_tuples(docs=[], target=[])
Creates tuples with sample and its target
- Arguments:
- docs (list[str]):
list of texts
- target (list[str]):
list of targets
- Returns:
list of tuples with sample and its target
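For example, pairing invented docs with their labels:

```python
from smltk.preprocessing import Ntk

ntk = Ntk()
docs = ['Good case, Excellent value.', 'Horrible case, it broke.']
target = ['pos', 'neg']
# e.g. [('Good case, Excellent value.', 'pos'), ('Horrible case, it broke.', 'neg')]
tuples = ntk.create_tuples(docs=docs, target=target)
```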
- create_vocab_from_docs(docs, is_lemma=True)
Creates vocabulary from list of docs
- Arguments:
- docs (list[str]):
list of texts
- is_lemma (bool):
default is True
- Returns:
dictionary of tokens with their counts, as a collections.Counter object
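A sketch (the sample docs are invented); since the result is a collections.Counter, its usual methods apply:

```python
from smltk.preprocessing import Ntk

ntk = Ntk()
docs = ['Good case, Excellent value.', 'Good protection.']
vocab = ntk.create_vocab_from_docs(docs)
print(vocab.most_common(3))  # the three most frequent tokens
```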
- create_vocab_from_tuples(tuples, is_lemma=True)
Creates vocabulary from list of tuples
- Arguments:
- tuples (list[tuples]):
list of tuples with sample and its target
- is_lemma (bool):
default is True
- Returns:
dictionary of tokens with their counts, as a collections.Counter object
- create_words_cloud(words, is_test=False)
Creates the cloud of words
- Arguments:
- words (str):
string of words
- is_test (bool):
  default is False
- Returns:
  nothing, it only plots the cloud of words
- create_words_map(words)
Creates the map of words
- Arguments:
- words (list[str]):
list of words
- Returns:
string of all words
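A sketch chaining the two methods above: create_words_map builds the string that create_words_cloud plots (the words list is invented):

```python
from smltk.preprocessing import Ntk

ntk = Ntk()
words = ['good', 'case', 'excellent', 'value', 'good']
words_map = ntk.create_words_map(words)  # one string with all the words
ntk.create_words_cloud(words_map)        # plots the cloud, returns nothing
```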
- get_doc_cleaned(doc, is_lemma=True)
Filters punctuation, numbers, stop words, and words with length <= min_length out of the doc, lowercases everything, and lemmatizes if is_lemma == True
- Arguments:
- doc (str):
text
- is_lemma (bool):
default is True
- Returns:
string cleaned
- get_features(doc, is_lemma=True, words_top={}, degree=0)
Gets features from a doc
- Arguments:
- doc (str):
text
- is_lemma (bool):
default is True
- words_top (dict):
dictionary of the top words
- degree (int):
degree of ngrams, default is 0
- Returns:
dictionary of features extracted
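A sketch on an invented doc; the exact keys of the returned dictionary depend on the implementation:

```python
from smltk.preprocessing import Ntk

ntk = Ntk()
features = ntk.get_features('Good case, Excellent value.')
print(features)  # dictionary of the features extracted from the doc
```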
- get_features_from_docs(docs, is_lemma=True, words_top={}, degree=0)
Gets features from docs
- Arguments:
- docs (list[str]):
list of texts
- is_lemma (bool):
  default is True
- words_top (dict):
  dictionary of the top words
- degree (int):
degree of ngrams, default is 0
- Returns:
dictionary of features extracted
- get_ngrams(degree=2, doc='', tokens=[], is_tuple=True, is_lemma=False)
Gets ngrams from doc or tokens
- Arguments:
- degree (int):
degree of ngrams, default is 2
- doc (str):
  text, optional if you pass tokens
- tokens (list[str]):
  list of tokens, optional if you pass doc
- is_tuple (bool):
default is True
- is_lemma (bool):
default is False
- Returns:
list of tuples (ngrams) of that degree if is_tuple == True, otherwise list of strings (tokens)
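A sketch on an invented doc, showing both return shapes:

```python
from smltk.preprocessing import Ntk

ntk = Ntk()
doc = 'Good case, Excellent value.'
bigrams = ntk.get_ngrams(degree=2, doc=doc)                 # list of 2-tuples
tokens = ntk.get_ngrams(degree=1, doc=doc, is_tuple=False)  # list of strings
```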
- get_ngrams_features(degree=2, doc='', tokens=[], is_lemma=False)
Gets ngrams features from doc or tokens
- Arguments:
- degree (int):
degree of ngrams, default is 2
- doc (str):
  text, optional if you pass tokens
- tokens (list[str]):
  list of tokens, optional if you pass doc
- is_lemma (bool):
default is False
- Returns:
dictionary of ngrams extracted
- get_stats_vocab(vocab, min_occurance=1)
Gets statistics of vocabulary
- Arguments:
- vocab (collections.Counter):
dictionary of tokens with their counts
- min_occurance (int):
  minimum number of occurrences considered
- Returns:
  tuple with the number of tokens occurring at least min_occurance times and the total number of tokens
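A sketch (the sample docs are invented):

```python
from smltk.preprocessing import Ntk

ntk = Ntk()
vocab = ntk.create_vocab_from_docs(['Good case, Excellent value.', 'Good protection.'])
# kept: tokens occurring at least twice; total: all tokens in the vocabulary
kept, total = ntk.get_stats_vocab(vocab, min_occurance=2)
```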
- get_tokens_cleaned(doc, is_lemma=True)
Tokenizes the doc, filters out punctuation, numbers, stop words, and words with length <= min_length, lowercases everything, and lemmatizes if is_lemma == True
- Arguments:
- doc (str):
text
- is_lemma (bool):
default is True
- Returns:
list of tokens cleaned
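A sketch on the same doc as the class-level example; since get_doc_cleaned returns 'good case excellent value' there, the token list should look like the inferred output below:

```python
from smltk.preprocessing import Ntk

ntk = Ntk()
tokens = ntk.get_tokens_cleaned('Good case, Excellent value.')
print(tokens)  # e.g. ['good', 'case', 'excellent', 'value']
```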
- get_vocabs_cleaned(vocabs)
Cleans the vocabs from the words common among targets
- Arguments:
- vocabs (dict):
keys are targets, values are vocabularies for that target
- Returns:
the vocabs cleaned of the words common among targets
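A sketch with two invented targets; tokens appearing under both should be dropped from each vocabulary:

```python
from smltk.preprocessing import Ntk

ntk = Ntk()
vocabs = {
    'pos': ntk.create_vocab_from_docs(['Good case, Excellent value.']),
    'neg': ntk.create_vocab_from_docs(['Horrible case, bad value.']),
}
# e.g. 'case' and 'value' occur under both targets, so they should be removed
vocabs_cleaned = ntk.get_vocabs_cleaned(vocabs)
```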
- get_words_top(vocab, how_many)
Gets the top words for each target
- Arguments:
    - vocab (collections.Counter):
      dictionary of tokens with their counts
    - how_many (int):
      how many words to include in the top list
- Returns:
  dictionary of the top how_many words
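A sketch (the sample docs are invented):

```python
from smltk.preprocessing import Ntk

ntk = Ntk()
vocab = ntk.create_vocab_from_docs(['Good case, Excellent value.', 'Good protection.'])
top = ntk.get_words_top(vocab, how_many=2)  # the two most frequent words
```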
- lemmatize(tokens)
Lemmatizes tokens
- Arguments:
- tokens (list[str]):
list of words
- Returns:
list of tokens lemmatized
- tokenize_and_clean_doc(doc)
Tokenizes the doc, filters out punctuation, numbers, stop words, and words with length <= min_length, and lowercases everything
- Arguments:
- doc (str):
text
- Returns:
list of words filtered
- vectorize_docs(docs, is_count=True, is_lemma=False, is_test=False)
Vectorizes docs
- Arguments:
- docs (list[str]):
list of texts
- is_count (bool):
default is True
- is_lemma (bool):
default is False
- is_test (bool):
default is False
- Returns:
list of scipy.sparse.csr_matrix, one for each doc
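A sketch (the sample docs are invented); the result is a list of sparse scipy matrices:

```python
from smltk.preprocessing import Ntk

ntk = Ntk()
docs = ['Good case, Excellent value.', 'Horrible case, it broke.']
vectors = ntk.vectorize_docs(docs)  # count vectors, since is_count defaults to True
```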
- word_tokenize(doc)
Splits the document into single words
- Arguments:
- doc (str):
text
- Returns:
list of words
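A sketch chaining word_tokenize with lemmatize (documented above); the doc is invented:

```python
from smltk.preprocessing import Ntk

ntk = Ntk()
words = ntk.word_tokenize('Good case, Excellent value.')  # plain word split
lemmas = ntk.lemmatize(words)                             # lemmatized tokens
```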