PreProcessing

The classes for data preparation with nltk, sklearn, and more

A collection of methods to simplify your code.

smltk.preprocessing.Ntk

The class Ntk contains the Natural Language Processing tool kit.

Natural Language Tool Kit

smltk.preprocessing.Ntk.get_tokens_cleaned

Tokenizes the doc; filters out punctuation, numbers, stop words, and words of length <= min_length; converts uppercase characters to lowercase; and, if is_lemma is True, also lemmatizes

smltk.preprocessing.Ntk.get_doc_cleaned

Filters punctuation, numbers, stop words, and words of length <= min_length out of the doc; converts uppercase characters to lowercase; and, if is_lemma is True, also lemmatizes

smltk.preprocessing.Ntk.get_stats_vocab

Gets statistics of vocabulary

smltk.preprocessing.Ntk.get_words_top

Gets the top words for each target

smltk.preprocessing.Ntk.get_vocabs_cleaned

Removes words common among the targets from the vocabs

smltk.preprocessing.Ntk.get_ngrams

Gets ngrams from doc or tokens

smltk.preprocessing.Ntk.get_ngrams_features

Gets ngrams features from doc or tokens

smltk.preprocessing.Ntk.get_features

Gets features from a doc

smltk.preprocessing.Ntk.get_features_from_docs

Gets features from a list of docs

smltk.preprocessing.Ntk.create_tuples

Creates tuples pairing each sample with its target

smltk.preprocessing.Ntk.create_vocab_from_docs

Creates vocabulary from list of docs

smltk.preprocessing.Ntk.create_vocab_from_tuples

Creates vocabulary from list of tuples

smltk.preprocessing.Ntk.create_features_from_docs

Creates features from docs

smltk.preprocessing.Ntk.create_features_from_tuples

Creates features from tuples

smltk.preprocessing.Ntk.create_ngrams_features_from_docs

Creates ngrams features from docs

smltk.preprocessing.Ntk.create_ngrams_features_from_tuples

Creates ngrams features from tuples

smltk.preprocessing.Ntk.create_words_map

Creates the map of words

smltk.preprocessing.Ntk.create_words_cloud

Creates the cloud of words

smltk.preprocessing.Ntk.vectorize_docs

Vectorizes docs

Detailed list

class smltk.preprocessing.Ntk(params=[])

The class Ntk contains the Natural Language Processing tool kit.

Arguments: params (dict) with the keys below
language (str):

default is english

lemmatizer (obj):

object with a lemmatize() method, like WordNetLemmatizer()

min_length (int):

default is 1, so all words will be used

stop_words (list[str]):

default is stopwords.words()

tag_map (dict):

default contains J, V, R

Here’s an example:

>>> from smltk.preprocessing import Ntk
>>> doc = 'Good case, Excellent value.'
>>> ntk = Ntk()
>>> get_doc_cleaned = ntk.get_doc_cleaned(doc)
>>> print(get_doc_cleaned)
good case excellent value

add_doc_to_vocab(doc, vocab, is_lemma=True)

Adds the tokens of the doc to the vocabulary and updates it

Arguments:
doc (str):

text

vocab (collections.Counter):

dictionary of tokens with their counts

is_lemma (bool):

default is True

Returns:

list of tokens of that doc, with vocab updated
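
A minimal sketch of updating a vocabulary in place; assuming the default cleaning shown in the class example above, this doc should contribute four cleaned tokens:

>>> from collections import Counter
>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> vocab = Counter()
>>> tokens = ntk.add_doc_to_vocab('Good case, Excellent value.', vocab)
>>> sorted(vocab.keys())
['case', 'excellent', 'good', 'value']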

create_features_from_docs(docs, target, is_lemma=True, words_top={}, degree=0)

Creates features from docs

Arguments:
docs (list[str]):

list of texts

target (str):

target name of the docs

is_lemma (bool):

default is True

words_top (dict):

dictionary of the top words

degree (int):

degree of ngrams, default is 0

Returns:

list of tuples with features and the related target
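
For instance, labelling a small list of docs with a hypothetical target name 'pos'; each resulting tuple should pair a feature dictionary with that target:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> docs = ['Good case, Excellent value.']
>>> features = ntk.create_features_from_docs(docs, 'pos')
>>> features[0][1]
'pos'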

create_features_from_tuples(tuples, is_lemma=True, words_top={}, degree=0)

Creates features from tuples

Arguments:
tuples (list[tuples]):

list of tuples with a sample and its target

is_lemma (bool):

default is True

words_top (dict):

dictionary of the top words

degree (int):

degree of ngrams, default is 0

Returns:

list of tuples with features and the related target
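
A sketch that builds the tuples first and then extracts features from them; 'pos' is again a hypothetical label, and one feature tuple is expected per input tuple:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> tuples = ntk.create_tuples(['Good case, Excellent value.'], ['pos'])
>>> features = ntk.create_features_from_tuples(tuples)
>>> len(features) == len(tuples)
True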

create_ngrams_features_from_docs(docs, target, is_lemma=True, degree=2)

Creates ngrams features from docs

Arguments:
docs (list[str]):

list of texts

target (str):

target name of the docs

is_lemma (bool):

default is True

degree (int):

degree of ngrams, default is 2

Returns:

list of tuples with features and the related target
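
For example, with degree=2 the features should be built on bigrams; 'pos' is a hypothetical target name:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> docs = ['Good case, Excellent value.']
>>> features = ntk.create_ngrams_features_from_docs(docs, 'pos', degree=2)
>>> features[0][1]
'pos'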

create_ngrams_features_from_tuples(tuples, is_lemma=True, degree=2)

Creates ngrams features from tuples

Arguments:
tuples (list[tuples]):

list of tuples with a sample and its target

is_lemma (bool):

default is True

degree (int):

degree of ngrams, default is 2

Returns:

list of tuples with features and the related target
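
The same sketch starting from tuples instead of docs, with a hypothetical 'pos' label:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> tuples = ntk.create_tuples(['Good case, Excellent value.'], ['pos'])
>>> features = ntk.create_ngrams_features_from_tuples(tuples, degree=2)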

create_tuples(docs=[], target=[])

Creates tuples pairing each sample with its target

Arguments:
docs (list[str]):

list of texts

target (list[str]):

list of targets

Returns:

list of tuples, each with a sample and its target
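
A minimal sketch; given parallel lists of samples and targets, the expected pairing is:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> ntk.create_tuples(['Good case, Excellent value.'], ['pos'])
[('Good case, Excellent value.', 'pos')]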

create_vocab_from_docs(docs, is_lemma=True)

Creates vocabulary from list of docs

Arguments:
docs (list[str]):

list of texts

is_lemma (bool):

default is True

Returns:

dictionary of tokens with their counts, as a collections.Counter object
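
Assuming the default cleaning shown in the class example, the vocabulary should count each cleaned token once:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> vocab = ntk.create_vocab_from_docs(['Good case, Excellent value.'])
>>> vocab['case']
1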

create_vocab_from_tuples(tuples, is_lemma=True)

Creates vocabulary from list of tuples

Arguments:
tuples (list[tuples]):

list of tuples with a sample and its target

is_lemma (bool):

default is True

Returns:

dictionary of tokens with their counts, as a collections.Counter object
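
The same sketch starting from tuples, with a hypothetical 'pos' label:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> tuples = ntk.create_tuples(['Good case, Excellent value.'], ['pos'])
>>> vocab = ntk.create_vocab_from_tuples(tuples)
>>> vocab['case']
1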

create_words_cloud(words, is_test=False)

Creates the cloud of words

Arguments:
words (str):

string of words

is_test (bool):

default is False

Returns:

only the word cloud plot
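
A sketch chaining create_words_map into the plot; is_test=True is assumed here to suppress rendering:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> words = ntk.create_words_map(ntk.get_tokens_cleaned('Good case, Excellent value.'))
>>> ntk.create_words_cloud(words, is_test=True)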

create_words_map(words)

Creates the map of words

Arguments:
words (list[str]):

list of words

Returns:

string of all words
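
A minimal sketch; the word list is assumed to be joined into one space-separated string:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> ntk.create_words_map(['good', 'case', 'excellent', 'value'])
'good case excellent value'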

get_doc_cleaned(doc, is_lemma=True)

Filters punctuation, numbers, stop words, and words of length <= min_length out of the doc; converts uppercase characters to lowercase; and, if is_lemma is True, also lemmatizes

Arguments:
doc (str):

text

is_lemma (bool):

default is True

Returns:

cleaned string

get_features(doc, is_lemma=True, words_top={}, degree=0)

Gets features from a doc

Arguments:
doc (str):

text

is_lemma (bool):

default is True

words_top (dict):

dictionary of the top words

degree (int):

degree of ngrams, default is 0

Returns:

dictionary of extracted features
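
A minimal sketch; the result is a feature dictionary for a single doc:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> features = ntk.get_features('Good case, Excellent value.')
>>> isinstance(features, dict)
True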

get_features_from_docs(docs, is_lemma=True, words_top={}, degree=0)

Gets features from a list of docs

Arguments:
docs (list[str]):

list of texts

is_lemma (bool):

default is True

words_top (dict):

dictionary of the top words

degree (int):

degree of ngrams, default is 0

Returns:

dictionary of extracted features
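
The same sketch over a list of docs:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> features = ntk.get_features_from_docs(['Good case, Excellent value.', 'Excellent quality.'])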

get_ngrams(degree=2, doc='', tokens=[], is_tuple=True, is_lemma=False)

Gets ngrams from doc or tokens

Arguments:
degree (int):

degree of ngrams, default is 2

doc (str):

text; optional if you pass tokens

tokens (list[str]):

list of tokens; optional if you pass doc

is_tuple (bool):

default is True

is_lemma (bool):

default is False

Returns:

list of tuples (ngrams) for that degree, or a list of strings (tokens)
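
For example, bigrams from an already cleaned doc; with is_tuple=True (the default) each ngram should be a tuple:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> ngrams = ntk.get_ngrams(degree=2, doc='good case excellent value')
>>> ngrams[0]
('good', 'case')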

get_ngrams_features(degree=2, doc='', tokens=[], is_lemma=False)

Gets ngrams features from doc or tokens

Arguments:
degree (int):

degree of ngrams, default is 2

doc (str):

text; optional if you pass tokens

tokens (list[str]):

list of tokens; optional if you pass doc

is_lemma (bool):

default is False

Returns:

dictionary of extracted ngrams
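
The same sketch, but returning the ngrams as a feature dictionary:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> features = ntk.get_ngrams_features(degree=2, doc='good case excellent value')
>>> isinstance(features, dict)
True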

get_stats_vocab(vocab, min_occurance=1)

Gets statistics of vocabulary

Arguments:
vocab (collections.Counter):

dictionary of tokens with their counts

min_occurance (int):

minimum occurrence considered

Returns:

tuple with the number of tokens occurring at least min_occurance times and the total number of tokens
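
For instance, with a vocabulary in which every token occurs once, both numbers should coincide:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> vocab = ntk.create_vocab_from_docs(['Good case, Excellent value.'])
>>> ntk.get_stats_vocab(vocab, min_occurance=1)
(4, 4)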

get_tokens_cleaned(doc, is_lemma=True)

Tokenizes the doc; filters out punctuation, numbers, stop words, and words of length <= min_length; converts uppercase characters to lowercase; and, if is_lemma is True, also lemmatizes

Arguments:
doc (str):

text

is_lemma (bool):

default is True

Returns:

list of cleaned tokens
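
This mirrors the class example above, returning the tokens instead of a joined string:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> ntk.get_tokens_cleaned('Good case, Excellent value.')
['good', 'case', 'excellent', 'value']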

get_vocabs_cleaned(vocabs)

Removes words common among the targets from the vocabs

Arguments:
vocabs (dict):

keys are targets, values are the vocabularies for those targets

Returns:

vocabs cleaned of the words common among targets
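
A sketch with two hypothetical targets; words appearing in both vocabularies, here 'case' and 'value', should be dropped from each:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> vocabs = {
...     'pos': ntk.create_vocab_from_docs(['Good case, Excellent value.']),
...     'neg': ntk.create_vocab_from_docs(['Bad case, poor value.']),
... }
>>> cleaned = ntk.get_vocabs_cleaned(vocabs)
>>> 'case' in cleaned['pos']
False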

get_words_top(vocab, how_many)

Gets the top words for each target

Arguments:
vocab (collections.Counter):

dictionary of tokens with their counts

how_many (int):

how many words to keep in the top list

Returns:

dictionary of the top how_many words
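
A sketch asking for the two most frequent words; the exact structure of the returned dictionary is defined by the library:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> vocab = ntk.create_vocab_from_docs(['Good case, Excellent value.'])
>>> top = ntk.get_words_top(vocab, 2)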

lemmatize(tokens)

Lemmatizes tokens

Arguments:
tokens (list[str]):

list of words

Returns:

list of lemmatized tokens
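
Assuming the default WordNetLemmatizer, plural nouns should reduce to their singular form:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> ntk.lemmatize(['cases', 'values'])
['case', 'value']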

tokenize_and_clean_doc(doc)

Tokenizes the doc; filters out punctuation, numbers, stop words, and words of length <= min_length; and converts uppercase characters to lowercase

Arguments:
doc (str):

text

Returns:

list of filtered words
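
This is the cleaning step without lemmatization; with the defaults, the doc from the class example should reduce to:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> ntk.tokenize_and_clean_doc('Good case, Excellent value.')
['good', 'case', 'excellent', 'value']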

vectorize_docs(docs, is_count=True, is_lemma=False, is_test=False)

Vectorizes docs

Arguments:
docs (list[str]):

list of texts

is_count (bool):

default is True

is_lemma (bool):

default is False

is_test (bool):

default is False

Returns:

list of scipy.sparse.csr.csr_matrix, one for each doc
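
A minimal sketch; per the return description, one sparse matrix is expected for each doc:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> docs = ['Good case, Excellent value.', 'Bad case.']
>>> vectors = ntk.vectorize_docs(docs)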

word_tokenize(doc)

Splits the document into single words

Arguments:
doc (str):

text

Returns:

list of words
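
A minimal sketch; this is plain word tokenization, without any cleaning:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> ntk.word_tokenize('Good case')
['Good', 'case']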