PreProcessing

The classes for data preparation with nltk, sklearn, and more

A collection of methods to simplify your code.

smltk.preprocessing.Ntk

The class Ntk contains the Natural Language Processing tool kit.

Natural Language Tool Kit

smltk.preprocessing.Ntk.get_tokens_cleaned

Tokenizes the doc; filters out punctuation, numbers, stop words, and words of length <= min_length; converts uppercase characters to lowercase; and, if is_lemma is True, also lemmatizes

smltk.preprocessing.Ntk.get_doc_cleaned

Filters punctuation, numbers, stop words, and words of length <= min_length out of the doc; converts uppercase characters to lowercase; and, if is_lemma is True, also lemmatizes

smltk.preprocessing.Ntk.get_stats_vocab

Gets statistics of vocabulary

smltk.preprocessing.Ntk.get_words_top

Gets the top words for each target

smltk.preprocessing.Ntk.get_vocabs_cleaned

Removes words common among the targets from the vocabs

smltk.preprocessing.Ntk.get_ngrams

Gets ngrams from doc or tokens

smltk.preprocessing.Ntk.get_ngrams_features

Gets ngrams features from doc or tokens

smltk.preprocessing.Ntk.get_features

Gets features from a doc

smltk.preprocessing.Ntk.get_features_from_docs

Gets features from a list of docs

smltk.preprocessing.Ntk.create_tuples

Creates tuples pairing each sample with its target

smltk.preprocessing.Ntk.create_vocab_from_docs

Creates vocabulary from list of docs

smltk.preprocessing.Ntk.create_vocab_from_tuples

Creates vocabulary from list of tuples

smltk.preprocessing.Ntk.create_features_from_docs

Creates features from docs

smltk.preprocessing.Ntk.create_features_from_tuples

Creates features from tuples

smltk.preprocessing.Ntk.create_ngrams_features_from_docs

Creates ngrams features from docs

smltk.preprocessing.Ntk.create_ngrams_features_from_tuples

Creates ngrams features from tuples

smltk.preprocessing.Ntk.create_words_map

Creates the map of words

smltk.preprocessing.Ntk.create_words_cloud

Creates the cloud of words

smltk.preprocessing.Ntk.vectorize_docs

Vectorizes docs

Detailed list

class smltk.preprocessing.Ntk(params=[])

The class Ntk contains the Natural Language Processing tool kit.

Arguments: params (dict) with the keys below
language (str):

default is english

lemmatizer (obj):

object with a lemmatize() method, like WordNetLemmatizer()

min_length (int):

default is 1, so all words will be used

stop_words (list[str]):

default is stopwords.words()

tag_map (dict):

default contains J, V, R

Here’s an example:

>>> from smltk.preprocessing import Ntk
>>> doc = 'Good case, Excellent value.'
>>> ntk = Ntk()
>>> get_doc_cleaned = ntk.get_doc_cleaned(doc)
>>> print(get_doc_cleaned)
good case excellent value

add_doc_to_vocab(doc, vocab, is_lemma=True)

Adds the tokens of the doc to the vocabulary and updates it

Arguments:
doc (str):

text

vocab (collections.Counter):

dictionary of tokens with their counts

is_lemma (bool):

default is True

Returns:

list of tokens of that doc, with vocab updated
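
A minimal sketch of updating a vocabulary in place; assuming the default cleaning shown in the class example above, this doc should contribute four cleaned tokens:

>>> from collections import Counter
>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> vocab = Counter()
>>> tokens = ntk.add_doc_to_vocab('Good case, Excellent value.', vocab)
>>> sorted(vocab.keys())
['case', 'excellent', 'good', 'value']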

create_features_from_docs(docs, target, is_lemma=True, words_top={}, degree=0)

Creates features from docs

Arguments:
docs (list[str]):

list of texts

target (str):

target name of the docs

is_lemma (bool):

default is True

words_top (dict):

dictionary of the top words

degree (int):

degree of ngrams, default is 0

Returns:

list of tuples with features and the related target
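
For instance, labelling a small list of docs with a hypothetical target name 'pos'; each resulting tuple should pair a feature dictionary with that target:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> docs = ['Good case, Excellent value.']
>>> features = ntk.create_features_from_docs(docs, 'pos')
>>> features[0][1]
'pos'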

create_features_from_tuples(tuples, is_lemma=True, words_top={}, degree=0)

Creates features from tuples

Arguments:
tuples (list[tuples]):

list of tuples with a sample and its target

is_lemma (bool):

default is True

words_top (dict):

dictionary of the top words

degree (int):

degree of ngrams, default is 0

Returns:

list of tuples with features and the related target
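
A sketch that builds the tuples first and then extracts features from them; 'pos' is again a hypothetical label, and one feature tuple is expected per input tuple:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> tuples = ntk.create_tuples(['Good case, Excellent value.'], ['pos'])
>>> features = ntk.create_features_from_tuples(tuples)
>>> len(features) == len(tuples)
True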

create_ngrams_features_from_docs(docs, target, is_lemma=True, degree=2)

Creates ngrams features from docs

Arguments:
docs (list[str]):

list of texts

target (str):

target name of the docs

is_lemma (bool):

default is True

degree (int):

degree of ngrams, default is 2

Returns:

list of tuples with features and the related target
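
For example, with degree=2 the features should be built on bigrams; 'pos' is a hypothetical target name:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> docs = ['Good case, Excellent value.']
>>> features = ntk.create_ngrams_features_from_docs(docs, 'pos', degree=2)
>>> features[0][1]
'pos'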

create_ngrams_features_from_tuples(tuples, is_lemma=True, degree=2)

Creates ngrams features from tuples

Arguments:
tuples (list[tuples]):

list of tuples with a sample and its target

is_lemma (bool):

default is True

degree (int):

degree of ngrams, default is 2

Returns:

list of tuples with features and the related target
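
The same sketch starting from tuples instead of docs, with a hypothetical 'pos' label:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> tuples = ntk.create_tuples(['Good case, Excellent value.'], ['pos'])
>>> features = ntk.create_ngrams_features_from_tuples(tuples, degree=2)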

create_tuples(docs=[], target=[])

Creates tuples pairing each sample with its target

Arguments:
docs (list[str]):

list of texts

target (list[str]):

list of targets

Returns:

list of tuples, each with a sample and its target
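
A minimal sketch; given parallel lists of samples and targets, the expected pairing is:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> ntk.create_tuples(['Good case, Excellent value.'], ['pos'])
[('Good case, Excellent value.', 'pos')]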

create_vocab_from_docs(docs, is_lemma=True)

Creates vocabulary from list of docs

Arguments:
docs (list[str]):

list of texts

is_lemma (bool):

default is True

Returns:

dictionary of tokens with their counts, as a collections.Counter object
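
Assuming the default cleaning shown in the class example, the vocabulary should count each cleaned token once:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> vocab = ntk.create_vocab_from_docs(['Good case, Excellent value.'])
>>> vocab['case']
1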

create_vocab_from_tuples(tuples, is_lemma=True)

Creates vocabulary from list of tuples

Arguments:
tuples (list[tuples]):

list of tuples with a sample and its target

is_lemma (bool):

default is True

Returns:

dictionary of tokens with their counts, as a collections.Counter object
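
The same sketch starting from tuples, with a hypothetical 'pos' label:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> tuples = ntk.create_tuples(['Good case, Excellent value.'], ['pos'])
>>> vocab = ntk.create_vocab_from_tuples(tuples)
>>> vocab['case']
1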

create_words_cloud(words, is_test=False)

Creates the cloud of words

Arguments:
words (str):

string of words

is_test (bool):

default is False

Returns:

only the word cloud plot
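
A sketch chaining create_words_map into the plot; is_test=True is assumed here to suppress rendering:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> words = ntk.create_words_map(ntk.get_tokens_cleaned('Good case, Excellent value.'))
>>> ntk.create_words_cloud(words, is_test=True)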

create_words_map(words)

Creates the map of words

Arguments:
words (list[str]):

list of words

Returns:

string of all words
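
A minimal sketch; the word list is assumed to be joined into one space-separated string:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> ntk.create_words_map(['good', 'case', 'excellent', 'value'])
'good case excellent value'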

get_doc_cleaned(doc, is_lemma=True)

Filters punctuation, numbers, stop words, and words of length <= min_length out of the doc; converts uppercase characters to lowercase; and, if is_lemma is True, also lemmatizes

Arguments:
doc (str):

text

is_lemma (bool):

default is True

Returns:

cleaned string

get_features(doc, is_lemma=True, words_top={}, degree=0)

Gets features from a doc

Arguments:
doc (str):

text

is_lemma (bool):

default is True

words_top (dict):

dictionary of the top words

degree (int):

degree of ngrams, default is 0

Returns:

dictionary of extracted features
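
A minimal sketch; the result is a feature dictionary for a single doc:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> features = ntk.get_features('Good case, Excellent value.')
>>> isinstance(features, dict)
True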

get_features_from_docs(docs, is_lemma=True, words_top={}, degree=0)

Gets features from a list of docs

Arguments:
docs (list[str]):

list of texts

is_lemma (bool):

default is True

words_top (dict):

dictionary of the top words

degree (int):

degree of ngrams, default is 0

Returns:

dictionary of extracted features
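
The same sketch over a list of docs:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> features = ntk.get_features_from_docs(['Good case, Excellent value.', 'Excellent quality.'])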

get_ngrams(degree=2, doc='', tokens=[], is_tuple=True, is_lemma=False)

Gets ngrams from doc or tokens

Arguments:
degree (int):

degree of ngrams, default is 2

doc (str):

text; optional if you pass tokens

tokens (list[str]):

list of tokens; optional if you pass doc

is_tuple (bool):

default is True

is_lemma (bool):

default is False

Returns:

list of tuples (ngrams) for that degree, or a list of strings (tokens)
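
For example, bigrams from an already cleaned doc; with is_tuple=True (the default) each ngram should be a tuple:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> ngrams = ntk.get_ngrams(degree=2, doc='good case excellent value')
>>> ngrams[0]
('good', 'case')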

get_ngrams_features(degree=2, doc='', tokens=[], is_lemma=False)

Gets ngrams features from doc or tokens

Arguments:
degree (int):

degree of ngrams, default is 2

doc (str):

text; optional if you pass tokens

tokens (list[str]):

list of tokens; optional if you pass doc

is_lemma (bool):

default is False

Returns:

dictionary of extracted ngrams
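
The same sketch, but returning the ngrams as a feature dictionary:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> features = ntk.get_ngrams_features(degree=2, doc='good case excellent value')
>>> isinstance(features, dict)
True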

get_stats_vocab(vocab, min_occurance=1)

Gets statistics of vocabulary

Arguments:
vocab (collections.Counter):

dictionary of tokens with their counts

min_occurance (int):

minimum occurrence considered

Returns:

tuple with the number of tokens occurring at least min_occurance times and the total number of tokens
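
For instance, with a vocabulary in which every token occurs once, both numbers should coincide:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> vocab = ntk.create_vocab_from_docs(['Good case, Excellent value.'])
>>> ntk.get_stats_vocab(vocab, min_occurance=1)
(4, 4)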

get_tokens_cleaned(doc, is_lemma=True)

Tokenizes the doc; filters out punctuation, numbers, stop words, and words of length <= min_length; converts uppercase characters to lowercase; and, if is_lemma is True, also lemmatizes

Arguments:
doc (str):

text

is_lemma (bool):

default is True

Returns:

list of cleaned tokens
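
This mirrors the class example above, returning the tokens instead of a joined string:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> ntk.get_tokens_cleaned('Good case, Excellent value.')
['good', 'case', 'excellent', 'value']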

get_vocabs_cleaned(vocabs)

Removes words common among the targets from the vocabs

Arguments:
vocabs (dict):

keys are targets, values are the vocabularies for those targets

Returns:

vocabs cleaned of the words common among targets
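
A sketch with two hypothetical targets; words appearing in both vocabularies, here 'case' and 'value', should be dropped from each:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> vocabs = {
...     'pos': ntk.create_vocab_from_docs(['Good case, Excellent value.']),
...     'neg': ntk.create_vocab_from_docs(['Bad case, poor value.']),
... }
>>> cleaned = ntk.get_vocabs_cleaned(vocabs)
>>> 'case' in cleaned['pos']
False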

get_words_top(vocab, how_many)

Gets the top words for each target

Arguments:
vocab (collections.Counter):

dictionary of tokens with their counts

how_many (int):

how many words to keep in the top list

Returns:

dictionary of the top how_many words
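
A sketch asking for the two most frequent words; the exact structure of the returned dictionary is defined by the library:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> vocab = ntk.create_vocab_from_docs(['Good case, Excellent value.'])
>>> top = ntk.get_words_top(vocab, 2)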

lemmatize(tokens)

Lemmatizes tokens

Arguments:
tokens (list[str]):

list of words

Returns:

list of lemmatized tokens
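
Assuming the default WordNetLemmatizer, plural nouns should reduce to their singular form:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> ntk.lemmatize(['cases', 'values'])
['case', 'value']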

tokenize_and_clean_doc(doc)

Tokenizes the doc; filters out punctuation, numbers, stop words, and words of length <= min_length; and converts uppercase characters to lowercase

Arguments:
doc (str):

text

Returns:

list of filtered words
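
This is the cleaning step without lemmatization; with the defaults, the doc from the class example should reduce to:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> ntk.tokenize_and_clean_doc('Good case, Excellent value.')
['good', 'case', 'excellent', 'value']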

vectorize_docs(docs, is_count=True, is_lemma=False, is_test=False)

Vectorizes docs

Arguments:
docs (list[str]):

list of texts

is_count (bool):

default is True

is_lemma (bool):

default is False

is_test (bool):

default is False

Returns:

list of scipy.sparse.csr.csr_matrix, one for each doc
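
A minimal sketch; per the return description, one sparse matrix is expected for each doc:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> docs = ['Good case, Excellent value.', 'Bad case.']
>>> vectors = ntk.vectorize_docs(docs)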

word_tokenize(doc)

Splits the document into single words

Arguments:
doc (str):

text

Returns:

list of words
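
A minimal sketch; this is plain word tokenization, without any cleaning:

>>> from smltk.preprocessing import Ntk
>>> ntk = Ntk()
>>> ntk.word_tokenize('Good case')
['Good', 'case']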