PreProcessing

The classes for data preparation with nltk, sklearn and more

A collection of methods to simplify your code.

Ntk

The class Ntk contains the Natural Language Processing tool kit.

Indicator

The class Indicator contains the tool kit to calculate the principal indicators.

Natural Language Tool Kit

Ntk.get_tokens_cleaned

Tokenizes doc, converts it to lower case, and filters out punctuation, numbers, stop words, and words of length <= min_length; if is_lemma is True, it also lemmatizes the tokens

Ntk.get_doc_cleaned

Converts doc to lower case and filters out punctuation, numbers, stop words, and words of length <= min_length; if is_lemma is True, it also lemmatizes the words

Ntk.get_stats_vocab

Gets statistics of vocabulary

Ntk.get_words_top

Gets the top words for each target

Ntk.get_vocabs_cleaned

Cleans vocabs from common words among targets

Ntk.get_ngrams

Gets ngrams from doc or tokens

Ntk.get_ngrams_features

Gets ngrams features from doc or tokens

Ntk.get_features

Gets features

Ntk.get_features_from_docs

Gets features from a list of docs

Ntk.create_tuples

Creates tuples with sample and its target

Ntk.create_vocab_from_docs

Creates vocabulary from list of docs

Ntk.create_vocab_from_tuples

Creates vocabulary from list of tuples

Ntk.create_features_from_docs

Creates features from docs

Ntk.create_features_from_tuples

Creates features from tuples

Ntk.create_ngrams_features_from_docs

Creates ngrams features from docs

Ntk.create_ngrams_features_from_tuples

Creates ngrams features from tuples

Ntk.create_words_map

Creates the map of words

Ntk.create_words_cloud

Creates the cloud of words

Ntk.vectorize_docs

Vectorizes docs

Indicators Tool Kit

Indicator.get_dc_events

Computes all relevant directional change parameters

Indicator.get_dc_events_starts

Gets only the starts of the directional change events

Detailed list

class smltk.preprocessing.Ntk(params=[])

The class Ntk contains the Natural Language Processing tool kit.

Arguments: params (dict) with the keys below
language (str):

default is english

lemmatizer (obj):

obj with method lemmatize() like WordNetLemmatizer()

min_length (int):

default is 1, so all words will be used

stop_words (list[str]):

default is stopwords.words()

tag_map (dict):

default contains J, V, R

Here’s an example:

>>> from smltk.preprocessing import Ntk
>>> doc = 'Good case, Excellent value.'
>>> ntk = Ntk()
>>> get_doc_cleaned = ntk.get_doc_cleaned(doc)
>>> print(get_doc_cleaned)
good case excellent value

add_doc_to_vocab(doc, vocab, is_lemma=True)

Adds tokens of that doc to vocabulary and updates vocabulary

Arguments:
doc (str):

text

vocab (collections.Counter):

dictionary of tokens with their counts

is_lemma (bool):

default is True

Returns:

list of the tokens of that doc, and the updated vocab
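
As a rough illustration of the idea, not the smltk implementation, updating a collections.Counter vocabulary with the tokens of one doc could look like the sketch below. The helper name add_tokens_to_vocab and the naive whitespace tokenization are assumptions made for this example; the documented method cleans and optionally lemmatizes the doc first.

```python
from collections import Counter


def add_tokens_to_vocab(doc, vocab):
    """Hypothetical helper: count the whitespace-separated tokens of doc in vocab."""
    tokens = doc.lower().split()  # assumption: naive tokenization, no cleaning
    vocab.update(tokens)          # Counter.update adds to the existing counts
    return tokens


vocab = Counter()
tokens = add_tokens_to_vocab("good case excellent value", vocab)
add_tokens_to_vocab("good value", vocab)
print(tokens)         # ['good', 'case', 'excellent', 'value']
print(vocab["good"])  # 2
```

The same Counter can be passed to several calls, which is how a vocabulary accumulates across docs.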

create_features_from_docs(docs, target, is_lemma=True, words_top={}, degree=0)

Creates features from docs

Arguments:
docs (list[str]):

list of text

target (str):

target name of the docs

is_lemma (bool):

default is True

words_top (dict):

dictionary of the top words

degree (int):

degree of ngrams, default is 0

Returns:

list of tuples with features and relative target

create_features_from_tuples(tuples, is_lemma=True, words_top={}, degree=0)

Creates features from tuples

Arguments:
tuples (list[tuples]):

list of tuples with sample and its target

is_lemma (bool):

default is True

words_top (dict):

dictionary of the top words

degree (int):

degree of ngrams, default is 0

Returns:

list of tuples with features and relative target

create_ngrams_features_from_docs(docs, target, is_lemma=True, degree=2)

Creates ngrams features from docs

Arguments:
docs (list[str]):

list of text

target (str):

target name of the docs

is_lemma (bool):

default is True

degree (int):

degree of ngrams, default is 2

Returns:

list of tuples with features and relative target

create_ngrams_features_from_tuples(tuples, is_lemma=True, degree=2)

Creates ngrams features from tuples

Arguments:
tuples (list[tuples]):

list of tuples with sample and its target

is_lemma (bool):

default is True

degree (int):

degree of ngrams, default is 2

Returns:

list of tuples with features and relative target

create_tuples(docs=[], target=[])

Creates tuples with sample and its target

Arguments:
docs (list[str]):

list of texts

target (list[str]):

list of targets

Returns:

list of tuples with sample and its target
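
A minimal sketch of the pairing this method describes, assuming each tuple has the form (sample, target) and keeps the input order. The second doc and the target names are illustrative values, and zip is used here as a stand-in, not as a statement about how smltk builds the list.

```python
docs = ["Good case, Excellent value.", "Bad case, poor value."]
targets = ["positive", "negative"]

# Assumption: each tuple is (sample, target), in the same order as the inputs.
tuples = list(zip(docs, targets))
print(tuples[0])  # ('Good case, Excellent value.', 'positive')
```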

create_vocab_from_docs(docs, is_lemma=True)

Creates vocabulary from list of docs

Arguments:
docs (list[str]):

list of texts

is_lemma (bool):

default is True

Returns:

collections.Counter object mapping each token to its count

create_vocab_from_tuples(tuples, is_lemma=True)

Creates vocabulary from list of tuples

Arguments:
tuples (list[tuples]):

list of tuples with sample and its target

is_lemma (bool):

default is True

Returns:

collections.Counter object mapping each token to its count

create_words_cloud(words, is_test=False)

Creates the cloud of words

Arguments:
words (str):

words

is_test (bool):

default is False

Returns:

only the word cloud plot

create_words_map(words)

Creates the map of words

Arguments:
words (list[str]):

words list

Returns:

string of all words

get_doc_cleaned(doc, is_lemma=True)

Converts doc to lower case and filters out punctuation, numbers, stop words, and words of length <= min_length; if is_lemma is True, it also lemmatizes the words

Arguments:
doc (str):

text

is_lemma (bool):

default is True

Returns:

string cleaned

get_features(doc, is_lemma=True, words_top={}, degree=0)

Gets features

Arguments:
doc (str):

text

is_lemma (bool):

default is True

words_top (dict):

dictionary of the top words

degree (int):

degree of ngrams, default is 0

Returns:

dictionary of features extracted

get_features_from_docs(docs, is_lemma=True, words_top={}, degree=0)

Gets features from a list of docs

Arguments:
docs (list[str]):

list of text

is_lemma (bool):

default is True

words_top (dict):

dictionary of the top words

degree (int):

degree of ngrams, default is 0

Returns:

dictionary of features extracted

get_ngrams(degree=2, doc='', tokens=[], is_tuple=True, is_lemma=False)

Gets ngrams from doc or tokens

Arguments:
degree (int):

degree of ngrams, default is 2

doc (str):

text, optional if you pass tokens

tokens (list[str]):

list of tokens, optional if you pass doc

is_tuple (bool):

default is True

is_lemma (bool):

default is False

Returns:

list of tuples (n-grams) for that degree, or list of strings (tokens)
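
To make the notion of degree concrete, here is a small standard-library sketch of a sliding window over tokens. The function name ngrams is hypothetical, and the behaviour for is_tuple=False (joining each n-gram with a space) is an assumption for this example, not documented smltk behaviour.

```python
def ngrams(tokens, degree=2, is_tuple=True):
    """Hypothetical sketch: every run of `degree` consecutive tokens."""
    # zip over `degree` shifted copies of the token list.
    grams = list(zip(*(tokens[i:] for i in range(degree))))
    if is_tuple:
        return grams
    return [" ".join(gram) for gram in grams]  # assumption: space-joined strings


tokens = ["good", "case", "excellent", "value"]
bigrams = ngrams(tokens, degree=2)
print(bigrams)
# [('good', 'case'), ('case', 'excellent'), ('excellent', 'value')]
print(ngrams(tokens, degree=3, is_tuple=False))
# ['good case excellent', 'case excellent value']
```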

get_ngrams_features(degree=2, doc='', tokens=[], is_lemma=False)

Gets ngrams features from doc or tokens

Arguments:
degree (int):

degree of ngrams, default is 2

doc (str):

text, optional if you pass tokens

tokens (list[str]):

list of tokens, optional if you pass doc

is_lemma (bool):

default is False

Returns:

dictionary of ngrams extracted

get_stats_vocab(vocab, min_occurance=1)

Gets statistics of vocabulary

Arguments:
vocab (collections.Counter):

dictionary of tokens with their counts

min_occurance (int):

minimum occurrence considered

Returns:

tuple with the number of tokens whose count is >= min_occurance and the total number of tokens
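
The returned pair can be pictured with a plain collections.Counter. This is a sketch under the assumption that both numbers count distinct tokens rather than total occurrences; the counts are illustrative values.

```python
from collections import Counter

vocab = Counter({"good": 3, "case": 2, "value": 1})
min_occurance = 2  # parameter name spelled as in the signature above

# Assumption: both numbers count distinct tokens, not total occurrences.
kept = sum(1 for count in vocab.values() if count >= min_occurance)
stats = (kept, len(vocab))
print(stats)  # (2, 3)
```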

get_tokens_cleaned(doc, is_lemma=True)

Tokenizes doc, converts it to lower case, and filters out punctuation, numbers, stop words, and words of length <= min_length; if is_lemma is True, it also lemmatizes the tokens

Arguments:
doc (str):

text

is_lemma (bool):

default is True

Returns:

list of tokens cleaned

get_vocabs_cleaned(vocabs)

Cleans vocabs from common words among targets

Arguments:
vocabs (dict):

keys are targets, values are vocabularies for that target

Returns:

vocabs cleaned from common words among targets
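
One possible reading of "common words among targets" is sketched below, assuming a word is common when it appears in the vocabulary of every target. That definition is an assumption for this example; the library may use a different rule.

```python
from collections import Counter

vocabs = {
    "positive": Counter({"good": 3, "case": 2}),
    "negative": Counter({"bad": 2, "case": 1}),
}

# Assumption: a word is "common" when every target's vocabulary contains it.
common = set.intersection(*(set(vocab) for vocab in vocabs.values()))
cleaned = {
    target: Counter({word: n for word, n in vocab.items() if word not in common})
    for target, vocab in vocabs.items()
}
print(common)               # {'case'}
print(cleaned["positive"])  # Counter({'good': 3})
```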

get_words_top(vocab, how_many)

Gets the top words for each target

Arguments:
vocab (collections.Counter):

dictionary of tokens with their counts

how_many (int):

number of words to include in the top list

Returns:

dictionary of the top how_many words
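
A comparable result can be obtained from a collections.Counter with most_common. This sketch assumes the returned dictionary maps each top word to its count, which the description above does not state explicitly.

```python
from collections import Counter

vocab = Counter({"good": 3, "case": 2, "value": 1})
how_many = 2

# Assumption: the result maps each of the top words to its count.
words_top = dict(vocab.most_common(how_many))
print(words_top)  # {'good': 3, 'case': 2}
```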

lemmatize(tokens)

Lemmatizes tokens

Arguments:
tokens (list[str]):

list of words

Returns:

list of tokens lemmatized

tokenize_and_clean_doc(doc)

Tokenizes doc, converts it to lower case, and filters out punctuation, numbers, stop words, and words of length <= min_length

Arguments:
doc (str):

text

Returns:

list of words filtered
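
The cleaning steps listed above can be approximated with the standard library alone. The sketch below is not the smltk code: the function name tokenize_and_clean and the tiny stop-word set are assumptions, whereas the library tokenizes with nltk and defaults to stopwords.words().

```python
import string

# Assumption: a tiny illustrative stop-word set, not the nltk list.
STOP_WORDS = {"a", "an", "the", "and", "is", "it"}


def tokenize_and_clean(doc, min_length=1, stop_words=STOP_WORDS):
    """Hypothetical stand-in for the cleaning steps described above."""
    # Replace punctuation with spaces so that "case," becomes "case".
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = doc.translate(table).lower().split()
    return [
        token
        for token in tokens
        if token.isalpha()           # drop numbers and mixed tokens
        and token not in stop_words  # drop stop words
        and len(token) > min_length  # drop words of length <= min_length
    ]


print(tokenize_and_clean("Good case, Excellent value."))
# ['good', 'case', 'excellent', 'value']
print(tokenize_and_clean("It is a 5 star case!"))
# ['star', 'case']
```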

vectorize_docs(docs, is_count=True, is_lemma=False, is_test=False)

Vectorizes docs

Arguments:
docs (list[str]):

list of texts

is_count (bool):

default is True

is_lemma (bool):

default is False

is_test (bool):

default is False

Returns:

list of scipy.sparse.csr.csr_matrix, one for each doc
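
For intuition only, count-based vectorization can be sketched with plain lists. This assumes is_count=True selects raw term counts, and it returns dense lists instead of the scipy sparse matrices that the method itself returns.

```python
docs = ["good case", "good value good price"]

# Build a sorted vocabulary, then one count vector per doc.
vocabulary = sorted({token for doc in docs for token in doc.split()})
vectors = [[doc.split().count(term) for term in vocabulary] for doc in docs]
print(vocabulary)  # ['case', 'good', 'price', 'value']
print(vectors)     # [[1, 1, 0, 0], [0, 2, 1, 1]]
```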

word_tokenize(doc)

Splits the document into words

Arguments:
doc (str):

text

Returns:

list of words

class smltk.preprocessing.Indicator(params={})

The class Indicator contains the tool kit to calculate the principal indicators.

Arguments: params (dict) with the keys below
events (list[str]):

list of directional change events

timeseries (list[int|float]):

list of values, default None

Here’s an example:

>>> import numpy
>>> from smltk.preprocessing import Indicator
>>> timeseries = numpy.array(values)  # values: placeholder for your list of int or float observations
>>> indicator = Indicator()
>>> dc_events = indicator.get_dc_events(timeseries)
>>> print(dc_events)
array['upward dc', 'downward dc', ..]

get_dc_events(timeseries: array = None, threshold: float = 0.0001) -> list

Computes all relevant directional change parameters

Arguments:
timeseries (list[int|float]):

list of values

threshold (float):

default is 0.0001

Returns:

list of directional change events
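
The general directional change idea is that a reversal is confirmed once the series moves by at least threshold (as a fraction) away from the last extreme. The sketch below illustrates that idea with one label per value. It is a generic formulation written for this page, not the smltk algorithm; only the labels 'upward dc' and 'downward dc' come from the example above, while the function name and the initial upward trend are assumptions.

```python
def dc_events(timeseries, threshold=0.0001):
    """Hypothetical sketch: label each value with the current confirmed trend."""
    events = []
    if len(timeseries) == 0:
        return events
    trend = "upward dc"      # assumption: the series starts in an upward trend
    extreme = timeseries[0]  # running high (upward) or running low (downward)
    for price in timeseries:
        if trend == "upward dc":
            if price <= extreme * (1 - threshold):
                trend, extreme = "downward dc", price  # downward reversal confirmed
            else:
                extreme = max(extreme, price)
        else:
            if price >= extreme * (1 + threshold):
                trend, extreme = "upward dc", price    # upward reversal confirmed
            else:
                extreme = min(extreme, price)
        events.append(trend)
    return events


labels = dc_events([100, 101, 100, 98, 99, 101], threshold=0.02)
print(labels)
# ['upward dc', 'upward dc', 'upward dc', 'downward dc', 'downward dc', 'upward dc']
```

A larger threshold confirms fewer reversals, so the label sequence changes less often.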

get_dc_events_starts(events: list = None, timeseries: list = None) -> dict

Gets only the starts of the directional change events

Arguments:
events (list[str]):

list of directional change events

timeseries (list[int|float]):

list of values

Returns:

dictionary of boolean lists marking where each directional change event starts
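
One way to picture such a dictionary is sketched below, assuming an event starts at index i when its label differs from the label at i - 1, and that the keys are the event labels. Both points are assumptions made for this example; the structure smltk actually returns may differ.

```python
events = ["upward dc", "upward dc", "downward dc", "downward dc", "upward dc"]

# Assumption: a label "starts" at index i when the previous label is different
# (the first element is treated as a start).
starts = {
    label: [
        event == label and (i == 0 or events[i - 1] != label)
        for i, event in enumerate(events)
    ]
    for label in dict.fromkeys(events)  # unique labels in first-seen order
}
print(starts["upward dc"])    # [True, False, False, False, True]
print(starts["downward dc"])  # [False, False, True, False, False]
```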