Data Processing

Simple Machine Learning Tool Kit package

This package contains modules that simplify your data processing code.

It is part of the educational repositories (https://github.com/pandle/materials) for learning how to write standard code and common uses of TDD.

The package contains two classes to manage data processing.

>>> import smltk
>>> help(smltk)
>>> from smltk.data_processing import DataProcessing
>>> help(DataProcessing)

# license MIT # support https://github.com/bilardi/smltk/issues

`DataProcessing`	The class DataProcessing contains the simple methods to manage the data, both tabular and textual.
`Ntk`	The class Ntk contains the Natural Language Processing tool kit.

Data Processing

`DataProcessing.get_df`	Create a DataFrame from the data of the main repositories
`DataProcessing.get_inference_df`	Create a DataFrame from the data of the main repositories
`DataProcessing.transform_categories`	Transform categorical features into discrete features
`DataProcessing.clean_doc`	Cleans tokens by converting uppercase characters to lowercase, removing non-alphabetic characters if is_alpha is True, and removing punctuation if is_punctuation is True
`DataProcessing.tokenize_and_clean_doc`	Tokenizes doc and cleans the tokens by converting uppercase characters to lowercase, removing non-alphabetic characters if is_alpha is True, and removing punctuation if is_punctuation is True
`DataProcessing.get_tokens_cleaned`	Tokenizes doc, filters punctuation from the tokens, and converts uppercase characters to lowercase; if is_lemma is True, it also lemmatizes, or if is_stem is True, it also stems
`DataProcessing.get_doc_cleaned`	Filters punctuation from doc and converts uppercase characters to lowercase; if is_lemma is True, it also lemmatizes, or if is_stem is True, it also stems

Natural Language Tool Kit

`Ntk.get_stats_vocab`	Gets statistics of vocabulary
`Ntk.get_words_top`	Gets words top for each target
`Ntk.get_vocabs_cleaned`	Cleans vocabs from common words among targets
`Ntk.get_ngrams`	Gets ngrams from doc or tokens
`Ntk.get_ngrams_features`	Gets ngrams features from doc or tokens
`Ntk.get_features`	Gets features
`Ntk.get_features_from_docs`	Gets features
`Ntk.create_tuples`	Creates tuples with sample and its target
`Ntk.create_vocab_from_docs`	Creates vocabulary from list of docs
`Ntk.create_vocab_from_tuples`	Creates vocabulary from list of tuples
`Ntk.create_features_from_docs`	Creates features from docs
`Ntk.create_features_from_tuples`	Creates features from tuples
`Ntk.create_ngrams_features_from_docs`	Creates ngrams features from docs
`Ntk.create_ngrams_features_from_tuples`	Creates ngrams features from tuples
`Ntk.create_words_map`	Creates the map of words
`Ntk.create_words_cloud`	Creates the cloud of words
`Ntk.vectorize_docs`	Vectorizes docs

Detailed list

class smltk.data_processing.DataProcessing(params: dict = {})

The class DataProcessing contains the simple methods to manage the data, both tabular and textual.

Arguments: params (dict) with the key below

language (str):: default is english

Here’s an example:

>>> from smltk.data_processing import DataProcessing
>>> doc = 'Good case, Excellent value.'
>>> dp = DataProcessing()
>>> get_doc_cleaned = dp.get_doc_cleaned(doc)
>>> print(get_doc_cleaned)
good case excellent value

clean_doc(tokens: list, is_alpha: bool = True, is_punctuation: bool = True) → list

Cleans tokens by converting uppercase characters to lowercase, removing non-alphabetic characters if is_alpha is True, and removing punctuation if is_punctuation is True

Arguments:

tokens (list[str]):: list of words
is_alpha (bool):: default is True
is_punctuation (bool):: default is True

Returns:

list of words filtered
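
The effect of the two flags can be sketched with the standard library alone. This is a minimal illustration and not the smltk implementation: the function name clean_tokens_sketch is hypothetical, and the order of the steps is an assumption.

```python
import string


def clean_tokens_sketch(tokens, is_alpha=True, is_punctuation=True):
    """Hypothetical stand-in for clean_doc; not the smltk implementation."""
    # Translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    cleaned = []
    for token in tokens:
        token = token.lower()
        if is_punctuation:
            token = token.translate(table)
        if is_alpha and not token.isalpha():
            # Drop tokens that still contain non-alphabetic characters.
            continue
        if token:
            cleaned.append(token)
    return cleaned


print(clean_tokens_sketch(["Good", "case,", "Excellent", "value.", "42"]))
# ['good', 'case', 'excellent', 'value']
```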

get_df(data)

Create a DataFrame from the data of the main repositories

Arguments:

data (mixed):: data loaded from one of the main repositories

Returns:

Pandas DataFrame

get_doc_cleaned(doc: str, is_lemma: bool = True, is_stem: bool = False, is_alpha: bool = True, is_punctuation: bool = True) → str

Filters punctuation from doc and converts uppercase characters to lowercase; if is_lemma is True, it also lemmatizes, or if is_stem is True, it also stems

Arguments:

doc (str):: text
is_lemma (bool):: default is True
is_stem (bool):: default is False
is_alpha (bool):: default is True
is_punctuation (bool):: default is True

Returns:

string cleaned

get_inference_df(data, x_test, y_test, y_pred)

Create a DataFrame from the data of the main repositories

Arguments:

x_test (Pandas DataFrame):: features used for the prediction
y_test (list of str):: list of the targets
y_pred (list of str):: list of the predictions

Returns:

Pandas DataFrame

get_tokens_cleaned(doc: str, is_lemma: bool = True, is_stem: bool = False, is_alpha: bool = True, is_punctuation: bool = True) → list

Tokenizes doc, filters punctuation from the tokens, and converts uppercase characters to lowercase; if is_lemma is True, it also lemmatizes, or if is_stem is True, it also stems

Arguments:

doc (str):: text
is_lemma (bool):: default is True
is_stem (bool):: default is False
is_alpha (bool):: default is True
is_punctuation (bool):: default is True

Returns:

list of tokens cleaned

harmonize_words(words: list, synonyms: dict = None, lang: list = None) → list

Harmonize a list of words to a canonical form.

Arguments:

words (list):: list of words/phrases (may contain None or NaN)
synonyms (dict):: optional dict mapping canonical -> list of variants; variants are normalized (lowercase + strip) automatically
lang (list):: optional list of languages for the lemmatizer, defaults to [self.language]

Returns:

list of words in canonical form, same length as input; None is preserved as is and empty strings stay empty

Raises:

ValueError:: if a variant in synonyms maps to two different canonical forms
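
The synonym lookup and the conflict check can be sketched as follows. This is a simplified illustration under stated assumptions, not the library code: harmonize_sketch is a hypothetical name, the lemmatizer step is left out, and words without a synonym entry are only lowercased and stripped.

```python
def harmonize_sketch(words, synonyms=None):
    """Hypothetical sketch of synonym-based harmonization (no lemmatizer)."""
    variant_to_canonical = {}
    for canonical, variants in (synonyms or {}).items():
        for variant in variants:
            key = variant.lower().strip()
            known = variant_to_canonical.get(key)
            if known is not None and known != canonical:
                raise ValueError(
                    f"variant {key!r} maps to both {known!r} and {canonical!r}"
                )
            variant_to_canonical[key] = canonical

    result = []
    for word in words:
        if not isinstance(word, str):
            # None and NaN pass through untouched in this sketch.
            result.append(word)
            continue
        key = word.lower().strip()
        result.append(variant_to_canonical.get(key, key))
    return result


synonyms = {"laptop": ["Notebook", " portable pc "]}
print(harmonize_sketch(["notebook", None, "", "Phone"], synonyms))
# ['laptop', None, '', 'phone']
```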

lemmatize(tokens: list) → list

Lemmatizes tokens

Arguments:

tokens (list[str]):: list of words

Returns:

list of tokens lemmatized

stem(tokens: list) → list

Stems tokens

Arguments:

tokens (list[str]):: list of words

Returns:

list of tokens stemmed

tokenize(doc: str) → list

Tokenizes doc

Arguments:

doc (str):: text

Returns:

list of words

tokenize_and_clean_doc(doc: str, is_alpha: bool = True, is_punctuation: bool = True) → list

Tokenizes doc and cleans the tokens by converting uppercase characters to lowercase, removing non-alphabetic characters if is_alpha is True, and removing punctuation if is_punctuation is True

Arguments:

doc (str):: text
is_alpha (bool):: default is True
is_punctuation (bool):: default is True

Returns:

list of words filtered

transform_categories(data: DataFrame, categorical_features: dict = None, features: dict = None) → list

Transform categorical features into discrete features

Arguments:

data (Pandas DataFrame):: features to elaborate
categorical_features (dict):: keys are the names of the categorical features, values are the ordered lists of their values
features (dict):: dictionary with the key categorical_features, whose value is the list of the categorical features

Returns:

list of categorical_features and data transformed
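
To show what an ordered list of values means for the encoding, here is a dependency-free sketch. It works on a list of dict rows instead of a Pandas DataFrame, the name encode_ordered_sketch is hypothetical, and using the list position as the discrete value is an assumption about the method, not a description of its code.

```python
def encode_ordered_sketch(rows, categorical_features):
    """Hypothetical ordinal encoding over a list of dict rows."""
    encoded = []
    for row in rows:
        new_row = dict(row)
        for name, ordered_values in categorical_features.items():
            # The position in the ordered list becomes the discrete value.
            new_row[name] = ordered_values.index(row[name])
        encoded.append(new_row)
    return encoded


rows = [{"size": "small", "price": 10}, {"size": "large", "price": 30}]
print(encode_ordered_sketch(rows, {"size": ["small", "medium", "large"]}))
# [{'size': 0, 'price': 10}, {'size': 2, 'price': 30}]
```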

class smltk.data_processing.Ntk(params: list = [])

The class Ntk contains the Natural Language Processing tool kit. It extends the base class DataProcessing and overrides the methods lemmatize(), tokenize() and tokenize_and_clean_doc().

Arguments: params (dict) with the keys below

language (str):: default is english
lemmatizer (obj):: obj with method lemmatize() like WordNetLemmatizer()
min_length (int):: default is 1, so all words will be used
stop_words (list[str]):: default is stopwords.words()
tag_map (dict):: default contains J, V, R

Here’s an example:

>>> from smltk.data_processing import Ntk
>>> doc = 'Good case, Excellent value.'
>>> ntk = Ntk()
>>> get_doc_cleaned = ntk.get_doc_cleaned(doc)
>>> print(get_doc_cleaned)
good case excellent value

add_doc_to_vocab(doc: str, vocab: Counter, is_lemma: bool = True) → list

Adds tokens of that doc to vocabulary and updates vocabulary

Arguments:

doc (str):: text
vocab (collections.Counter):: dictionary of tokens with its count
is_lemma (bool):: default is True

Returns:

list of tokens of that doc and vocab updated

create_features_from_docs(docs, target, is_lemma=True, words_top={}, degree=0)

Creates features from docs

Arguments:

docs (list[str]):: list of texts
target (str):: target name of the docs
is_lemma (bool):: default is True
words_top (dict):: dictionary of the words top
degree (int):: degree of ngrams, default is 0

Returns:

list of tuples with features and relative target

create_features_from_tuples(tuples, is_lemma=True, words_top={}, degree=0)

Creates features from tuples

Arguments:

tuples (list[tuples]):: list of tuples with sample and its target
is_lemma (bool):: default is True
words_top (dict):: dictionary of the words top
degree (int):: degree of ngrams, default is 0

Returns:

list of tuples with features and relative target

create_ngrams_features_from_docs(docs, target, is_lemma=True, degree=2)

Creates ngrams features from docs

Arguments:

docs (list[str]):: list of texts
target (str):: target name of the docs
is_lemma (bool):: default is True
degree (int):: degree of ngrams, default is 2

Returns:

list of tuples with features and relative target

create_ngrams_features_from_tuples(tuples, is_lemma=True, degree=2)

Creates ngrams features from tuples

Arguments:

tuples (list[tuples]):: list of tuples with sample and its target
is_lemma (bool):: default is True
degree (int):: degree of ngrams, default is 2

Returns:

list of tuples with features and relative target

create_tuples(docs: list = [], target: list = []) → list

Creates tuples with sample and its target

Arguments:

docs (list[str]):: list of texts
target (list[str]):: list of targets

Returns:

list of tuples with sample and its target
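
The shape of the returned list can be pictured with plain zip. This is only an illustration of the data structure, assuming docs and target are paired by position; it does not call smltk.

```python
docs = ["Good case, Excellent value.", "Broke after a week."]
target = ["positive", "negative"]

# Pair each sample with the target at the same position.
tuples = list(zip(docs, target))
print(tuples)
# [('Good case, Excellent value.', 'positive'), ('Broke after a week.', 'negative')]
```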

create_vocab_from_docs(docs: list, is_lemma: bool = True) → dict

Creates vocabulary from list of docs

Arguments:

docs (list[str]):: list of texts
is_lemma (bool):: default is True

Returns:

dictionary of tokens with its count in an object collections.Counter
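
A vocabulary of this kind can be sketched with collections.Counter. The function name vocab_from_docs_sketch is hypothetical, and the naive whitespace split is an assumption that stands in for the tokenizing, cleaning and lemmatizing the class performs.

```python
from collections import Counter


def vocab_from_docs_sketch(docs):
    """Hypothetical vocabulary builder; not the smltk implementation."""
    vocab = Counter()
    for doc in docs:
        # Naive whitespace tokenization stands in for the library's cleaning.
        vocab.update(doc.lower().split())
    return vocab


vocab = vocab_from_docs_sketch(["good case", "good value"])
print(vocab)
# Counter({'good': 2, 'case': 1, 'value': 1})
```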

create_vocab_from_tuples(tuples: list, is_lemma: bool = True)

Creates vocabulary from list of tuples

Arguments:

tuples (list[tuples]):: list of tuples with sample and its target
is_lemma (bool):: default is True

Returns:

dictionary of tokens with its count in an object collections.Counter

create_words_cloud(words, is_test=False)

Creates the cloud of words

Arguments:

words (str):: words
is_test (bool):: default is False

Returns:

only words cloud plot

create_words_map(words)

Creates the map of words

Arguments:

words (list[str]):: words list

Returns:

string of all words

get_features(doc, is_lemma=True, words_top={}, degree=0)

Gets features

Arguments:

doc (str):: text
is_lemma (bool):: default is True
words_top (dict):: dictionary of the words top
degree (int):: degree of ngrams, default is 0

Returns:

dictionary of features extracted

get_features_from_docs(docs, is_lemma=True, words_top={}, degree=0)

Gets features

Arguments:

docs (list[str]):: list of texts
is_lemma (bool):: default is True
words_top (dict):: dictionary of the words top
degree (int):: degree of ngrams, default is 0

Returns:

dictionary of features extracted

get_ngrams(degree=2, doc='', tokens=[], is_tuple=True, is_lemma=False)

Gets ngrams from doc or tokens

Arguments:

degree (int):: degree of ngrams, default is 2
doc (str):: text, optional if you pass tokens
tokens (list[str]):: list of tokens, optional if you pass doc
is_tuple (bool):: default is True
is_lemma (bool):: default is False

Returns:

list of tuples (n_grams) for that degree, or list of string (token)
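
What an n-gram of a given degree looks like can be sketched from a token list. The name ngrams_sketch is hypothetical, and joining each n-gram into one string when is_tuple is False is an assumption about the output, not confirmed behavior.

```python
def ngrams_sketch(tokens, degree=2, is_tuple=True):
    """Hypothetical n-gram builder; not the smltk implementation."""
    # Zip the token list against shifted copies of itself.
    grams = list(zip(*(tokens[i:] for i in range(degree))))
    if is_tuple:
        return grams
    # Assumption: without tuples, each n-gram is joined into one string.
    return [" ".join(gram) for gram in grams]


tokens = ["good", "case", "excellent", "value"]
print(ngrams_sketch(tokens))
# [('good', 'case'), ('case', 'excellent'), ('excellent', 'value')]
print(ngrams_sketch(tokens, degree=3, is_tuple=False))
# ['good case excellent', 'case excellent value']
```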

get_ngrams_features(degree=2, doc='', tokens=[], is_lemma=False)

Gets ngrams features from doc or tokens

Arguments:

degree (int):: degree of ngrams, default is 2
doc (str):: text, optional if you pass tokens
tokens (list[str]):: list of tokens, optional if you pass doc
is_lemma (bool):: default is False

Returns:

dictionary of ngrams extracted

get_stats_vocab(vocab: Counter, min_occurance: int = 1) → tuple

Gets statistics of vocabulary

Arguments:

vocab (collections.Counter):: dictionary of tokens with its count
min_occurance (int):: minimum occurrence considered

Returns:

tuple of the number of tokens with count >= min_occurance and the total number of tokens
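
The two numbers in the tuple can be sketched on a small Counter. The name stats_vocab_sketch is hypothetical, and counting distinct tokens (not the sum of all counts) for the total is an assumption.

```python
from collections import Counter


def stats_vocab_sketch(vocab, min_occurance=1):
    """Hypothetical vocabulary statistics; not the smltk implementation."""
    # Distinct tokens that occur at least min_occurance times.
    kept = sum(1 for count in vocab.values() if count >= min_occurance)
    # Assumption: the total is the number of distinct tokens.
    return kept, len(vocab)


vocab = Counter({"good": 2, "case": 1, "value": 1})
print(stats_vocab_sketch(vocab, min_occurance=2))
# (1, 3)
```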

get_vocabs_cleaned(vocabs)

Cleans vocabs from common words among targets

Arguments:

vocabs (dict):: keys are targets, values are vocabularies for that target

Returns:

vocabs cleaned from common words among targets
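
One way to read "common words among targets" is the set of words that appear in every target's vocabulary. The sketch below follows that reading, which is an assumption; the name vocabs_cleaned_sketch is hypothetical and the code is not taken from smltk.

```python
from collections import Counter


def vocabs_cleaned_sketch(vocabs):
    """Hypothetical cleaning of per-target vocabularies."""
    if not vocabs:
        return {}
    # Assumption: a word is common when every target's vocabulary has it.
    common = set.intersection(*(set(vocab) for vocab in vocabs.values()))
    return {
        target: Counter(
            {word: count for word, count in vocab.items() if word not in common}
        )
        for target, vocab in vocabs.items()
    }


vocabs = {
    "positive": Counter({"good": 3, "case": 1}),
    "negative": Counter({"bad": 2, "case": 2}),
}
print(vocabs_cleaned_sketch(vocabs))
# {'positive': Counter({'good': 3}), 'negative': Counter({'bad': 2})}
```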

get_words_top(vocab, how_many)

Gets words top for each target

Arguments:

vocab (collections.Counter):: dictionary of tokens with its count
how_many (int):: how many words in your top how_many list

Returns:

dictionary of the top how_many list
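
A top how_many selection maps naturally onto Counter.most_common. This is a minimal sketch: the name words_top_sketch is hypothetical, and returning a word-to-count dictionary is an assumption about the result format.

```python
from collections import Counter


def words_top_sketch(vocab, how_many):
    """Hypothetical top-words selection; not the smltk implementation."""
    # most_common returns (word, count) pairs sorted by descending count.
    return dict(vocab.most_common(how_many))


vocab = Counter({"good": 3, "case": 2, "value": 1})
print(words_top_sketch(vocab, 2))
# {'good': 3, 'case': 2}
```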

lemmatize(tokens: list) → list

Lemmatizes tokens

Arguments:

tokens (list[str]):: list of words

Returns:

list of tokens lemmatized

tokenize(doc: str) → list

Splits document in each word

Arguments:

doc (str):: text

Returns:

list of words

tokenize_and_clean_doc(doc: str, is_alpha: bool = False, is_punctuation: bool = False) → list

Tokenizes doc and filters tokens from punctuation, numbers, stop words, and words <= min_length and changes upper characters in lower characters

Arguments:

doc (str):: text

Returns:

list of words filtered

vectorize_docs(docs, is_count=True, is_lemma=False, is_test=False)

Vectorizes docs

Arguments:

docs (list[str]):: list of texts
is_count (bool):: default is True
is_lemma (bool):: default is False
is_test (bool):: default is False

Returns:

list of scipy.sparse.csr.csr_matrix, one for each doc