Data Processing

Simple Machine Learning Tool Kit package

This package contains modules that simplify your data processing code.

It is part of the educational repositories (https://github.com/pandle/materials) for learning how to write standard code and the common uses of TDD.

The package contains two classes to manage data processing.

>>> import smltk
>>> help(smltk)
>>> from smltk.data_processing import DataProcessing
>>> help(DataProcessing)

License: MIT

Support: https://github.com/bilardi/smltk/issues

DataProcessing

The class DataProcessing contains the simple methods to manage the data, both tabular and textual.

Ntk

The class Ntk contains the Natural Language Processing tool kit.

Data Processing

DataProcessing.get_df

Create a DataFrame from the data of the main repositories

DataProcessing.get_inference_df

Creates a DataFrame from the data of the main repositories, the test features, the targets, and the predictions

DataProcessing.transform_categories

Transforms categorical features into discrete features

DataProcessing.clean_doc

Filters punctuation from the tokens and converts uppercase characters to lowercase; if is_alpha is True, removes non-alphabetic characters, and if is_punctuation is True, removes punctuation

DataProcessing.tokenize_and_clean_doc

Tokenizes doc, filters punctuation from the tokens, and converts uppercase characters to lowercase; if is_alpha is True, removes non-alphabetic characters, and if is_punctuation is True, removes punctuation

DataProcessing.get_tokens_cleaned

Tokenizes doc, filters punctuation from the tokens, and converts uppercase characters to lowercase; if is_lemma is True, it also lemmatizes, or if is_stem is True, it also stems

DataProcessing.get_doc_cleaned

Filters punctuation from doc and converts uppercase characters to lowercase; if is_lemma is True, it also lemmatizes, or if is_stem is True, it also stems

Natural Language Tool Kit

Ntk.get_stats_vocab

Gets statistics of vocabulary

Ntk.get_words_top

Gets the top words for each target

Ntk.get_vocabs_cleaned

Removes from the vocabularies the words that are common among targets

Ntk.get_ngrams

Gets ngrams from doc or tokens

Ntk.get_ngrams_features

Gets ngrams features from doc or tokens

Ntk.get_features

Gets features

Ntk.get_features_from_docs

Gets features from docs

Ntk.create_tuples

Creates tuples with sample and its target

Ntk.create_vocab_from_docs

Creates vocabulary from list of docs

Ntk.create_vocab_from_tuples

Creates vocabulary from list of tuples

Ntk.create_features_from_docs

Creates features from docs

Ntk.create_features_from_tuples

Creates features from tuples

Ntk.create_ngrams_features_from_docs

Creates ngrams features from docs

Ntk.create_ngrams_features_from_tuples

Creates ngrams features from tuples

Ntk.create_words_map

Creates the map of words

Ntk.create_words_cloud

Creates the cloud of words

Ntk.vectorize_docs

Vectorizes docs

Detailed list

class smltk.data_processing.DataProcessing(params: dict = {})

The class DataProcessing contains the simple methods to manage the data, both tabular and textual.

Arguments: params (dict) with the key below
language (str):

default is english

Here’s an example:

>>> from smltk.data_processing import DataProcessing
>>> doc = 'Good case, Excellent value.'
>>> dp = DataProcessing()
>>> get_doc_cleaned = dp.get_doc_cleaned(doc)
>>> print(get_doc_cleaned)
good case excellent value

clean_doc(tokens: list, is_alpha: bool = True, is_punctuation: bool = True) -> list

Filters punctuation from the tokens and converts uppercase characters to lowercase; if is_alpha is True, removes non-alphabetic characters, and if is_punctuation is True, removes punctuation

Arguments:
tokens (list[str]):

list of words

is_alpha (bool):

default is True

is_punctuation (bool):

default is True

Returns:

list of words filtered
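
The filtering described above can be sketched in plain Python. This is a minimal illustration, not the smltk implementation: the function name clean_tokens_sketch is hypothetical, and the order of the steps and the way punctuation is stripped are assumptions.

```python
import string


def clean_tokens_sketch(tokens, is_alpha=True, is_punctuation=True):
    """Hypothetical stand-in for clean_doc: lowercase, then filter tokens."""
    cleaned = []
    for token in tokens:
        token = token.lower()
        if is_punctuation:
            # Strip punctuation characters from the token (assumed behaviour).
            token = token.translate(str.maketrans("", "", string.punctuation))
        if is_alpha and not token.isalpha():
            # Drop tokens that still contain non-alphabetic characters.
            continue
        if token:
            cleaned.append(token)
    return cleaned


print(clean_tokens_sketch(["Good", "case,", "Excellent", "value."]))
# ['good', 'case', 'excellent', 'value']
```

With is_alpha=False the sketch keeps tokens such as "a1", which would otherwise be dropped.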

get_df(data)

Create a DataFrame from the data of the main repositories

Arguments:
data (mixed):

data loaded from one of the main repositories

Returns:

Pandas DataFrame

get_doc_cleaned(doc: str, is_lemma: bool = True, is_stem: bool = False, is_alpha: bool = True, is_punctuation: bool = True) -> str

Filters punctuation from doc and converts uppercase characters to lowercase; if is_lemma is True, it also lemmatizes, or if is_stem is True, it also stems

Arguments:
doc (str):

text

is_lemma (bool):

default is True

is_stem (bool):

default is False

is_alpha (bool):

default is True

is_punctuation (bool):

default is True

Returns:

string cleaned

get_inference_df(data, x_test, y_test, y_pred)

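Creates a DataFrame from the data of the main repositories, the test features, the targets, and the predictions

Arguments:
data (mixed):

data loaded from one of the main repositories

x_test (Pandas DataFrame):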

features used for the prediction

y_test (list of str):

list of the targets

y_pred (list of str):

list of the predictions

Returns:

Pandas DataFrame

get_tokens_cleaned(doc: str, is_lemma: bool = True, is_stem: bool = False, is_alpha: bool = True, is_punctuation: bool = True) -> list

Tokenizes doc, filters punctuation from the tokens, and converts uppercase characters to lowercase; if is_lemma is True, it also lemmatizes, or if is_stem is True, it also stems

Arguments:
doc (str):

text

is_lemma (bool):

default is True

is_stem (bool):

default is False

is_alpha (bool):

default is True

is_punctuation (bool):

default is True

Returns:

list of tokens cleaned

harmonize_words(words: list, synonyms: dict = None, lang: list = None) -> list

Harmonizes a list of words to a canonical form

Arguments:
words (list):

list of words/phrases (may contain None or NaN)

synonyms (dict):

optional dict mapping canonical -> list of variants; variants are normalized (lowercase + strip) automatically

lang (list):

optional list of languages for the lemmatizer, defaults to [self.language]

Returns:

list of words in canonical form, same length as input; None is preserved as is, empty strings stay empty

Raises:

ValueError if a variant in synonyms maps to two different canonical forms
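
The synonym handling described above can be sketched as follows. This is an illustration under stated assumptions, not the library code: the names build_variant_map and harmonize_sketch are hypothetical, lemmatization is left out, and passing NaN through unchanged is an assumption.

```python
def build_variant_map(synonyms):
    """Invert {canonical: [variants]} into {variant: canonical}."""
    variant_map = {}
    for canonical, variants in synonyms.items():
        for variant in variants:
            key = variant.lower().strip()  # normalize: lowercase + strip
            if key in variant_map and variant_map[key] != canonical:
                raise ValueError(f"{key!r} maps to two canonical forms")
            variant_map[key] = canonical
    return variant_map


def harmonize_sketch(words, synonyms=None):
    """Hypothetical stand-in for harmonize_words, without lemmatization."""
    variant_map = build_variant_map(synonyms or {})
    result = []
    for word in words:
        if not isinstance(word, str):
            # None (and NaN, by assumption) pass through unchanged.
            result.append(word)
            continue
        key = word.lower().strip()
        result.append(variant_map.get(key, key))
    return result


print(harmonize_sketch([" Laptop", "notebook", None, ""], {"laptop": ["notebook"]}))
# [‘laptop’, ‘laptop’, None, ‘’] shown with plain quotes: ['laptop', 'laptop', None, '']
```

Inverting the mapping once makes each lookup a single dictionary access, and it is also the natural place to detect a variant claimed by two canonical forms.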

lemmatize(tokens: list) -> list

Lemmatizes tokens

Arguments:
tokens (list[str]):

list of words

Returns:

list of tokens lemmatized

stem(tokens: list) -> list

Stems tokens

Arguments:
tokens (list[str]):

list of words

Returns:

list of tokens stemmed

tokenize(doc: str) -> list

Tokenizes doc

Arguments:
doc (str):

text

Returns:

list of words filtered

tokenize_and_clean_doc(doc: str, is_alpha: bool = True, is_punctuation: bool = True) -> list

Tokenizes doc, filters punctuation from the tokens, and converts uppercase characters to lowercase; if is_alpha is True, removes non-alphabetic characters, and if is_punctuation is True, removes punctuation

Arguments:
doc (str):

text

is_alpha (bool):

default is True

is_punctuation (bool):

default is True

Returns:

list of words filtered

transform_categories(data: DataFrame, categorical_features: dict = None, features: dict = None) -> list

Transforms categorical features into discrete features

Arguments:
data (Pandas DataFrame):

features to process

categorical_features (dict):

keys are the names of the categorical features, values are the ordered lists of their values

features (dict):

with the key categorical_features, whose value is the list of the categorical features

Returns:

list containing categorical_features and the transformed data
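
One way to read "discrete features" is ordinal encoding: each categorical value is replaced by its position in the ordered list of values. The sketch below shows that idea on a list of dicts so that it stays dependency-free. The function name encode_ordinal is hypothetical, and whether smltk encodes values exactly this way is an assumption.

```python
def encode_ordinal(rows, categorical_features):
    """Replace each categorical value with its index in the ordered list."""
    encoded = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        for name, ordered_values in categorical_features.items():
            new_row[name] = ordered_values.index(row[name])
        encoded.append(new_row)
    return encoded


rows = [{"size": "small", "price": 10}, {"size": "large", "price": 30}]
categorical_features = {"size": ["small", "medium", "large"]}
print(encode_ordinal(rows, categorical_features))
# [{'size': 0, 'price': 10}, {'size': 2, 'price': 30}]
```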

class smltk.data_processing.Ntk(params: list = [])

The class Ntk contains the Natural Language Processing tool kit. It extends the base class DataProcessing and overrides the methods lemmatize(), tokenize() and tokenize_and_clean_doc().

Arguments: params (dict) with the keys below
language (str):

default is english

lemmatizer (obj):

obj with method lemmatize() like WordNetLemmatizer()

min_length (int):

default is 1, so all words will be used

stop_words (list[str]):

default is stopwords.words()

tag_map (dict):

default contains J, V, R

Here’s an example:

>>> from smltk.data_processing import Ntk
>>> doc = 'Good case, Excellent value.'
>>> ntk = Ntk()
>>> get_doc_cleaned = ntk.get_doc_cleaned(doc)
>>> print(get_doc_cleaned)
good case excellent value

add_doc_to_vocab(doc: str, vocab: Counter, is_lemma: bool = True) -> list

Adds tokens of that doc to vocabulary and updates vocabulary

Arguments:
doc (str):

text

vocab (collections.Counter):

dictionary of tokens with its count

is_lemma (bool):

default is True

Returns:

list of tokens of that doc and vocab updated

create_features_from_docs(docs, target, is_lemma=True, words_top={}, degree=0)

Creates features from docs

Arguments:
docs (list[str]):

list of texts

target (str):

target name of the docs

is_lemma (bool):

default is True

words_top (dict):

dictionary of the top words

degree (int):

degree of ngrams, default is 0

Returns:

list of tuples with features and relative target

create_features_from_tuples(tuples, is_lemma=True, words_top={}, degree=0)

Creates features from tuples

Arguments:
tuples (list[tuples]):

list of tuples with sample and its target

is_lemma (bool):

default is True

words_top (dict):

dictionary of the top words

degree (int):

degree of ngrams, default is 0

Returns:

list of tuples with features and relative target

create_ngrams_features_from_docs(docs, target, is_lemma=True, degree=2)

Creates ngrams features from docs

Arguments:
docs (list[str]):

list of texts

target (str):

target name of the docs

is_lemma (bool):

default is True

degree (int):

degree of ngrams, default is 2

Returns:

list of tuples with features and relative target

create_ngrams_features_from_tuples(tuples, is_lemma=True, degree=2)

Creates ngrams features from tuples

Arguments:
tuples (list[tuples]):

list of tuples with sample and its target

is_lemma (bool):

default is True

degree (int):

degree of ngrams, default is 2

Returns:

list of tuples with features and relative target

create_tuples(docs: list = [], target: list = []) -> list

Creates tuples with sample and its target

Arguments:
docs (list[str]):

list of texts

target (list[str]):

list of targets

Returns:

list of tuples with sample and its target
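
The pairing of samples and targets can be pictured with the built-in zip. This is only a sketch of the idea with made-up sample data, not the method's actual code.

```python
docs = ["Good case, Excellent value.", "Broken after a week"]
target = ["positive", "negative"]

# Pair each sample with its target, position by position.
tuples = list(zip(docs, target))
print(tuples)
# [('Good case, Excellent value.', 'positive'), ('Broken after a week', 'negative')]
```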

create_vocab_from_docs(docs: list, is_lemma: bool = True) -> dict

Creates vocabulary from list of docs

Arguments:
docs (list[str]):

list of texts

is_lemma (bool):

default is True

Returns:

collections.Counter object mapping each token to its count
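
A vocabulary of this kind can be sketched with collections.Counter. The function name vocab_from_docs_sketch is hypothetical, and a plain whitespace split stands in for the library's tokenization and lemmatization.

```python
from collections import Counter


def vocab_from_docs_sketch(docs):
    """Count tokens over all docs; whitespace split stands in for tokenization."""
    vocab = Counter()
    for doc in docs:
        vocab.update(doc.lower().split())
    return vocab


vocab = vocab_from_docs_sketch(["good case", "good value"])
print(vocab)
# Counter({'good': 2, 'case': 1, 'value': 1})
```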

create_vocab_from_tuples(tuples: list, is_lemma: bool = True)

Creates vocabulary from list of tuples

Arguments:
tuples (list[tuples]):

list of tuples with sample and its target

is_lemma (bool):

default is True

Returns:

collections.Counter object mapping each token to its count

create_words_cloud(words, is_test=False)

Creates the cloud of words

Arguments:
words (str):

words

is_test (bool):

default is False

Returns:

only the word cloud plot

create_words_map(words)

Creates the map of words

Arguments:
words (list[str]):

words list

Returns:

string of all words

get_features(doc, is_lemma=True, words_top={}, degree=0)

Gets features

Arguments:
doc (str):

text

is_lemma (bool):

default is True

words_top (dict):

dictionary of the top words

degree (int):

degree of ngrams, default is 0

Returns:

dictionary of features extracted

get_features_from_docs(docs, is_lemma=True, words_top={}, degree=0)

Gets features from docs

Arguments:
docs (list[str]):

list of texts

is_lemma (bool):

default is True

words_top (dict):

dictionary of the top words

degree (int):

degree of ngrams, default is 0

Returns:

dictionary of features extracted

get_ngrams(degree=2, doc='', tokens=[], is_tuple=True, is_lemma=False)

Gets ngrams from doc or tokens

Arguments:
degree (int):

degree of ngrams, default is 2

doc (str):

text, optional if you pass tokens

tokens (list[str]):

list of tokens, optional if you pass doc

is_tuple (bool):

default is True

is_lemma (bool):

default is False

Returns:

list of tuples (n_grams) for that degree, or list of strings (tokens)
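
For reference, n-grams of a given degree can be built from a token list by zipping shifted copies of it. The function name ngrams_sketch is hypothetical, and this is a generic technique rather than the code used by get_ngrams.

```python
def ngrams_sketch(tokens, degree=2):
    """Return the list of n-grams (as tuples) of the given degree."""
    # zip stops at the shortest slice, so incomplete n-grams are dropped.
    return list(zip(*(tokens[i:] for i in range(degree))))


tokens = ["good", "case", "excellent", "value"]
print(ngrams_sketch(tokens, degree=2))
# [('good', 'case'), ('case', 'excellent'), ('excellent', 'value')]
```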

get_ngrams_features(degree=2, doc='', tokens=[], is_lemma=False)

Gets ngrams features from doc or tokens

Arguments:
degree (int):

degree of ngrams, default is 2

doc (str):

text, optional if you pass tokens

tokens (list[str]):

list of tokens, optional if you pass doc

is_lemma (bool):

default is False

Returns:

dictionary of ngrams extracted

get_stats_vocab(vocab: Counter, min_occurance: int = 1) -> tuple

Gets statistics of vocabulary

Arguments:
vocab (collections.Counter):

dictionary of tokens with its count

min_occurance (int):

minimum occurrence considered

Returns:

tuple with the number of tokens whose count is >= min_occurance and the total number of tokens
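
The two numbers in the returned tuple can be computed as in this sketch. The function name stats_vocab_sketch is hypothetical, and reading "total tokens number" as the number of distinct tokens in the vocabulary is an assumption.

```python
from collections import Counter


def stats_vocab_sketch(vocab, min_occurance=1):
    """Return (number of tokens with count >= min_occurance, total tokens)."""
    kept = [token for token, count in vocab.items() if count >= min_occurance]
    # Assumption: the total is the number of distinct tokens in the vocabulary.
    return len(kept), len(vocab)


vocab = Counter({"good": 3, "case": 1, "value": 2})
print(stats_vocab_sketch(vocab, min_occurance=2))
# (2, 3)
```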

get_vocabs_cleaned(vocabs)

Removes from the vocabularies the words that are common among targets

Arguments:
vocabs (dict):

keys are targets, values are vocabularies for that target

Returns:

vocabs cleaned from common words among targets
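
A possible reading of this cleaning step is sketched below: words that appear in the vocabulary of every target are removed from all of them. The function name vocabs_cleaned_sketch is hypothetical, and treating "common" as "present in all targets" is an assumption.

```python
from collections import Counter


def vocabs_cleaned_sketch(vocabs):
    """Drop from every vocabulary the words that occur in all of them."""
    common = set.intersection(*(set(vocab) for vocab in vocabs.values()))
    return {
        target: Counter({w: c for w, c in vocab.items() if w not in common})
        for target, vocab in vocabs.items()
    }


vocabs = {
    "positive": Counter({"good": 3, "case": 2}),
    "negative": Counter({"broken": 2, "case": 1}),
}
print(vocabs_cleaned_sketch(vocabs))
# {'positive': Counter({'good': 3}), 'negative': Counter({'broken': 2})}
```

Removing the shared words leaves only the tokens that help tell the targets apart.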

get_words_top(vocab, how_many)

Gets the top words for each target

Arguments:
vocab (collections.Counter):

dictionary of tokens with its count

how_many (int):

how many words in your top how_many list

Returns:

dictionary of the top how_many words
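
With a collections.Counter vocabulary, a top how_many selection can be sketched with most_common. The shape of the resulting dictionary (token to count) is an assumption about the return value.

```python
from collections import Counter

vocab = Counter({"good": 3, "case": 1, "value": 2})
how_many = 2

# most_common returns the how_many highest counts, in descending order.
words_top = dict(vocab.most_common(how_many))
print(words_top)
# {'good': 3, 'value': 2}
```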

lemmatize(tokens: list) -> list

Lemmatizes tokens

Arguments:
tokens (list[str]):

list of words

Returns:

list of tokens lemmatized

tokenize(doc: str) -> list

Splits the document into words

Arguments:
doc (str):

text

Returns:

list of words

tokenize_and_clean_doc(doc: str, is_alpha: bool = False, is_punctuation: bool = False) -> list

Tokenizes doc, filters out punctuation, numbers, stop words, and words of length <= min_length, and converts uppercase characters to lowercase

Arguments:
doc (str):

text

is_alpha (bool):

default is False

is_punctuation (bool):

default is False

Returns:

list of words filtered
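
The filtering chain described above can be sketched without any NLP dependency. Everything here is illustrative: the function name tokenize_and_clean_sketch, the regular expression, and the tiny stop-word list are assumptions, while the library default is stopwords.words(). The sketch follows the "<= min_length" wording literally.

```python
import re

# Tiny illustrative stop-word list; the library default is stopwords.words().
STOP_WORDS = {"a", "the", "is", "and"}


def tokenize_and_clean_sketch(doc, min_length=1, stop_words=STOP_WORDS):
    """Lowercase, keep alphabetic runs, drop stop words and short words."""
    # Keeping only runs of letters removes punctuation and numbers at once.
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in stop_words and len(t) > min_length]


print(tokenize_and_clean_sketch("The case is Good, and a 10/10 value."))
# ['case', 'good', 'value']
```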

vectorize_docs(docs, is_count=True, is_lemma=False, is_test=False)

Vectorizes docs

Arguments:
docs (list[str]):

list of texts

is_count (bool):

default is True

is_lemma (bool):

default is False

is_test (bool):

default is False

Returns:

list of scipy.sparse.csr.csr_matrix, one for each doc