Data Processing
Simple Machine Learning Tool Kit package
This package contains modules that simplify your data processing code.
It is part of the educational repositories (https://github.com/pandle/materials) for learning how to write standard code and the common uses of TDD.
The package contains two classes to manage data processing.
>>> import smltk
>>> help(smltk)
>>> from smltk.data_processing import DataProcessing
>>> help(DataProcessing)
# license MIT
# support https://github.com/bilardi/smltk/issues
The class DataProcessing contains the simple methods to manage the data, both tabular and textual.
The class Ntk contains the Natural Language Processing tool kit.
Data Processing
get_df | Create a DataFrame from the data of the main repositories
get_inference_df | Create a DataFrame from the data of the main repositories
transform_categories | Transforms categorical features into discrete features
clean_doc | Filters tokens and converts uppercase characters to lowercase; if is_alpha is True, removes non-alphabetic characters, and if is_punctuation is True, removes punctuation
tokenize_and_clean_doc | Tokenizes doc, filters the tokens and converts uppercase characters to lowercase; if is_alpha is True, removes non-alphabetic characters, and if is_punctuation is True, removes punctuation
get_tokens_cleaned | Tokenizes doc, filters punctuation from the tokens and converts uppercase characters to lowercase; if is_lemma is True, it also lemmatizes, or if is_stem is True, it also stems
get_doc_cleaned | Filters punctuation from doc and converts uppercase characters to lowercase; if is_lemma is True, it also lemmatizes, or if is_stem is True, it also stems
Natural Language Tool Kit
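get_stats_vocab | Gets statistics of vocabulary
get_words_top | Gets words top for each target
get_vocabs_cleaned | Cleans vocabs from common words among targets
get_ngrams | Gets ngrams from doc or tokens
get_ngrams_features | Gets ngrams features from doc or tokens
get_features | Gets features
get_features_from_docs | Gets features
create_tuples | Creates tuples with sample and its target
create_vocab_from_docs | Creates vocabulary from list of docs
create_vocab_from_tuples | Creates vocabulary from list of tuples
create_features_from_docs | Creates features from docs
create_features_from_tuples | Creates features from tuples
create_ngrams_features_from_docs | Creates ngrams features from docs
create_ngrams_features_from_tuples | Creates ngrams features from tuples
create_words_map | Creates the map of words
create_words_cloud | Creates the cloud of words
vectorize_docs | Vectorizes docs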
Detailed list
- class smltk.data_processing.DataProcessing(params: dict = {})
The class DataProcessing contains the simple methods to manage the data, both tabular and textual.
- Arguments: params (dict) with the key below
- language (str):
default is english
Here’s an example:
>>> from smltk.data_processing import DataProcessing
>>> doc = 'Good case, Excellent value.'
>>> dp = DataProcessing()
>>> get_doc_cleaned = dp.get_doc_cleaned(doc)
>>> print(get_doc_cleaned)
good case excellent value
- clean_doc(tokens: list, is_alpha: bool = True, is_punctuation: bool = True) -> list
Filters tokens and converts uppercase characters to lowercase; if is_alpha is True, removes non-alphabetic characters, and if is_punctuation is True, removes punctuation
- Arguments:
- tokens (list[str]):
list of words
- is_alpha (bool):
default is True
- is_punctuation (bool):
default is True
- Returns:
list of words filtered
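Here's a minimal standalone sketch of the filtering described for clean_doc. It does not import smltk: clean_tokens_sketch is a hypothetical helper, and both the order of the steps and the character-level stripping are assumptions, not the package's implementation.

```python
import string


def clean_tokens_sketch(tokens, is_alpha=True, is_punctuation=True):
    """Hypothetical stand-in for the behaviour described above (not smltk code)."""
    cleaned = []
    for token in tokens:
        token = token.lower()  # uppercase characters become lowercase
        if is_punctuation:
            # drop punctuation characters such as "," and "."
            token = "".join(ch for ch in token if ch not in string.punctuation)
        if is_alpha:
            # drop every remaining non-alphabetic character
            token = "".join(ch for ch in token if ch.isalpha())
        if token:  # skip tokens that became empty after filtering
            cleaned.append(token)
    return cleaned


tokens = ["Good", "case", ",", "Excellent", "value", "."]
cleaned = clean_tokens_sketch(tokens)
print(cleaned)
```

With both flags set to False, the sketch only lowercases the tokens.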
- get_df(data)
Create a DataFrame from the data of the main repositories
- Arguments:
- data (mixed):
data loaded from one of the main repositories
- Returns:
Pandas DataFrame
- get_doc_cleaned(doc: str, is_lemma: bool = True, is_stem: bool = False, is_alpha: bool = True, is_punctuation: bool = True) -> str
Filters punctuation from doc and converts uppercase characters to lowercase; if is_lemma is True, it also lemmatizes, or if is_stem is True, it also stems
- Arguments:
- doc (str):
text
- is_lemma (bool):
default is True
- is_stem (bool):
default is False
- is_alpha (bool):
default is True
- is_punctuation (bool):
default is True
- Returns:
string cleaned
- get_inference_df(data, x_test, y_test, y_pred)
Create a DataFrame from the data of the main repositories
- Arguments:
- x_test (Pandas DataFrame):
features used for the prediction
- y_test (list of str):
list of the targets
- y_pred (list of str):
list of the predictions
- Returns:
Pandas DataFrame
- get_tokens_cleaned(doc: str, is_lemma: bool = True, is_stem: bool = False, is_alpha: bool = True, is_punctuation: bool = True) -> list
Tokenizes doc, filters punctuation from the tokens and converts uppercase characters to lowercase; if is_lemma is True, it also lemmatizes, or if is_stem is True, it also stems
- Arguments:
- doc (str):
text
- is_lemma (bool):
default is True
- is_stem (bool):
default is False
- is_alpha (bool):
default is True
- is_punctuation (bool):
default is True
- Returns:
list of tokens cleaned
- harmonize_words(words: list, synonyms: dict = None, lang: list = None) -> list
Harmonize a list of words to a canonical form.
- Arguments:
- words (list):
list of words or phrases (may contain None or NaN)
- synonyms (dict):
optional, maps each canonical form to its list of variants; variants are normalized (lowercase + strip) automatically
- lang (list):
optional list of languages for the lemmatizer, default is [self.language]
- Returns:
list of words in canonical form, same length as input; None is preserved as is and empty strings stay empty
- Raises:
ValueError if a variant in synonyms maps to two different canonical forms
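Here's a minimal sketch of the synonym handling described for harmonize_words. It does not import smltk. The helper harmonize_sketch is hypothetical and leaves out the lemmatizer step entirely. Returning unmatched words in normalized form and passing non-string values through unchanged are assumptions.

```python
def harmonize_sketch(words, synonyms=None):
    """Hypothetical sketch of synonym-based harmonization (no lemmatizer)."""
    variant_to_canonical = {}
    for canonical, variants in (synonyms or {}).items():
        for variant in variants:
            key = variant.lower().strip()  # variants are normalized
            known = variant_to_canonical.get(key)
            if known is not None and known != canonical:
                # the same variant points at two different canonical forms
                raise ValueError(f"{key!r} maps to both {known!r} and {canonical!r}")
            variant_to_canonical[key] = canonical

    result = []
    for word in words:
        if not isinstance(word, str):
            result.append(word)  # None and other non-strings pass through
            continue
        key = word.lower().strip()
        # assumption: words without a synonym entry are returned normalized
        result.append(variant_to_canonical.get(key, key))
    return result


synonyms = {"laptop": ["Notebook", " portable pc "]}
harmonized = harmonize_sketch(["notebook", None, "", "Laptop"], synonyms)
print(harmonized)
```

The output list has the same length as the input, which keeps it aligned with the rows it came from.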
- lemmatize(tokens: list) -> list
Lemmatizes tokens
- Arguments:
- tokens (list[str]):
list of words
- Returns:
list of tokens lemmatized
- stem(tokens: list) -> list
Stems tokens
- Arguments:
- tokens (list[str]):
list of words
- Returns:
list of tokens stemmed
- tokenize(doc: str) -> list
Tokenizes doc
- Arguments:
- doc (str):
text
- Returns:
list of words filtered
- tokenize_and_clean_doc(doc: str, is_alpha: bool = True, is_punctuation: bool = True) -> list
Tokenizes doc, filters the tokens and converts uppercase characters to lowercase; if is_alpha is True, removes non-alphabetic characters, and if is_punctuation is True, removes punctuation
- Arguments:
- doc (str):
text
- is_alpha (bool):
default is True
- is_punctuation (bool):
default is True
- Returns:
list of words filtered
- transform_categories(data: DataFrame, categorical_features: dict = None, features: dict = None) -> list
Transforms categorical features into discrete features
- Arguments:
- data (Pandas DataFrame):
features to process
- categorical_features (dict):
keys are the names of the categorical features, values are the ordered lists of their values
- features (dict):
with the key categorical_features, whose value is the list of the categorical features
- Returns:
list with categorical_features and the transformed data
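Here's a minimal sketch of one way to read "ordered list of values": each categorical value is replaced by its position in that list. It does not import smltk or pandas. The helper encode_categories_sketch, the plain-dict rows and the position-based encoding are all assumptions, not the package's implementation.

```python
def encode_categories_sketch(rows, categorical_features):
    """Hypothetical ordinal encoding: value -> position in its ordered list."""
    encoded = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay untouched
        for name, ordered_values in categorical_features.items():
            new_row[name] = ordered_values.index(row[name])
        encoded.append(new_row)
    return encoded


rows = [{"size": "small", "price": 10}, {"size": "large", "price": 30}]
categorical_features = {"size": ["small", "medium", "large"]}
encoded = encode_categories_sketch(rows, categorical_features)
print(encoded)
```

Keeping the values ordered means the resulting integers preserve the ranking small < medium < large.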
- class smltk.data_processing.Ntk(params: list = [])
The class Ntk contains the Natural Language Processing tool kit. This class extends the base class DataProcessing and overrides the methods lemmatize(), tokenize() and tokenize_and_clean_doc().
- Arguments: params (dict) with the keys below
- language (str):
default is english
- lemmatizer (obj):
obj with method lemmatize() like WordNetLemmatizer()
- min_length (int):
default is 1, so all words will be used
- stop_words (list[str]):
default is stopwords.words()
- tag_map (dict):
default contains J, V, R
Here’s an example:
>>> from smltk.data_processing import Ntk
>>> doc = 'Good case, Excellent value.'
>>> ntk = Ntk()
>>> get_doc_cleaned = ntk.get_doc_cleaned(doc)
>>> print(get_doc_cleaned)
good case excellent value
- add_doc_to_vocab(doc: str, vocab: Counter, is_lemma: bool = True) -> list
Adds tokens of that doc to vocabulary and updates vocabulary
- Arguments:
- doc (str):
text
- vocab (collections.Counter):
dictionary of tokens with its count
- is_lemma (bool):
default is True
- Returns:
list of tokens of that doc and vocab updated
- create_features_from_docs(docs, target, is_lemma=True, words_top={}, degree=0)
Creates features from docs
- Arguments:
- docs (list[str]):
list of texts
- target (str):
target name of the docs
- is_lemma (bool):
default is True
- words_top (dict):
dictionary of the top words
- degree (int):
degree of ngrams, default is 0
- Returns:
list of tuples with features and their target
- create_features_from_tuples(tuples, is_lemma=True, words_top={}, degree=0)
Creates features from tuples
- Arguments:
- tuples (list[tuples]):
list of tuples with sample and its target
- is_lemma (bool):
default is True
- words_top (dict):
dictionary of the top words
- degree (int):
degree of ngrams, default is 0
- Returns:
list of tuples with features and their target
- create_ngrams_features_from_docs(docs, target, is_lemma=True, degree=2)
Creates ngrams features from docs
- Arguments:
- docs (list[str]):
list of texts
- target (str):
target name of the docs
- is_lemma (bool):
default is True
- degree (int):
degree of ngrams, default is 2
- Returns:
list of tuples with features and their target
- create_ngrams_features_from_tuples(tuples, is_lemma=True, degree=2)
Creates ngrams features from tuples
- Arguments:
- tuples (list[tuples]):
list of tuples with sample and its target
- is_lemma (bool):
default is True
- degree (int):
degree of ngrams, default is 2
- Returns:
list of tuples with features and their target
- create_tuples(docs: list = [], target: list = []) -> list
Creates tuples with sample and its target
- Arguments:
- docs (list[str]):
list of texts
- target (list[str]):
list of targets
- Returns:
list of tuples with sample and its target
- create_vocab_from_docs(docs: list, is_lemma: bool = True) -> dict
Creates vocabulary from list of docs
- Arguments:
- docs (list[str]):
list of texts
- is_lemma (bool):
default is True
- Returns:
dictionary of tokens with its count in an object collections.Counter
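Here's a minimal sketch of what a vocabulary held in a collections.Counter looks like. It does not import smltk: build_vocab_sketch is a hypothetical helper, and str.split is only a placeholder for the package's own tokenizing, cleaning and lemmatizing.

```python
from collections import Counter


def build_vocab_sketch(docs):
    """Hypothetical sketch: count tokens over a list of docs."""
    vocab = Counter()
    for doc in docs:
        # placeholder tokenization: lowercase and split on whitespace
        vocab.update(doc.lower().split())
    return vocab


docs = ["good case", "good value", "bad case"]
vocab = build_vocab_sketch(docs)
print(vocab)
```

Because the result is a Counter, tokens that never occurred return a count of 0 instead of raising a KeyError.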
- create_vocab_from_tuples(tuples: list, is_lemma: bool = True)
Creates vocabulary from list of tuples
- Arguments:
- tuples (list[tuples]):
list of tuples with sample and its target
- is_lemma (bool):
default is True
- Returns:
dictionary of tokens with its count in an object collections.Counter
- create_words_cloud(words, is_test=False)
Creates the cloud of words
- Arguments:
- words (str):
words
- is_test (bool):
default is False
- Returns:
only words cloud plot
- create_words_map(words)
Creates the map of words
- Arguments:
- words (list[str]):
words list
- Returns:
string of all words
- get_features(doc, is_lemma=True, words_top={}, degree=0)
Gets features
- Arguments:
- doc (str):
text
- is_lemma (bool):
default is True
- words_top (dict):
dictionary of the top words
- degree (int):
degree of ngrams, default is 0
- Returns:
dictionary of features extracted
- get_features_from_docs(docs, is_lemma=True, words_top={}, degree=0)
Gets features
- Arguments:
- docs (list[str]):
list of texts
- is_lemma (bool):
default is True
- words_top (dict):
dictionary of the top words
- degree (int):
degree of ngrams, default is 0
- Returns:
dictionary of features extracted
- get_ngrams(degree=2, doc='', tokens=[], is_tuple=True, is_lemma=False)
Gets ngrams from doc or tokens
- Arguments:
- degree (int):
degree of ngrams, default is 2
- doc (str):
text, optional if you pass tokens
- tokens (list[str]):
list of tokens, optional if you pass doc
- is_tuple (bool):
default is True
- is_lemma (bool):
default is False
- Returns:
list of tuples (n_grams) for that degree, or list of string (token)
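Here's a minimal sketch of building n-grams of a given degree from a list of tokens. It does not import smltk: ngrams_sketch is a hypothetical helper, and joining each n-gram with a space when is_tuple is False is an assumption about the string form.

```python
def ngrams_sketch(tokens, degree=2, is_tuple=True):
    """Hypothetical sketch: sliding window of `degree` consecutive tokens."""
    # zip the token list with copies of itself shifted by 1, 2, ... positions
    grams = list(zip(*(tokens[i:] for i in range(degree))))
    if is_tuple:
        return grams
    # assumption: the string form joins the tokens with a single space
    return [" ".join(gram) for gram in grams]


tokens = ["good", "case", "excellent", "value"]
bigrams = ngrams_sketch(tokens)
trigrams = ngrams_sketch(tokens, degree=3, is_tuple=False)
print(bigrams)
print(trigrams)
```

A list of n tokens yields n - degree + 1 n-grams, so four tokens give three bigrams and two trigrams.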
- get_ngrams_features(degree=2, doc='', tokens=[], is_lemma=False)
Gets ngrams features from doc or tokens
- Arguments:
- degree (int):
degree of ngrams, default is 2
- doc (str):
text, optional if you pass tokens
- tokens (list[str]):
list of tokens, optional if you pass doc
- is_lemma (bool):
default is False
- Returns:
dictionary of ngrams extracted
- get_stats_vocab(vocab: Counter, min_occurance: int = 1) -> tuple
Gets statistics of vocabulary
- Arguments:
- vocab (collections.Counter):
dictionary of tokens with their counts
- min_occurance (int):
minimum occurrence considered
- Returns:
tuple with the number of tokens whose count is >= min_occurance and the total number of tokens
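Here's a minimal sketch of the two numbers described for get_stats_vocab. It does not import smltk: stats_vocab_sketch is a hypothetical helper, and reading "total tokens number" as the number of distinct tokens in the vocabulary is an assumption.

```python
from collections import Counter


def stats_vocab_sketch(vocab, min_occurance=1):
    """Hypothetical sketch: (tokens with count >= threshold, distinct tokens)."""
    kept = sum(1 for count in vocab.values() if count >= min_occurance)
    # assumption: the total is the number of distinct tokens, not the sum of counts
    return kept, len(vocab)


vocab = Counter({"good": 3, "case": 2, "bad": 1})
stats = stats_vocab_sketch(vocab, min_occurance=2)
print(stats)
```

Comparing the two numbers shows how much of the vocabulary survives a given threshold.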
- get_vocabs_cleaned(vocabs)
Cleans vocabs from common words among targets
- Arguments:
- vocabs (dict):
keys are targets, values are vocabularies for that target
- Returns:
vocabs cleaned from common words among targets
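Here's a minimal sketch of removing words shared among targets. It does not import smltk: clean_vocabs_sketch is a hypothetical helper, and treating a word as "common" only when it appears in the vocabulary of every target is an assumption.

```python
from collections import Counter


def clean_vocabs_sketch(vocabs):
    """Hypothetical sketch: drop words that appear under every target."""
    if not vocabs:
        return {}
    # assumption: "common" means present in the vocabulary of all targets
    common = set.intersection(*(set(vocab) for vocab in vocabs.values()))
    return {
        target: Counter({w: c for w, c in vocab.items() if w not in common})
        for target, vocab in vocabs.items()
    }


vocabs = {
    "positive": Counter({"good": 3, "case": 2}),
    "negative": Counter({"bad": 2, "case": 1}),
}
cleaned_vocabs = clean_vocabs_sketch(vocabs)
print(cleaned_vocabs)
```

Words such as "case" carry no signal for telling the targets apart, which is why they are dropped from both vocabularies.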
- get_words_top(vocab, how_many)
Gets the top words for each target
- Arguments:
- vocab (collections.Counter):
dictionary of tokens with their counts
- how_many (int):
number of words to keep in the top list
- Returns:
dictionary of the top how_many words
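Here's a minimal sketch of picking the most frequent words from a vocabulary. It does not import smltk: words_top_sketch is a hypothetical helper built on collections.Counter.most_common, and returning a word-to-count dictionary is an assumption about the shape of the result.

```python
from collections import Counter


def words_top_sketch(vocab, how_many):
    """Hypothetical sketch: the how_many most frequent words with their counts."""
    return dict(vocab.most_common(how_many))


vocab = Counter({"good": 3, "case": 2, "bad": 1})
top_words = words_top_sketch(vocab, 2)
print(top_words)
```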
- lemmatize(tokens: list) -> list
Lemmatizes tokens
- Arguments:
- tokens (list[str]):
list of words
- Returns:
list of tokens lemmatized
- tokenize(doc: str) -> list
Splits document in each word
- Arguments:
- doc (str):
text
- Returns:
list of words
- tokenize_and_clean_doc(doc: str, is_alpha: bool = False, is_punctuation: bool = False) -> list
Tokenizes doc, filters out punctuation, numbers, stop words, and words of length <= min_length, and converts uppercase characters to lowercase
- Arguments:
- doc (str):
text
- is_alpha (bool):
default is False
- is_punctuation (bool):
default is False
- Returns:
list of words filtered
- vectorize_docs(docs, is_count=True, is_lemma=False, is_test=False)
Vectorizes docs
- Arguments:
- docs (list[str]):
list of texts
- is_count (bool):
default is True
- is_lemma (bool):
default is False
- is_test (bool):
default is False
- Returns:
list of scipy.sparse.csr.csr_matrix, one for each doc