Data Analysis

Simple Machine Learning Tool Kit package

This package contains modules that simplify the code for your data analysis processes.

It is part of the educational repositories (https://github.com/pandle/materials) for learning how to write standard code and common uses of TDD.

The package contains two classes to manage data analysis.

>>> import smltk
>>> help(smltk)
>>> from smltk.data_analysis import DataAnalysis
>>> help(DataAnalysis)

# license MIT
# support https://github.com/bilardi/smltk/issues

DataAnalysis

The class DataAnalysis contains methods to explore data analysis.

Data Analysis

DataAnalysis.get_eda

Plot images for exploratory data analysis

Detailed list

class smltk.data_analysis.DataAnalysis

The class DataAnalysis contains methods to explore data analysis.

Here’s an example:

>>> import pandas as pd
>>> from smltk.data_analysis import DataAnalysis
>>> df = pd.read_csv(filename)
>>> da = DataAnalysis()
>>> features = da.get_eda(target, df)  # target: name of the target feature in df
>>> print(list(features.keys()))
['data_amount', 'hue_order', 'categorical_features', 'numerical_features', 'data_missing', 'relations']

biserial_corr(x: Series, y: Series): biserial (dichotomous, continuous-continuous)

choose_relations(x: Series, y: Series) → dict: Choose relations based on data type

cramers_v(x: Series, y: Series): Cramer’s V (nominal categories)
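As an illustration of what Cramér's V measures, here is a minimal pure-Python sketch built from the chi-squared statistic of the contingency table. The function name `cramers_v_sketch` is hypothetical, no bias correction is applied, and this is not necessarily how the package computes it.

```python
from collections import Counter
from math import sqrt


def cramers_v_sketch(x, y):
    """Cramér's V for two equal-length sequences of nominal labels.

    Illustrative only: no bias correction, not the smltk implementation.
    """
    n = len(x)
    joint = Counter(zip(x, y))  # observed counts per (x, y) pair
    row = Counter(x)            # marginal counts of x
    col = Counter(y)            # marginal counts of y

    # Chi-squared statistic over every cell of the contingency table.
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row), len(col)) - 1
    if k == 0:
        return 0.0  # a constant variable carries no association
    return sqrt(chi2 / (n * k))


v_perfect = cramers_v_sketch(["a", "a", "b", "b"], ["u", "u", "v", "v"])
v_independent = cramers_v_sketch(["a", "a", "b", "b"], ["u", "v", "u", "v"])
```

Values close to 1 indicate a strong association between the two categorical variables, values close to 0 indicate none.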

get_eda(target: str, data: DataFrame, params: dict = {}) → dict

Plot images for exploratory data analysis

Arguments:

target (string): name of the target feature
data (Pandas DataFrame): features to analyze
params (dict): with the keys below
    columns_to_filter (list[str]): list of feature names to filter
    color_palette (string): matplotlib colormap name, by default Set2
    sample.frac (string): fraction of axis items to return
    corr_plot.cmap (string): the mapping from data values to color space, matplotlib colormap name or object, or list of colors, by default viridis
    missingval_plot.cmap (string): matplotlib colormap name or object, by default Set2
    analyses.skip (list[str]): names of blocks to skip entirely (no calculation, no plot). Valid names are listed in DataAnalysis.VALID_BLOCK_NAMES.
    plots.skip (list[str]): names of blocks whose plots are skipped (calculations are still performed). Relevant for relations_heatmaps, which keeps populating features["relations"].

Returns:

plots and features split into categorical and numerical features. Note: when relations_heatmaps is in analyses.skip, the "relations" key is omitted from the returned dict.
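A possible way to assemble the params argument is sketched below. It assumes that the dotted names listed in the arguments (for example plots.skip) are used verbatim as flat dictionary keys; that is an assumption, not something the text confirms. The column name "id" is purely hypothetical.

```python
# Assumption: dotted option names are flat string keys of the params dict.
params = {
    "columns_to_filter": ["id"],            # hypothetical column name
    "color_palette": "Set2",                # documented default
    "corr_plot.cmap": "viridis",            # documented default
    "analyses.skip": [],                    # run every analysis block
    "plots.skip": ["relations_heatmaps"],   # compute relations, skip their plots
}

# With smltk installed and a DataFrame df at hand, the call would look like:
# from smltk.data_analysis import DataAnalysis
# features = DataAnalysis().get_eda("target_column", df, params)
```

With relations_heatmaps listed only under plots.skip, the returned dict should still carry the "relations" key, as described in the note above.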

get_features_info(target: str, data: DataFrame, params: dict = {}) → dict

Calculate features types and data missing percentage

Arguments:

target (string): name of the target feature
data (Pandas DataFrame): features to analyze
params (dict): with the keys below
    columns_to_filter (list[str]): list of feature names to filter

Returns:

dict with categorical features list and numerical features list

get_relations(data: DataFrame, columns_to_use: list = []) → dict

Calculate relations among features

Arguments:

data (Pandas DataFrame): features to analyze
columns_to_use (list[str]): list of feature names to use

Returns:

dictionary of Pandas DataFrames holding the relations

kendall_corr(x: Series, y: Series): Kendall (monotonic, ordinal)

mutual_info(x: Series, y: Series): mutual information (generic, discrete vs discrete)
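For intuition, mutual information between two discrete variables can be sketched in pure Python from the joint and marginal frequencies. The name `mutual_info_sketch` is hypothetical and the result is expressed in nats; the package may rely on a different estimator or unit.

```python
from collections import Counter
from math import log


def mutual_info_sketch(x, y):
    """Mutual information (in nats) of two equal-length discrete sequences.

    Illustrative only, not the smltk implementation.
    """
    n = len(x)
    joint = Counter(zip(x, y))  # joint counts
    px = Counter(x)             # marginal counts of x
    py = Counter(y)             # marginal counts of y

    mi = 0.0
    for (a, b), count in joint.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) ), written with raw counts
        mi += (count / n) * log(count * n / (px[a] * py[b]))
    return mi


mi_identical = mutual_info_sketch(["a", "a", "b", "b"], ["a", "a", "b", "b"])
mi_independent = mutual_info_sketch(["a", "a", "b", "b"], ["u", "v", "u", "v"])
```

Two identical balanced binary variables share log(2) nats of information, while independent variables share none.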

pearson_corr(x: Series, y: Series): Pearson (linear, continuous-continuous)

phi_coeff(x: Series, y: Series): Phi coefficient (2 binary variables)
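The phi coefficient can be read directly off the 2x2 contingency table of two binary variables. The sketch below assumes both inputs are coded as 0/1; the name `phi_sketch` is hypothetical and the code is not taken from the package.

```python
from math import sqrt


def phi_sketch(x, y):
    """Phi coefficient for two equal-length sequences coded as 0/1.

    Illustrative only, not the smltk implementation.
    """
    pairs = list(zip(x, y))
    n11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    n10 = sum(1 for a, b in pairs if a == 1 and b == 0)
    n01 = sum(1 for a, b in pairs if a == 0 and b == 1)
    n00 = sum(1 for a, b in pairs if a == 0 and b == 0)

    # Product of the four marginal totals of the 2x2 table.
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # one of the variables is constant
    return (n11 * n00 - n10 * n01) / denom


phi_same = phi_sketch([1, 1, 0, 0], [1, 1, 0, 0])
phi_opposite = phi_sketch([1, 1, 0, 0], [0, 0, 1, 1])
```

The result ranges from -1 (perfect disagreement) to 1 (perfect agreement).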

pointbiserial_corr(x: Series, y: Series): point-biserial (binary vs continuous)
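The point-biserial correlation compares the means of a continuous variable across the two groups of a binary one. A minimal sketch, assuming the binary variable is coded as 0/1 and both groups are non-empty (the name `pointbiserial_sketch` is hypothetical, and the package may compute it differently):

```python
from math import sqrt


def pointbiserial_sketch(binary, values):
    """Point-biserial correlation between a 0/1 sequence and a numeric one.

    Illustrative only, not the smltk implementation.
    """
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    n, n1, n0 = len(values), len(group1), len(group0)

    # Population standard deviation of the continuous variable.
    mean_all = sum(values) / n
    s_n = sqrt(sum((v - mean_all) ** 2 for v in values) / n)

    m1 = sum(group1) / n1
    m0 = sum(group0) / n0
    return (m1 - m0) / s_n * sqrt(n1 * n0 / n ** 2)


r_pb = pointbiserial_sketch([0, 0, 1, 1], [1.0, 2.0, 3.0, 4.0])
```

This value coincides with the Pearson correlation computed on the same two sequences.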

spearman_corr(x: Series, y: Series): Spearman (monotonic, continuous-ordinal)
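To make the difference between pearson_corr (linear) and spearman_corr (monotonic) concrete, the pure-Python sketch below computes both on a cubic relationship. Spearman is implemented here as Pearson on average ranks; all function names are hypothetical and the code is not taken from the package.

```python
from math import sqrt


def pearson_sketch(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_sketch(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson_sketch(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but not linear
r_pearson = pearson_sketch(x, y)
r_spearman = spearman_sketch(x, y)
```

Because y grows monotonically with x, the Spearman value reaches 1, while the Pearson value stays below 1 since the relationship is not linear.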