Data Analysis

Simple Machine Learning Tool Kit package

This package contains modules that simplify the code for your data analysis processes.

It is part of the educational repositories (https://github.com/pandle/materials) for learning how to write standard code and the common uses of TDD.

The package contains two classes to manage data analysis.

>>> import smltk
>>> help(smltk)
>>> from smltk.data_analysis import DataAnalysis
>>> help(DataAnalysis)

# license MIT
# support https://github.com/bilardi/smltk/issues

DataAnalysis

The class DataAnalysis contains methods for exploratory data analysis.

Data Analysis

DataAnalysis.get_eda

Plot images for exploratory data analysis

Detailed list

class smltk.data_analysis.DataAnalysis

The class DataAnalysis contains methods for exploratory data analysis.

Here’s an example:

>>> import pandas as pd
>>> from smltk.data_analysis import DataAnalysis
>>> df = pd.read_csv(filename)
>>> da = DataAnalysis()
>>> features = da.get_eda(target, df)
>>> print(features.keys())
["data_amount", "hue_order", "categorical_features", "numerical_features", "data_missing", "relations"]

biserial_corr(x: Series, y: Series)

biserial (a dichotomized continuous variable vs a continuous variable)

choose_relations(x: Series, y: Series) → dict

Choose relations based on data type
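
How the choice is made is not documented here. The following is only a minimal sketch of one plausible dispatch, built from the parenthetical descriptions of the measures on this page. The helpers `kind_of` and `pick_measure` are hypothetical, work on plain lists, and return a single measure name rather than the dict that choose_relations returns.

```python
def kind_of(values):
    """Classify a sequence as 'binary', 'continuous' or 'nominal' (hypothetical rule)."""
    distinct = set(values)
    if len(distinct) == 2:
        return "binary"
    if all(isinstance(v, (int, float)) for v in distinct):
        return "continuous"
    return "nominal"


def pick_measure(x, y):
    """Return the name of a measure suited to the two variable kinds (assumed mapping)."""
    kinds = tuple(sorted((kind_of(x), kind_of(y))))
    mapping = {
        ("binary", "binary"): "phi_coeff",
        ("binary", "continuous"): "pointbiserial_corr",
        ("continuous", "continuous"): "pearson_corr",
    }
    # Anything involving a nominal variable falls back to Cramer's V in this sketch.
    return mapping.get(kinds, "cramers_v")


both_binary = pick_measure([0, 1, 1, 0], [1, 1, 0, 0])
binary_vs_continuous = pick_measure([0, 1, 1, 0], [1.5, 2.0, 3.1, 0.2])
both_nominal = pick_measure(["a", "b", "c"], ["x", "y", "z"])
```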

cramers_v(x: Series, y: Series)

Cramer’s V (nominal categories)
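
As an illustration of what this statistic measures, here is a dependency-free sketch that derives Cramer's V from the chi-squared statistic of the contingency table. It works on plain lists instead of pandas Series, applies no bias correction, and the name `cramers_v_sketch` is hypothetical; the package's own implementation may differ.

```python
import math
from collections import Counter


def cramers_v_sketch(x, y):
    """Cramer's V for two equal-length sequences of category labels."""
    n = len(x)
    joint = Counter(zip(x, y))   # observed count of each (x, y) pair
    row = Counter(x)             # marginal counts of x
    col = Counter(y)             # marginal counts of y
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return math.sqrt(chi2 / (n * k))


# Each color always comes with the same size: perfect association.
color = ["r", "g", "b", "r", "g", "b"]
size = ["s", "m", "l", "s", "m", "l"]
v_perfect = cramers_v_sketch(color, size)

# Every combination appears equally often: no association.
v_none = cramers_v_sketch(["a", "a", "b", "b"], ["u", "v", "u", "v"])
```

The value ranges from 0 (no association) to 1 (one variable fully determines the other).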

get_eda(target: str, data: DataFrame, params: dict = {}) → dict

Plot images for exploratory data analysis

Arguments:
target (string):

name of the target feature

data (Pandas DataFrame):

features to analyze

params (dict):

with the keys below

columns_to_filter (list[str]):

list of feature names to filter

color_palette (string):

matplotlib colormap name, by default Set2

sample.frac (string):

fraction of axis items to return

corr_plot.cmap (string):

the mapping from data values to color space, matplotlib colormap name or object, or list of colors, by default viridis

missingval_plot.cmap (string):

matplotlib colormap name or object, by default Set2

analyses.skip (list[str]):

names of blocks to skip entirely (no calculation, no plot). Valid names are listed in DataAnalysis.VALID_BLOCK_NAMES.

plots.skip (list[str]):

names of blocks whose plots are skipped (calculations are still performed). Relevant for relations_heatmaps, which keeps populating features["relations"].

Returns:

plots and features split into categorical and numerical features. Note: when relations_heatmaps is in analyses.skip, the "relations" key is omitted from the returned dict.
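
The params keys listed above could be collected as in the sketch below. It assumes the dotted names (for example plots.skip) are flat string keys of a single dict, which this page does not confirm. The column name "id" is hypothetical, and only documented defaults or names mentioned on this page are used as values.

```python
# Keys documented for get_eda; sample.frac is left out because its value format is not shown here.
DOCUMENTED_KEYS = {
    "columns_to_filter",
    "color_palette",
    "sample.frac",
    "corr_plot.cmap",
    "missingval_plot.cmap",
    "analyses.skip",
    "plots.skip",
}

params = {
    "columns_to_filter": ["id"],           # hypothetical column name
    "color_palette": "Set2",               # documented default
    "corr_plot.cmap": "viridis",           # documented default
    "missingval_plot.cmap": "Set2",        # documented default
    "plots.skip": ["relations_heatmaps"],  # skip the plot, keep the calculation
}

# Catch misspelled keys before passing the dict on.
unknown_keys = set(params) - DOCUMENTED_KEYS

# Assumed call, not executed here:
# features = DataAnalysis().get_eda(target, df, params)
```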

get_features_info(target: str, data: DataFrame, params: dict = {}) → dict

Calculate features types and data missing percentage

Arguments:
target (string):

name of the target feature

data (Pandas DataFrame):

features to analyze

params (dict):

with the keys below

columns_to_filter (list[str]):

list of feature names to filter

Returns:

dict with categorical features list and numerical features list

get_relations(data: DataFrame, columns_to_use: list = []) → dict

Calculate relation among features

Arguments:
data (Pandas DataFrame):

features to analyze

columns_to_use (list[str]):

list of feature names to use

Returns:

dictionary of Pandas DataFrames with the relations

kendall_corr(x: Series, y: Series)

Kendall (monotonic, ordinal)
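
To make the idea concrete, the sketch below computes Kendall's tau-a by counting concordant and discordant pairs. It assumes plain lists without ties. The function name `kendall_tau_a` is hypothetical and the package may use a different variant, such as tau-b.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both variables move in the same direction
        elif sign < 0:
            discordant += 1   # the variables move in opposite directions
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


rank_a = [1, 2, 3, 4]
rank_b = [1, 3, 2, 4]   # one swapped pair
tau = kendall_tau_a(rank_a, rank_b)
```

With six pairs in total, five concordant and one discordant, tau is (5 - 1) / 6.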

mutual_info(x: Series, y: Series)

mutual information (generic, discrete vs discrete)
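
For two discrete variables, mutual information can be estimated directly from the observed frequencies. The sketch below does so in nats on plain lists. The name `mutual_information` is hypothetical and the package's estimator may differ.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    count_x = Counter(x)
    count_y = Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p(a, b) / (p(a) * p(b)), written with raw counts
        ratio = p_ab * n * n / (count_x[a] * count_y[b])
        mi += p_ab * math.log(ratio)
    return mi


# y is a copy of x, so knowing x removes all uncertainty about y.
mi_dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])

# Every combination is equally likely, so x says nothing about y.
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```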

pearson_corr(x: Series, y: Series)

Pearson (linear, continuous-continuous)
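
As a reminder of what the coefficient computes, here is a minimal sketch on plain lists: the covariance divided by the product of the standard deviations. The name `pearson_r` is hypothetical; it is not the package's implementation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]   # exactly linear in hours
r = pearson_r(hours, score)
```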

phi_coeff(x: Series, y: Series)

Phi coefficient (2 binary variables)
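
The phi coefficient follows from the four cells of the 2x2 contingency table. The sketch below assumes both variables are coded as 0 and 1 and uses plain lists. The name `phi_coefficient` is hypothetical.

```python
import math


def phi_coefficient(x, y):
    """Phi coefficient for two sequences coded as 0/1."""
    pairs = list(zip(x, y))
    n11 = pairs.count((1, 1))
    n10 = pairs.count((1, 0))
    n01 = pairs.count((0, 1))
    n00 = pairs.count((0, 0))
    numerator = n11 * n00 - n10 * n01
    # Product of the four marginal totals of the 2x2 table
    denominator = math.sqrt(
        (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    )
    return numerator / denominator


treated = [1, 1, 1, 0, 0, 0]
recovered = [1, 1, 0, 0, 0, 1]
phi = phi_coefficient(treated, recovered)
```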

pointbiserial_corr(x: Series, y: Series)

point-biserial (binary vs continuous)
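
One common way to write the point-biserial coefficient is the difference between the two group means, scaled by the population standard deviation and by the group proportions. The sketch below assumes a 0/1 coding and plain lists. The name `point_biserial` is hypothetical.

```python
import math
from statistics import fmean, pstdev


def point_biserial(binary, values):
    """Point-biserial r between a 0/1 sequence and a numeric sequence."""
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    p = len(group1) / len(values)   # share of observations coded 1
    q = 1 - p
    spread = pstdev(values)         # population standard deviation
    return (fmean(group1) - fmean(group0)) / spread * math.sqrt(p * q)


passed = [0, 0, 1, 1]
hours = [1, 2, 3, 4]
r_pb = point_biserial(passed, hours)
```

The result equals the Pearson correlation computed with the binary variable treated as numeric.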

spearman_corr(x: Series, y: Series)

Spearman (monotonic, continuous-ordinal)
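
Spearman's coefficient is the Pearson correlation of the ranks. The sketch below shows that on plain lists, giving tied values the average of their ranks. The names `average_ranks` and `spearman_rho` are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


side = [1, 2, 3, 4, 5]
volume = [1, 8, 27, 64, 125]   # monotonic but not linear in side
rho = spearman_rho(side, volume)
tied = average_ranks([10, 20, 20, 30])
```

Because only the ordering matters, any strictly increasing relation gives a coefficient of 1 even when it is far from linear.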