Data Analysis
Simple Machine Learning Tool Kit package
This package contains the modules to simplify your code for your data analysis processes.
It is part of the educational repositories (https://github.com/pandle/materials) for learning how to write standard code and common uses of TDD.
The package contains two classes to manage data analysis.
>>> import smltk
>>> help(smltk)
>>> from smltk.data_analysis import DataAnalysis
>>> help(DataAnalysis)
# license MIT
# support https://github.com/bilardi/smltk/issues
The class DataAnalysis contains methods to explore data analysis.
Data Analysis
Plot images for exploratory data analysis
Detailed list
- class smltk.data_analysis.DataAnalysis
The class DataAnalysis contains methods to explore data analysis.
Here’s an example:
>>> import pandas as pd
>>> from smltk.data_analysis import DataAnalysis
>>> df = pd.read_csv(filename)
>>> da = DataAnalysis()
>>> features = da.get_eda(target, df)
>>> print(features.keys())
["data_amount", "hue_order", "categorical_features", "numerical_features", "data_missing", "relations"]
- biserial_corr(x: Series, y: Series)
biserial (dichotomous, continuous-continuous)
- choose_relations(x: Series, y: Series) -> dict
Choose relations based on data type
- cramers_v(x: Series, y: Series)
Cramer’s V (nominal categories)
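To make the measure concrete, here is a minimal pure-Python sketch of the textbook Cramér's V, computed from the chi-squared statistic of the contingency table. The function name cramers_v_sketch and the sample data are hypothetical, and no bias correction is applied; the package's own cramers_v may be implemented differently.

```python
import math
from collections import Counter


def cramers_v_sketch(x, y):
    """Textbook Cramér's V for two equally long sequences of nominal labels."""
    n = len(x)
    # Observed counts for each (x, y) pair and for each margin.
    observed = Counter(zip(x, y))
    x_totals = Counter(x)
    y_totals = Counter(y)

    # Pearson chi-squared statistic summed over every cell of the table.
    chi2 = 0.0
    for x_value, x_count in x_totals.items():
        for y_value, y_count in y_totals.items():
            expected = x_count * y_count / n
            chi2 += (observed[(x_value, y_value)] - expected) ** 2 / expected

    # Normalise by the sample size and the smaller table dimension minus one.
    k = min(len(x_totals), len(y_totals)) - 1
    if k == 0:
        return 0.0  # a constant column carries no association
    return math.sqrt(chi2 / (n * k))


# Made-up example data.
colour = ["red", "red", "blue", "blue", "green", "green"]
shape = ["circle", "circle", "square", "square", "star", "star"]
size = ["small", "large", "small", "large", "small", "large"]

print(cramers_v_sketch(colour, shape))  # each colour maps to exactly one shape
print(cramers_v_sketch(colour, size))   # size is spread evenly across colours
```

The value ranges from 0 (no association) to 1 (each category of one variable determines the category of the other).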
- get_eda(target: str, data: DataFrame, params: dict = {}) -> dict
Plot images for exploratory data analysis
- Arguments:
- target (string):
name of feature target
- data (Pandas DataFrame):
features to analyze
- params (dict):
with the keys below
- columns_to_filter (list[str]):
list of features names to filter
- color_palette (string):
matplotlib colormap name, by default Set2
- sample.frac (string):
fraction of axis items to return
- corr_plot.cmap (string):
the mapping from data values to color space, matplotlib colormap name or object, or list of colors, by default viridis
- missingval_plot.cmap (string):
matplotlib colormap name or object, by default Set2
- analyses.skip (list[str]):
names of blocks to skip entirely (no calc, no plot). Valid names are listed in DataAnalysis.VALID_BLOCK_NAMES.
- plots.skip (list[str]):
names of blocks whose plots are skipped (calculations still performed). Relevant for relations_heatmaps, which still populates features[“relations”].
- Returns:
plots and features split into categorical and numerical features. Note: when relations_heatmaps is in analyses.skip, the “relations” key is omitted from the returned dict.
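As an illustration of the options above, the sketch below builds a params dictionary. Only the key names and the documented defaults come from the list above; the column name "id" is made up, and treating the dotted names (such as plots.skip) as literal flat dictionary keys is an assumption about how the package reads them.

```python
# Hypothetical option values; only the key names come from the documented list.
# Assumption: dotted names such as "plots.skip" are literal flat keys.
params = {
    "columns_to_filter": ["id"],             # made-up column name
    "color_palette": "Set2",                 # documented default
    "corr_plot.cmap": "viridis",             # documented default
    "missingval_plot.cmap": "Set2",          # documented default
    "analyses.skip": [],                     # run every analysis block
    "plots.skip": ["relations_heatmaps"],    # compute relations, skip the plot
}

# With smltk installed, the dictionary would be passed as the third argument:
# from smltk.data_analysis import DataAnalysis
# features = DataAnalysis().get_eda("target_column", df, params)

print(sorted(params))
```

With relations_heatmaps listed under plots.skip rather than analyses.skip, the description above implies that the “relations” key should still be present in the returned dict.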
- get_features_info(target: str, data: DataFrame, params: dict = {}) -> dict
Calculate features types and data missing percentage
- Arguments:
- target (string):
name of feature target
- data (Pandas DataFrame):
features to analyze
- params (dict):
with the keys below
- columns_to_filter (list[str]):
list of features names to filter
- Returns:
dict with categorical features list and numerical features list
- get_relations(data: DataFrame, columns_to_use: list = []) -> dict
Calculate relations among features
- Arguments:
- data (Pandas DataFrame):
features to analyze
- columns_to_use (list[str]):
list of features names to use
- Returns:
dictionary of Pandas DataFrames containing the relations
- kendall_corr(x: Series, y: Series)
Kendall (monotone, ordinal)
- mutual_info(x: Series, y: Series)
mutual information (generic, discrete vs discrete)
- pearson_corr(x: Series, y: Series)
Pearson (linear, continuous-continuous)
- phi_coeff(x: Series, y: Series)
Phi coefficient (2 binary variables)
- pointbiserial_corr(x: Series, y: Series)
point-biserial (binary vs continuous)
- spearman_corr(x: Series, y: Series)
Spearman (monotonic, continuous-ordinal)
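The difference between the linear and monotonic measures listed above can be seen with a small pure-Python sketch. The helper names pearson, average_ranks and spearman and the sample data are hypothetical; they follow the textbook definitions and are not the package's own implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of a linear relationship."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example data.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]   # y = x**3: monotonic but not linear
flag = [0, 0, 0, 1, 1]    # binary variable

print(spearman(x, y))     # the ranks agree perfectly
print(pearson(x, y))      # lower, because the relationship is not linear
print(pearson(flag, y))   # point-biserial: Pearson with a 0/1 coded variable
```

The point-biserial coefficient is numerically the Pearson correlation computed with the binary variable coded as 0 and 1, which is why the last call reuses pearson.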