instaeda package
Submodules
instaeda.instaeda module
-
instaeda.instaeda.divide_and_fill(dataframe, cols=None, missing_values=nan, strategy='mean', fill_value=None, random=False, parts=1, verbose=0)
Takes a dataframe, subsets the selected columns, divides the rows into parts for imputation of missing values, and returns a data frame.
- Parameters
dataframe (pd.DataFrame) – Dataframe from which to take columns and check for missing values.
cols (list, optional) – List of columns to perform imputation on. By default, None (perform on all numeric columns).
missing_values (int, float, str, np.nan or None) – The placeholder for the missing values. All occurrences of missing_values will be imputed.
strategy (string, optional) – Imputation strategy, one of: {‘mean’, ‘median’, ‘constant’, ‘most_frequent’}. By default, ‘mean’.
fill_value (string or numerical value, optional) – When strategy == ‘constant’, fill_value is used to replace all occurrences of missing_values. If left at the default, fill_value will be 0 when filling numerical data and ‘missing’ for string or object data types.
random (boolean, optional) – When random == True, shuffles data frame before filling. By default, False.
parts (integer, optional) – The number of parts to divide rows of data frame into. By default, 1.
verbose (integer, optional) – Controls the verbosity of the divide and fill. By default, 0.
- Returns
dataframe – Data frame obtained after divide and fill on the corresponding columns.
- Return type
pandas.DataFrame object
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from instaeda import divide_and_fill
>>> example_df = pd.DataFrame({'animal': ['falcon', 'dog', 'spider', 'fish'],
...                            'num_legs': [2, 4, 8, np.nan],
...                            'num_wings': [2, np.nan, 0, 0],
...                            'num_specimen_seen': [10, 2, np.nan, np.nan]})
>>> divide_and_fill(example_df)
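To make the "divide, then fill" idea concrete, here is a minimal sketch using only pandas and NumPy. It is not the package's implementation: `divide_and_fill_sketch` is a hypothetical helper that covers only the 'mean' strategy and assumes that each part is imputed with statistics computed from that part alone.

```python
import numpy as np
import pandas as pd


def divide_and_fill_sketch(dataframe, cols=None, parts=1):
    """Hypothetical helper: mean-impute numeric columns part by part."""
    if cols is None:
        # Default to all numeric columns, as the `cols` description suggests.
        cols = dataframe.select_dtypes(include="number").columns.tolist()
    pieces = []
    # Split the row positions into `parts` nearly equal, contiguous chunks.
    for positions in np.array_split(np.arange(len(dataframe)), parts):
        chunk = dataframe.iloc[positions].copy()
        # Fill each column's gaps with that column's mean within this chunk.
        chunk[cols] = chunk[cols].fillna(chunk[cols].mean())
        pieces.append(chunk)
    return pd.concat(pieces)


example_df = pd.DataFrame({
    "animal": ["falcon", "dog", "spider", "fish"],
    "num_legs": [2, 4, 8, np.nan],
    "num_wings": [2, np.nan, 0, 0],
    "num_specimen_seen": [10, 2, np.nan, np.nan],
})
result = divide_and_fill_sketch(example_df, parts=1)
```

Under this assumption, a part in which a column is entirely missing has no mean to fill with, so those cells stay missing. That is one reason the choice of `parts` can matter.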
-
instaeda.instaeda.plot_basic_distributions(df, cols=None, include=None, vega_theme='ggplot2')
Takes a dataframe and generates a plot for each column based on the column's data type.
- Parameters
df (pd.DataFrame) – Dataframe from which to generate a plot for each column.
cols (list, optional) – List of columns to generate plots for. By default, None (builds charts for all columns).
include (string, optional) – Select the data types to include. Supported values are None, “string” and “number”. By default, None (plots both string and number columns).
vega_theme (string, optional) – The Vega theme to apply to the Altair plots. The options are: excel, ggplot2, quartz, vox, fivethirtyeight, dark, latimes, urbaninstitute, and googlecharts. By default, ggplot2.
- Returns
dict_plots – dictionary of generated altair.Chart objects with the column name as the key
- Return type
dict of altair.Chart objects using the column name as the key
Examples
>>> import pandas as pd
>>> from instaeda.instaeda import plot_basic_distributions
>>> example_df = pd.DataFrame({'animal': ['falcon', 'dog', 'spider', 'fish'],
...                            'num_legs': [2, 4, 8, 0],
...                            'num_wings': [2, 0, 0, 0],
...                            'num_specimen_seen': [10, 2, 1, 8]})
>>> plot_basic_distributions(example_df)
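The `cols` and `include` parameters describe a column-selection step that happens before any plotting. The sketch below shows one plausible reading of that step using pandas only. `split_columns_by_type` is a hypothetical helper, not part of instaeda, and it assumes that every non-numeric column counts as "string".

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def split_columns_by_type(df, cols=None, include=None):
    """Hypothetical helper: map each selected column to 'number' or 'string'."""
    if cols is None:
        cols = list(df.columns)
    # Classify every selected column by its dtype.
    kinds = {
        col: "number" if is_numeric_dtype(df[col]) else "string"
        for col in cols
    }
    if include is None:
        # No filter: keep both kinds.
        return kinds
    # Keep only the columns whose kind matches the requested one.
    return {col: kind for col, kind in kinds.items() if kind == include}


example_df = pd.DataFrame({
    "animal": ["falcon", "dog", "spider", "fish"],
    "num_legs": [2, 4, 8, 0],
    "num_wings": [2, 0, 0, 0],
    "num_specimen_seen": [10, 2, 1, 8],
})
all_kinds = split_columns_by_type(example_df)
only_numbers = split_columns_by_type(example_df, include="number")
```

A plotting function could then pick a chart type per kind and return the charts in a dictionary keyed by column name, matching the documented return type.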
-
instaeda.instaeda.plot_corr(df, cols=None, method='pearson', colour_palette='purpleorange')
Takes a dataframe, subsets numeric columns and returns a correlation plot object.
- Parameters
df (pd.DataFrame) – Dataframe from which to take columns, then calculate and plot the correlation between them.
cols (list, optional) – List of columns to perform correlation on. By default, None (perform on all numeric columns).
method (string, optional) – Correlation calculation method, one of: {‘pearson’, ‘kendall’, ‘spearman’}. By default, ‘pearson’.
colour_palette (string, optional) – One of the colour schemes accepted by Altair. By default, ‘purpleorange’.
- Returns
plot – Correlation plot object displaying column names and corresponding correlation values.
- Return type
altair.Chart object
Examples
>>> import pandas as pd
>>> from instaeda.instaeda import plot_corr
>>> example_df = pd.DataFrame({'animal': ['falcon', 'dog', 'spider', 'fish'],
...                            'num_legs': [2, 4, 8, 0],
...                            'num_wings': [2, 0, 0, 0],
...                            'num_specimen_seen': [10, 2, 1, 8]})
>>> plot_corr(example_df)
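The values that a correlation plot displays can be reproduced with pandas alone. The sketch below is an illustration, not the package's code. It computes the correlation matrix over the numeric columns and reshapes it to long form, a layout that heatmap-style charts commonly consume. The column names `var1`, `var2` and `corr` are assumptions chosen for this example.

```python
import pandas as pd

example_df = pd.DataFrame({
    "animal": ["falcon", "dog", "spider", "fish"],
    "num_legs": [2, 4, 8, 0],
    "num_wings": [2, 0, 0, 0],
    "num_specimen_seen": [10, 2, 1, 8],
})

# Keep only numeric columns; the 'animal' column is dropped.
numeric = example_df.select_dtypes(include="number")

# Pairwise correlation; 'spearman' is another method pandas accepts here.
corr_matrix = numeric.corr(method="pearson")

# Long form: one row per (var1, var2) pair with its correlation value.
corr_long = (
    corr_matrix.rename_axis("var1")
    .reset_index()
    .melt(id_vars="var1", var_name="var2", value_name="corr")
)
```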
-
instaeda.instaeda.plot_intro(df, plot_title='', theme_config='Dimension')
Takes a dataframe and plot configurations and returns an Altair chart of summary metrics.
- Parameters
df (pd.DataFrame) – Dataframe to summarise. Columns are not limited to numerical types.
plot_title (string, optional) – User-specified plot title. By default, the title shows the memory usage.
theme_config (list, optional) – A list of color configurations to be passed to the theme. By default, ‘Dimension’.
- Returns
plot – An altair plot object displaying summary metrics including the memory usage and the basic description of the input data.
- Return type
altair.Chart object
Examples
>>> import pandas as pd
>>> from instaeda.instaeda import plot_intro
>>> example_df = pd.DataFrame({'animal': ['falcon', 'dog', 'spider', 'fish'],
...                            'num_legs': [2, 4, 8, 0],
...                            'num_wings': [2, 0, 0, 0],
...                            'num_specimen_seen': [10, 2, 1, 8]})
>>> plot_intro(example_df)
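The kind of summary metrics such a plot could display can be gathered with pandas. The exact metrics shown by `plot_intro` are not listed here, so the set below is an assumption, and `intro_metrics` is a hypothetical helper rather than part of the package.

```python
import pandas as pd


def intro_metrics(df):
    """Hypothetical helper: collect basic summary metrics for a dataframe."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        # Number of columns with a numeric dtype.
        "numeric_columns": int(df.select_dtypes(include="number").shape[1]),
        # Total count of missing cells across all columns.
        "missing_cells": int(df.isna().sum().sum()),
        # Rows that contain no missing values at all.
        "complete_rows": int(df.notna().all(axis=1).sum()),
        # deep=True also counts the memory held by string objects.
        "memory_usage_bytes": int(df.memory_usage(deep=True).sum()),
    }


example_df = pd.DataFrame({
    "animal": ["falcon", "dog", "spider", "fish"],
    "num_legs": [2, 4, 8, 0],
    "num_wings": [2, 0, 0, 0],
    "num_specimen_seen": [10, 2, 1, 8],
})
metrics = intro_metrics(example_df)
```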