tree_lab package
Submodules
tree_lab.Cleaning module
- class tree_lab.Cleaning.TreeDataCleaner(data)
Bases: object
Initializes a TreeDataCleaner object.
- Parameters:
data: pandas.DataFrame
Methods
- del_cols(columns_to_delete): Deletes the specified columns.
- detect_na(): Checks for null values in the dataset, prints the columns with null values (if any), and removes duplicate rows.
- display(): Displays the current state of the data.
- Imputes missing values in the 'EMF' column by filling them with the column mean.
- Modifies the 'Alive' column by replacing 'X' with 1.
- del_cols(columns_to_delete)
Deletes the specified columns from the data.
- Parameters:
columns_to_delete: a list of the names of the columns to delete.
Example: cleaner.del_cols(['Age', 'Height'])
- detect_na()
Checks for null values in the dataset, prints columns with null values (if any), and removes duplicate rows in the dataset.
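The cleaning steps described for this class (null detection, duplicate removal, mean imputation) can be sketched in plain Python on a list of row dictionaries. This is an illustration under assumptions, not tree_lab's implementation: the helper names and sample rows are hypothetical, and `None` stands in for a pandas null value.

```python
from statistics import mean

def columns_with_nulls(rows):
    """Return the sorted names of columns holding at least one None."""
    return sorted({key for row in rows for key, value in row.items() if value is None})

def drop_duplicate_rows(rows):
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        # Sort by column name only, so values are never compared with each other.
        fingerprint = tuple(sorted(row.items(), key=lambda item: item[0]))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique

def impute_with_mean(rows, column):
    """Fill None entries of `column` with the mean of the observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = mean(observed)
    return [{**row, column: fill if row[column] is None else row[column]} for row in rows]

# Hypothetical sample data; only the 'EMF' column name comes from the docs.
rows = [
    {"Species": "A", "EMF": 10.0},
    {"Species": "A", "EMF": 10.0},  # exact duplicate of the first row
    {"Species": "B", "EMF": None},  # missing value
    {"Species": "C", "EMF": 20.0},
]
null_columns = columns_with_nulls(rows)
deduplicated = drop_duplicate_rows(rows)
imputed = impute_with_mean(deduplicated, "EMF")
```

In this sketch the mean is computed after duplicates are removed; whether the class itself applies the steps in that order is not stated in the documentation.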
tree_lab.Visualization module
- tree_lab.Visualization.bar_plot(df, kind)
Generate different types of bar charts based on the specified kind parameter.
- Parameters:
df (pandas DataFrame): Input DataFrame containing relevant data.
kind (str): Type of bar chart to generate. Options: “Species_vs_Status”, “Species_vs_field”, “Light level vs status”.
- Returns:
The plots are displayed using the ‘plot.show()’ method.
- Notes:
For “Species_vs_Status”, the function generates a bar plot showing the count of alive and dead instances for each species.
For “Species_vs_field”, the function creates a stacked bar chart representing the count of each species in different fields.
For “Light level vs status”, a bar plot is generated to display the count of alive and dead instances for each light level category.
The function uses seaborn and matplotlib for visualization.
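The counts behind a chart such as “Species_vs_Status” can be sketched without any plotting library. The snippet below only illustrates the aggregation step; the sample records and the "alive"/"dead" labels are hypothetical, and how bar_plot actually computes its counts is not documented here.

```python
from collections import Counter

# Hypothetical (species, status) observations.
records = [
    ("A", "alive"), ("A", "dead"), ("A", "alive"),
    ("B", "dead"), ("B", "dead"),
]

# Count each (species, status) pair, then arrange the counts per species.
pair_counts = Counter(records)
species = sorted({name for name, _ in records})
table = {
    name: {"alive": pair_counts[(name, "alive")], "dead": pair_counts[(name, "dead")]}
    for name in species
}
```

Each inner dictionary corresponds to one group of bars: one bar for the alive count and one for the dead count of that species.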
- tree_lab.Visualization.compute_stats(dataframe, selected_columns)
Computes the mean, standard deviation, minimum, maximum, and median of the specified columns.
- Parameters:
dataframe: the input DataFrame
selected_columns: a list of the columns for which the statistics are computed
- Returns:
a DataFrame containing the statistics for the specified columns
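The five statistics listed above can be sketched with Python's standard library. The function name `describe_columns` and the sample table are hypothetical, and the sketch assumes the sample standard deviation (`statistics.stdev`); the documentation does not say which variant compute_stats uses.

```python
from statistics import mean, median, stdev

def describe_columns(table, selected_columns):
    """Return mean, std, min, max and median for each selected column."""
    stats = {}
    for name in selected_columns:
        values = table[name]
        stats[name] = {
            "mean": mean(values),
            "std": stdev(values),  # assumption: sample standard deviation
            "min": min(values),
            "max": max(values),
            "median": median(values),
        }
    return stats

# Hypothetical data: a mapping from column name to a list of numbers.
table = {"height": [2.0, 4.0, 6.0], "age": [1.0, 1.0, 4.0]}
result = describe_columns(table, ["height"])
```

Only the columns named in the second argument appear in the result, mirroring the role of selected_columns.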
- tree_lab.Visualization.scatter_plot(df, column_x, column_y, hue_column, title)
Creates a scatter plot of the specified columns of the DataFrame.
- Parameters:
df: a DataFrame
column_x: a string naming a numerical variable of df
column_y: a string naming a numerical variable of df
hue_column: a categorical variable used to colour the data points
title: a string specifying the title of the plot
- Returns:
The plots are displayed using the ‘plot.show()’ method.
- tree_lab.Visualization.summarize(df, col, kind='Frequency and Relative frequency', dec=2)
Summarizes the selected columns of the DataFrame by showing the frequency and/or relative frequency of each category.
- Parameters:
df: a pandas DataFrame
col: the columns of the DataFrame to summarize
kind: a string specifying whether frequencies, relative frequencies, or both are displayed. Options: “Frequency and Relative frequency” (default), “Frequency”, “Relative frequency”.
- Returns:
the frequency tables for the selected columns
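A frequency table of this kind can be sketched with `collections.Counter`. The function `frequency_table` is hypothetical and operates on a single list of values rather than a DataFrame. The `dec` parameter is not described in the documentation; the sketch assumes it is the number of decimal places used to round relative frequencies.

```python
from collections import Counter

def frequency_table(values, kind="Frequency and Relative frequency", dec=2):
    """Count categories and, depending on `kind`, add relative frequencies."""
    show_absolute = kind in ("Frequency", "Frequency and Relative frequency")
    show_relative = kind in ("Relative frequency", "Frequency and Relative frequency")
    if not (show_absolute or show_relative):
        raise ValueError(f"unknown kind: {kind!r}")

    counts = Counter(values)
    total = len(values)
    table = {}
    for category, count in counts.items():
        row = {}
        if show_absolute:
            row["Frequency"] = count
        if show_relative:
            # Assumption: `dec` controls the rounding of relative frequencies.
            row["Relative frequency"] = round(count / total, dec)
        table[category] = row
    return table

# Hypothetical categorical column.
both = frequency_table(["A", "A", "B", "A"])
only_counts = frequency_table(["A", "A", "B", "A"], kind="Frequency")
```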
tree_lab.importing module
tree_lab.preprocessing module
- class tree_lab.preprocessing.DataPreprocessor(data)
Bases: object
A class for preprocessing data.
- Parameters:
data: pandas.DataFrame
Methods
- display(): Displays the current state of the data.
- normalize_data(numeric_columns[, scaler_type]): Normalizes the numeric columns of the input data using the specified scaler type.
- onehot_encode(columns[, keep_original]): Performs one-hot encoding on specified columns of the input data.
- normalize_data(numeric_columns, scaler_type='normal')
Normalizes the numeric columns of the input data using the specified scaler type.
- Parameters:
numeric_columns (list): list of column names containing numeric data to be normalized.
scaler_type (str): the type of scaler to be used. Options: ‘normal’ (default), ‘minmax’, ‘max_absolute’.
- Returns:
pandas.DataFrame: the normalized data.
- Raises:
ValueError: if the specified columns are not numeric or contain NA values.
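The three scaler options can be illustrated with the formulas conventionally associated with those names. This is a sketch under assumptions, not the library's code: it assumes 'normal' means z-score standardization with the population standard deviation, 'minmax' rescales to the range [0, 1], and 'max_absolute' divides by the largest absolute value. The function `scale` is hypothetical and works on a plain list.

```python
from statistics import mean, pstdev

def scale(values, scaler_type="normal"):
    """Rescale a list of numbers; reject missing or non-numeric entries."""
    for value in values:
        if value is None or isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("values must be numeric and contain no missing entries")

    if scaler_type == "normal":
        # Assumed meaning: subtract the mean, divide by the population std.
        centre, spread = mean(values), pstdev(values)
        return [(value - centre) / spread for value in values]
    if scaler_type == "minmax":
        # Assumed meaning: map the minimum to 0 and the maximum to 1.
        low, high = min(values), max(values)
        return [(value - low) / (high - low) for value in values]
    if scaler_type == "max_absolute":
        # Assumed meaning: divide by the largest absolute value.
        largest = max(abs(value) for value in values)
        return [value / largest for value in values]
    raise ValueError(f"unknown scaler_type: {scaler_type!r}")

standardized = scale([1.0, 2.0, 3.0])
rescaled = scale([1.0, 2.0, 3.0], "minmax")
by_max_abs = scale([-4.0, 2.0], "max_absolute")
```

The sketch does not guard against constant columns, where the spread or range is zero and the division fails.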
- onehot_encode(columns, keep_original=True)
Performs one-hot encoding on specified columns of the input data.
- Parameters:
columns (list): list of column names containing categorical data to be one-hot encoded.
keep_original (bool): if True, keeps the original columns in addition to the one-hot encoded columns.
- Returns:
pandas.DataFrame: the one-hot encoded data.
- Raises:
ValueError: if the specified columns are not of type ‘object’.
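One-hot encoding and the effect of keep_original can be sketched on a list of row dictionaries. The function `onehot_rows` is hypothetical, the `column_category` naming of the indicator columns is an assumption, and the string check only loosely stands in for the ‘object’ dtype requirement.

```python
def onehot_rows(rows, column, keep_original=True):
    """Add one 0/1 indicator per category found in `column`."""
    if any(not isinstance(row[column], str) for row in rows):
        raise ValueError(f"column {column!r} must contain string categories")

    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Either copy the whole row or copy it without the source column.
        if keep_original:
            new_row = dict(row)
        else:
            new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            # Assumed naming scheme for the indicator columns.
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded

# Hypothetical categorical column.
rows = [{"Light": "Low"}, {"Light": "High"}]
with_original = onehot_rows(rows, "Light")
without_original = onehot_rows(rows, "Light", keep_original=False)
```

With keep_original=True the source column stays next to its indicators, which is convenient for inspection but leaves redundant information in the data.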