tree_lab package

Submodules

tree_lab.Cleaning module

class tree_lab.Cleaning.TreeDataCleaner(data)

Bases: object

Initializes a TreeDataCleaner object.

Parameters:
  • data: pandas.DataFrame

Methods

del_cols(columns_to_delete)

Deletes the specified columns from the data.

detect_na()

Checks for null values in the dataset, prints columns with null values (if any), and removes duplicate rows in the dataset.

display()

Displays the current state of the data.

impute_na()

Imputes missing values in the 'EMF' column by filling them with the mean of the column.

modify_status()

Modifies the 'Alive' column by replacing 'X' with 1.

del_cols(columns_to_delete)

Deletes the specified columns from the data.

Parameters:
  • columns_to_delete: a list of column names to be deleted.

Example: cleaner.del_cols(['Age', 'Height'])
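As a rough illustration of the effect described above, the same column deletion can be sketched in plain pandas. The DataFrame and its column names are hypothetical, and this is not necessarily how del_cols is implemented.

```python
import pandas as pd

# Hypothetical data; the column names are for illustration only.
df = pd.DataFrame(
    {"Species": ["Acer", "Quercus"], "Age": [3, 5], "Height": [1.2, 2.4]}
)

columns_to_delete = ["Age", "Height"]

# Plain-pandas equivalent of deleting the listed columns.
trimmed = df.drop(columns=columns_to_delete)
print(list(trimmed.columns))
```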

detect_na()

Checks for null values in the dataset, prints columns with null values (if any), and removes duplicate rows in the dataset.
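The two steps described here (reporting columns with nulls, then dropping duplicate rows) might look like the following in plain pandas. This is a minimal sketch on made-up data, assuming standard pandas calls rather than the package's actual code.

```python
import pandas as pd

# Hypothetical data with one missing value and one duplicated row.
df = pd.DataFrame(
    {
        "Species": ["Acer", "Acer", "Quercus"],
        "EMF": [10.0, 10.0, None],
    }
)

# Count nulls per column and keep only the columns that have any.
null_counts = df.isnull().sum()
cols_with_na = null_counts[null_counts > 0]
if not cols_with_na.empty:
    print(cols_with_na)

# Remove exact duplicate rows.
deduplicated = df.drop_duplicates()
```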

display()

Displays the current state of the data.

Returns:

pandas.DataFrame

impute_na()

Imputes missing values in the ‘EMF’ column by filling them with the mean of the column. Fills any remaining null values in the dataset with 0.
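The imputation order matters: 'EMF' is filled with its mean first, and only then are the remaining nulls set to 0. A minimal pandas sketch of that order follows; the 'Harvest' column is a hypothetical stand-in for any other column with missing values, and the package's own code may differ.

```python
import pandas as pd

# Hypothetical data; 'Harvest' stands in for any other column with nulls.
df = pd.DataFrame({"EMF": [10.0, None, 30.0], "Harvest": [1.0, None, 3.0]})

imputed = df.copy()
# Step 1: fill missing 'EMF' values with the column mean (nulls are skipped).
imputed["EMF"] = imputed["EMF"].fillna(imputed["EMF"].mean())
# Step 2: fill every remaining null with 0.
imputed = imputed.fillna(0)
```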

modify_status()

Modifies the ‘Alive’ column by replacing ‘X’ with 1. Renames the ‘Event’ column to ‘Dead’.

Returns:

pandas.DataFrame: The modified DataFrame.
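A plain-pandas sketch of the two changes described above is given here. The sample values are hypothetical (in particular, 0 for a non-alive entry is an assumption), and this is not necessarily the package's implementation.

```python
import pandas as pd

# Hypothetical data: 'X' marks an alive tree; 0 is an assumed placeholder.
df = pd.DataFrame({"Alive": ["X", 0, "X"], "Event": [0, 1, 0]})

modified = df.copy()
# Replace the 'X' marker with 1 in the 'Alive' column.
modified["Alive"] = modified["Alive"].replace("X", 1)
# Rename 'Event' to 'Dead'.
modified = modified.rename(columns={"Event": "Dead"})
```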

tree_lab.Visualization module

tree_lab.Visualization.bar_plot(df, kind)

Generate different types of bar charts based on the specified kind parameter.

Parameters:
  • df (pandas DataFrame): Input DataFrame containing relevant data.

  • kind (str): Type of bar chart to generate. Options: “Species_vs_Status”, “Species_vs_field”, “Light level vs status”.

Returns:

The plots are displayed using the ‘plot.show()’ method.

Notes:
  • For “Species_vs_Status”, the function generates a bar plot showing the count of alive and dead instances for each species.

  • For “Species_vs_field”, the function creates a stacked bar chart representing the count of each species in different fields.

  • For “Light level vs status”, a bar plot is generated to display the count of alive and dead instances for each light level category.

The function uses seaborn and matplotlib for visualization.
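The counts behind a chart such as “Species_vs_Status” can be sketched with a cross-tabulation. The column names 'Species' and 'Status' and the sample rows are assumptions made for this example; the columns bar_plot actually reads are not stated here.

```python
import pandas as pd

# Hypothetical data; the real column names used by bar_plot may differ.
df = pd.DataFrame(
    {
        "Species": ["Acer", "Acer", "Quercus", "Acer"],
        "Status": ["Alive", "Dead", "Alive", "Alive"],
    }
)

# Rows are species, columns are statuses, cells are instance counts.
counts = pd.crosstab(df["Species"], df["Status"])
print(counts)
```

A table of this shape is what a grouped or stacked bar chart would typically be drawn from.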

tree_lab.Visualization.compute_stats(dataframe, selected_columns)

Computes the mean, standard deviation, minimum, maximum and median for the specified columns.

Parameters:
  • dataframe: the dataframe

  • selected_columns: list containing the columns for which we wish to have the statistics

Returns:

a dataframe containing the statistics of the columns specified in input
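For orientation, the five statistics listed above can be produced in plain pandas with a single aggregation. This is a sketch on hypothetical data; the layout of the DataFrame returned by compute_stats may differ.

```python
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame(
    {"Height": [1.0, 2.0, 3.0, 6.0], "EMF": [10.0, 20.0, 30.0, 40.0]}
)
selected_columns = ["Height", "EMF"]

# One row per statistic, one column per selected column.
stats = df[selected_columns].agg(["mean", "std", "min", "max", "median"])
print(stats)
```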

tree_lab.Visualization.scatter_plot(df, column_x, column_y, hue_column, title)

Creates a scatter plot for the specified columns of the DataFrame.

Parameters:
  • df: a dataframe

  • column_x: a string specifying the name of a numerical variable of df

  • column_y: a string specifying the name of a numerical variable of df

  • hue_column: a string specifying the name of a categorical variable of df, used to colour the data points by category

  • title: a string specifying the title of the plot

Returns:

The plots are displayed using the ‘plot.show()’ method.

tree_lab.Visualization.summarize(df, col, kind='Frequency and Relative frequency', dec=2)

Summarizes the selected columns of the dataframe by showing the frequency and/or relative frequency of their categories.

Parameters:
  • df: a pandas dataframe

  • col: the columns of the dataframe that the user wants to summarize

  • kind: a string specifying whether the frequencies, the relative frequencies, or both should be displayed. Options: “Frequency and Relative frequency” (default), “Frequency”, “Relative frequency”.

Returns:

the frequency tables for the selected columns
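A frequency table of this kind can be sketched in plain pandas as follows. The data is hypothetical, and treating dec as the number of decimals used to round the relative frequencies is an assumption, since the parameter is not described above.

```python
import pandas as pd

# Hypothetical categorical data.
df = pd.DataFrame({"Species": ["Acer", "Acer", "Quercus"]})
dec = 2  # assumed meaning: decimals for the relative frequencies

# Absolute counts and proportions per category.
freq = df["Species"].value_counts()
rel = df["Species"].value_counts(normalize=True).round(dec)

# Combine both into one table, aligned on the category labels.
table = pd.DataFrame({"Frequency": freq, "Relative frequency": rel})
print(table)
```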

tree_lab.importing module

tree_lab.importing.import_data()

tree_lab.preprocessing module

class tree_lab.preprocessing.DataPreprocessor(data)

Bases: object

A class for preprocessing data.

Parameters:
  • data: pandas.DataFrame

Methods

display()

Displays the current state of the data.

normalize_data(numeric_columns[, scaler_type])

Normalizes the numeric columns of the input data using the specified scaler type.

onehot_encode(columns[, keep_original])

Performs one-hot encoding on specified columns of the input data.

display()

Displays the current state of the data.

Returns:

pandas.DataFrame

normalize_data(numeric_columns, scaler_type='normal')

Normalizes the numeric columns of the input data using the specified scaler type.

Parameters:
  • numeric_columns (list): list of column names containing numeric data to be normalized.

  • scaler_type (str): the type of scaler to be used. Options: ‘normal’ (default), ‘minmax’, ‘max_absolute’.

Returns:

pandas.DataFrame: the normalized data.

Raises: ValueError, if the specified columns are not numeric or contain NA values.
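To make the three scaler options concrete, the sketch below computes one common interpretation of each in plain pandas. It assumes that 'normal' means z-score standardization (with the population standard deviation), 'minmax' rescales to the range [0, 1], and 'max_absolute' divides by the largest absolute value; the package may define them differently. The data is hypothetical.

```python
import pandas as pd

# Hypothetical numeric data without missing values.
df = pd.DataFrame({"Height": [1.0, 2.0, 3.0], "Light": [-4.0, 0.0, 2.0]})
numeric_columns = ["Height", "Light"]
x = df[numeric_columns]

# Assumed 'normal': zero mean, unit variance (population std, ddof=0).
standardized = (x - x.mean()) / x.std(ddof=0)

# Assumed 'minmax': rescale each column to the range [0, 1].
minmax = (x - x.min()) / (x.max() - x.min())

# Assumed 'max_absolute': divide by the column's largest absolute value.
max_abs = x / x.abs().max()
```

Each variant is computed column by column, so columns with very different units become comparable.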

onehot_encode(columns, keep_original=True)

Performs one-hot encoding on specified columns of the input data.

Parameters:
  • columns (list): list of column names containing categorical data to be one-hot encoded.

  • keep_original (bool): if True, keeps the original columns in addition to the one-hot encoded columns.

Returns:

pandas.DataFrame: the one-hot encoded data.

Raises: ValueError, if the specified columns are not of type ‘object’.
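The effect of keep_original can be sketched with pandas' get_dummies. The data and the generated column names are illustrative, and the package's own encoder may name or order the new columns differently.

```python
import pandas as pd

# Hypothetical data with one categorical column.
df = pd.DataFrame(
    {"Species": ["Acer", "Quercus", "Acer"], "Height": [1.0, 2.0, 3.0]}
)
columns = ["Species"]

# One indicator column per category, prefixed with the source column name.
dummies = pd.get_dummies(df[columns], columns=columns)

# keep_original=True: the original column stays next to the indicators.
encoded_keep = pd.concat([df, dummies], axis=1)

# keep_original=False: the original column is dropped.
encoded_drop = pd.concat([df.drop(columns=columns), dummies], axis=1)
```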