Skip to content

Preprocessing

File: src/preprocess.py

The DataPreprocessor class handles all feature engineering and data cleaning steps before PSM and model training. It operates in-place on an internal DataFrame and exposes each transformation as a separate method.


DataPreprocessor

from src.preprocess import DataPreprocessor

pre = DataPreprocessor(dataframe=df)
Parameter Type Description
dataframe pd.DataFrame Input dataset to preprocess

The preprocessor stores the DataFrame as self.dataframe. All methods modify it in place.


drop_columns_and_return

pre.drop_columns_and_return(columns_to_drop)

Removes specified columns from the DataFrame. Silently skips columns that don't exist.

Parameter Type Description
columns_to_drop list[str] Column names to remove

convert_yes_no_to_binary

pre.convert_yes_no_to_binary()

Finds all columns containing only "Yes" / "No" values (plus NaN) and converts them to 1 / 0 integers. Useful for encoding survey-style columns automatically.


transform_nan_to_zero_for_binary_columns

pre.transform_nan_to_zero_for_binary_columns()

Fills NaN values with 0 in binary columns (columns containing only 0, 1, and NaN). Treats missingness as absence of the event.


Usage pattern

Methods are typically chained in sequence during pipeline setup:

pre = DataPreprocessor(df)
pre.convert_yes_no_to_binary()
pre.transform_nan_to_zero_for_binary_columns()
pre.drop_columns_and_return(["pain_when", "is_smoker", "per_day"])

cleaned_df = pre.dataframe