Python Modules
train_model.py
This module defines classes for training models on various types of outcomes (binary and count-based) using logistic regression and negative binomial regression. These classes are integrated into the pipeline for training and evaluation.
OutcomeModel
- class OutcomeModel(data, target_column, seed=None)
Base class for modeling an outcome from a dataset.
- Parameters:
data (pandas.DataFrame) – The input dataset.
target_column (str) – The name of the column to predict.
seed (int, optional) – Random seed for reproducibility.
- train()
Placeholder method to train a model.
- evaluate()
Placeholder method to evaluate a model.
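The base-class pattern above can be sketched as follows. This is a hypothetical simplification (the real OutcomeModel works with a pandas DataFrame; the MajorityClassModel subclass here exists only to illustrate how subclasses override the placeholder methods):

```python
import random

class OutcomeModel:
    """Base class for modeling an outcome from a dataset."""

    def __init__(self, data, target_column, seed=None):
        self.data = data
        self.target_column = target_column
        self.seed = seed
        if seed is not None:
            random.seed(seed)  # reproducibility hook used by subclasses

    def train(self):
        # Placeholder: concrete subclasses override this.
        raise NotImplementedError

    def evaluate(self):
        # Placeholder: concrete subclasses override this.
        raise NotImplementedError

class MajorityClassModel(OutcomeModel):
    """Trivial illustrative subclass: always predicts the majority class."""

    def train(self):
        values = [row[self.target_column] for row in self.data]
        self.prediction = max(set(values), key=values.count)
        return self.prediction

rows = [{"y": 1}, {"y": 1}, {"y": 0}]
model = MajorityClassModel(rows, "y", seed=42)
majority = model.train()
```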
LogisticModel
- class LogisticModel(data, target_column, Cs=[1.0], seed=None)
Logistic regression model with L1 regularization for feature selection.
- Parameters:
data (pandas.DataFrame) – The input dataset.
target_column (str) – The name of the target column.
Cs (list) – List of inverse regularization strengths.
seed (int, optional) – Random seed.
- feature_selection_and_model_fitting()
Perform feature selection using L1 regularization and fit logistic regression model.
- find_best_threshold()
Determine the optimal classification threshold based on the proportion of positive outcomes.
- train()
Train the logistic regression model.
- evaluateOverallTest()
Evaluate the model on the overall test set.
- evaluate()
Evaluate the logistic model using chosen performance metrics.
- _evaluateOnValidation()
Internal method to evaluate performance on a validation split.
- _countDemographic()
Count demographic group membership for fairness evaluation.
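One plausible reading of find_best_threshold — choosing the cutoff so that the predicted positive rate matches the observed proportion of positive outcomes — can be sketched in plain Python (the function name and quantile rule here are assumptions, not the confirmed implementation):

```python
def find_best_threshold(scores, labels):
    """Pick a score cutoff so the predicted positive rate matches the
    observed positive rate (prevalence matching)."""
    positive_rate = sum(labels) / len(labels)
    ranked = sorted(scores)
    # Threshold at the (1 - positive_rate) quantile of the predicted scores.
    cut_index = int(round((1 - positive_rate) * (len(ranked) - 1)))
    return ranked[cut_index]

scores = [0.1, 0.2, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 1, 0]
threshold = find_best_threshold(scores, labels)
predicted_positive = [s > threshold for s in scores]
```

With two positives among five labels, the cutoff lands so that exactly two scores fall above it.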
NegativeBinomialModel
Negative binomial regression model for count-based outcomes.
CoxProportionalHazard
- class CoxProportionalHazard(data, target_column, seed=None)
Cox Proportional Hazards model for time-to-event (survival) analysis using the lifelines package.
- Parameters:
data (pandas.DataFrame) – Input dataset containing features and event/time columns.
target_column (list of str) – List with two elements: [duration_column, event_column].
seed (int, optional) – Random seed for reproducibility.
- train()
Fit a Cox Proportional Hazards model using lifelines’ CoxPHFitter.
- predict()
Return model predictions (placeholder — not implemented in full).
- _evaluateOnValidation(X, y, id)
Evaluate model on the validation set using Concordance Index.
- selectFeatures()
Use Lasso-based feature selection for survival outcomes.
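The Concordance Index used in _evaluateOnValidation can be sketched directly: it is the fraction of comparable pairs where the model assigns higher risk to the subject with the earlier event. This simplified version skips tied times entirely (lifelines handles ties more carefully):

```python
from itertools import combinations

def concordance_index(event_times, predicted_risks, event_observed):
    """Fraction of comparable pairs ranked correctly by predicted risk."""
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(event_times)), 2):
        # Order the pair so i is the earlier time.
        if event_times[i] > event_times[j]:
            i, j = j, i
        # Comparable only if the earlier subject's event was observed.
        if event_times[i] == event_times[j] or not event_observed[i]:
            continue
        comparable += 1
        if predicted_risks[i] > predicted_risks[j]:
            concordant += 1       # earlier event, higher predicted risk
        elif predicted_risks[i] == predicted_risks[j]:
            concordant += 0.5     # ties in risk count as half
    return concordant / comparable

times = [2, 4, 6, 8]
risks = [0.9, 0.7, 0.5, 0.1]  # higher risk should mean earlier event
events = [1, 1, 1, 1]
cindex = concordance_index(times, risks, events)
```

Perfectly ordered risks give a C-index of 1.0; perfectly inverted risks give 0.0.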
BetaRegression
- class BetaRegression(data, target_column, seed=None)
Beta regression model for modeling outcomes constrained between 0 and 1, using statsmodels.
- Parameters:
data (pandas.DataFrame) – The input dataset with features and the beta-distributed target.
target_column (str) – The name of the outcome column to predict.
seed (int, optional) – Random seed.
- train()
Fit a Beta regression model using statsmodels.othermod.betareg.BetaModel.
- predict()
Return model predictions (placeholder — not implemented in full).
- _evaluateOnValidation(X, y, id)
Evaluate model performance using MSE, MAE, RMSE, Pearson R, and McFadden R².
- selectFeatures()
Perform Lasso-based feature selection for beta regression.
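The validation metrics listed for _evaluateOnValidation can be computed as below (a plain-Python sketch of MSE, MAE, RMSE, and Pearson R; the pipeline presumably uses numpy/scipy, and McFadden R² additionally requires the fitted and null model log-likelihoods, which are omitted here):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, RMSE, and Pearson correlation."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(mse)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in y_true))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in y_pred))
    pearson_r = cov / (sd_t * sd_p)
    return {"MSE": mse, "MAE": mae, "RMSE": rmse, "PearsonR": pearson_r}

metrics = regression_metrics([0.2, 0.4, 0.6], [0.25, 0.45, 0.55])
```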
create_demodf_knn.py
This module provides tools for creating balanced demographic datasets using propensity score matching and data splitting techniques. It supports both Python-based (PsmPy) and R-based (MatchIt) methods for matching.
- holdOutTestData(df, id_column, testCount=100, columnToSplit='RaceEth', majorityValue=1, percentMajority=58, seed=42)
Hold out a fixed, demographically stratified test set from the full dataset.
- Parameters:
df (pandas.DataFrame) – Full dataset.
id_column (str) – Name of the unique identifier column.
testCount (int) – Total number of holdout samples. Default 100.
columnToSplit (str) – Column used to define majority/minority groups. Default 'RaceEth'.
majorityValue (int) – Value in columnToSplit that identifies the majority group. Default 1.
percentMajority (int) – Percentage of holdout that should be majority group (0–100). Default 58.
seed (int) – Random seed for reproducibility. Default 42.
- Returns:
Tuple of (train_df, holdout_df).
- Return type:
Tuple[pandas.DataFrame, pandas.DataFrame]
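A minimal sketch of the stratified holdout described above, assuming the majority/minority column is the only stratification variable (a hypothetical simplification of holdOutTestData):

```python
import pandas as pd

def hold_out_test_data(df, id_column, testCount=100, columnToSplit="RaceEth",
                       majorityValue=1, percentMajority=58, seed=42):
    """Sample a demographically stratified holdout; return (train, holdout)."""
    n_major = round(testCount * percentMajority / 100)
    majority = df[df[columnToSplit] == majorityValue]
    minority = df[df[columnToSplit] != majorityValue]
    holdout = pd.concat([
        majority.sample(n=n_major, random_state=seed),
        minority.sample(n=testCount - n_major, random_state=seed),
    ])
    # Training set is everything not held out, keyed on the id column.
    train = df[~df[id_column].isin(holdout[id_column])]
    return train, holdout

df = pd.DataFrame({"who": range(200), "RaceEth": [1] * 140 + [2] * 60})
train_df, holdout_df = hold_out_test_data(df, "who", testCount=50)
```

With percentMajority=58 and testCount=50, the holdout contains 29 majority-group and 21 minority-group rows.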
- propensityScoreMatch(df, idColumn, columnToSplit='RaceEth', majorityValue=1, columnsToMatch=['age', 'is_female'], sampleSize=500)
Perform propensity score matching to create demographically balanced paired subsets.
- Parameters:
df (pandas.DataFrame) – Full dataset.
idColumn (str) – Name of the unique identifier column.
columnToSplit (str) – Column used to define majority/minority groups.
majorityValue (int) – Value identifying the majority group.
columnsToMatch (list of str) – Covariates used for matching.
sampleSize (int) – Number of treated samples to match against.
- Returns:
List of matched DataFrames (treated, control_0, control_1).
- Return type:
list of pandas.DataFrame
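The matching idea can be sketched with a greedy nearest-neighbor match on estimated propensity scores. This is an assumption about the general technique only — the module's actual implementations delegate to PsmPy or R's MatchIt, and the function below is illustrative:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def propensity_score_match(df, idColumn, columnToSplit, majorityValue, columnsToMatch):
    """Greedy 1:1 propensity match of minority ('treated') to majority rows."""
    X = df[columnsToMatch]
    treated_mask = df[columnToSplit] != majorityValue
    # Propensity = P(minority membership | covariates).
    scores = LogisticRegression().fit(X, treated_mask).predict_proba(X)[:, 1]
    df = df.assign(pscore=scores)
    treated = df[treated_mask]
    controls = df[~treated_mask].copy()
    matches = []
    for _, row in treated.iterrows():
        # Closest propensity score, matched without replacement.
        idx = (controls["pscore"] - row["pscore"]).abs().idxmin()
        matches.append(controls.loc[idx])
        controls = controls.drop(index=idx)
    return treated, pd.DataFrame(matches)

df = pd.DataFrame({
    "who": range(8),
    "RaceEth": [1, 1, 1, 1, 1, 2, 2, 2],
    "age": [30, 31, 50, 51, 70, 30, 50, 70],
    "is_female": [0, 1, 0, 1, 0, 0, 1, 0],
})
treated, matched_controls = propensity_score_match(
    df, "who", "RaceEth", majorityValue=1, columnsToMatch=["age", "is_female"])
```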
- create_subsets(dfs, splits=11, sampleSize=500)
Create a sequence of training subsets at incrementally varying majority/minority ratios.
Each subset shifts the proportion of treated (minority) vs matched control (majority) samples across splits steps, enabling evaluation of model performance across demographic compositions.
- Parameters:
dfs (list of pandas.DataFrame) – List of three DataFrames — [treated, control_0, control_1].
splits (int) – Number of ratio steps to generate. Default 11.
sampleSize (int) – Total sample size per subset. Default 500.
- Returns:
List of DataFrames, one per demographic ratio split.
- Return type:
list of pandas.DataFrame
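The ratio schedule behind create_subsets can be sketched as follows: with the defaults splits=11 and sampleSize=500, the treated share steps from 0% to 100% in 10% increments. The actual sampling from the three matched DataFrames is omitted; only the count schedule is shown:

```python
def subset_ratios(splits=11, sampleSize=500):
    """Return (n_treated, n_control) counts for each ratio step."""
    schedule = []
    for step in range(splits):
        n_treated = round(sampleSize * step / (splits - 1))
        schedule.append((n_treated, sampleSize - n_treated))
    return schedule

schedule = subset_ratios()
```

Each pair sums to sampleSize, so every subset has the same total size while the demographic composition shifts.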
- PropensityScoreMatchPsmPy(df)
Apply Propensity Score Matching using the PsmPy library.
- Parameters:
df (pandas.DataFrame) – Full dataset.
- Returns:
Matched dataset.
- Return type:
pandas.DataFrame
- PropensityScoreMatchRMatchit(df)
Apply Propensity Score Matching using the R MatchIt package via rpy2.
- Parameters:
df (pandas.DataFrame) – Full dataset.
- Returns:
Matched dataset using R’s MatchIt.
- Return type:
pandas.DataFrame
preprocess.py
This module includes a DataPreprocessor class and various helper functions for transforming, cleaning, and preparing clinical and behavioral data for analysis.
DataPreprocessor
- class DataPreprocessor(dataframe)
A class to preprocess pandas DataFrames by handling column drops and validation checks.
- Parameters:
dataframe (pandas.DataFrame) – The pandas DataFrame to preprocess.
- drop_columns_and_return(columns_to_drop)
Drop specified columns from the DataFrame in place. Silently skips any column names that are not present.
- Parameters:
columns_to_drop (list of str) – List of column names to drop.
- convert_yes_no_to_binary()
Convert all columns whose values are exclusively
'Yes'/'No' (and NaN) to binary 1/0. Other columns are left untouched.
- process_tlfb_columns(specified_tlfb_columns)
Aggregate all TLFB columns not in
specified_tlfb_columnsinto a newTLFB_Othercolumn, then drop those unspecified columns.- Parameters:
specified_tlfb_columns (list of str) – TLFB columns to retain individually.
- calculate_behavioral_columns()
Derive
Homosexual_Behavior (based on msm_npt and Sex) and Non_monogamous_Relationships (based on txx_prt) and append them to the DataFrame.
- move_column_to_end(column_names)
Reorder the DataFrame so the specified columns appear last. Ignores any column names not present in the DataFrame.
- Parameters:
column_names (list of str) – Column(s) to move to the end.
- rename_columns()
Apply the hardcoded rename mapping in place:
Sex → is_female, job → unemployed, is_living_stable → unstableliving.
- transform_nan_to_zero_for_binary_columns()
For every column that contains NaN values and has unique non-NaN values of exactly
[0, 1], fill NaN with 0.
- transform_and_rename_column(original_column_name, new_column_name)
Convert a column to binary (
1 where non-null, 0 where null) and rename it in place, preserving its position.
- Parameters:
original_column_name (str) – Name of the column to transform.
new_column_name (str) – Replacement column name.
- fill_nan_with_zero(column_name)
Fill NaN values in the specified column with
0. If the column does not exist the call is a no-op.
- Parameters:
column_name (str) – Name of the column to fill.
- transform_data_with_nan_handling()
Apply categorical-to-numeric mappings for
Sex, education, marital, job, is_living_stable, race, XTRT, RaceEth, and pain. Columns absent from the DataFrame are skipped without error.
- convert_uds_to_binary()
Convert all
UDS-prefixed columns to binary: values > 0 become 1, values == 0 stay 0.
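Two of the methods above can be sketched in straightforward pandas (a hedged, partial re-implementation for illustration; the real class holds more methods and state):

```python
import pandas as pd

class DataPreprocessor:
    def __init__(self, dataframe):
        self.df = dataframe

    def convert_yes_no_to_binary(self):
        """Map columns containing only 'Yes'/'No' (and NaN) to 1/0."""
        for col in self.df.columns:
            values = set(self.df[col].dropna().unique())
            if values and values <= {"Yes", "No"}:
                self.df[col] = self.df[col].map({"Yes": 1, "No": 0})

    def transform_nan_to_zero_for_binary_columns(self):
        """Fill NaN with 0 in columns whose non-NaN values are exactly {0, 1}."""
        for col in self.df.columns:
            if not self.df[col].isna().any():
                continue
            if set(self.df[col].dropna().unique()) == {0, 1}:
                self.df[col] = self.df[col].fillna(0)

pre = DataPreprocessor(pd.DataFrame({
    "smokes": ["Yes", "No", None],
    "age": [30, 40, 50],
}))
pre.convert_yes_no_to_binary()
pre.transform_nan_to_zero_for_binary_columns()
```

After both calls, the 'smokes' column is fully binary with the missing value filled as 0, while 'age' is untouched.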
preprocess_pipeline.py
This module provides a single entry point for preprocessing data within the modeling pipeline.
- preprocess_data(df)
Preprocess a dataset by cleaning, transforming, and formatting features for modeling.
This function performs operations such as:
- Dropping irrelevant or highly sparse columns
- Converting categorical values to binary
- Normalizing behavioral features
- Handling missing values
- Renaming columns for consistency
- Converting drug test results to binary format
- Parameters:
df (pandas.DataFrame) – The raw input DataFrame from the master dataset.
- Returns:
Preprocessed DataFrame ready for modeling.
- Return type:
pandas.DataFrame
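The entry-point shape can be illustrated with a miniature version chaining two of the listed operations. The sparsity threshold and column names below are illustrative assumptions; the real preprocess_data applies the full DataPreprocessor sequence:

```python
import pandas as pd

def preprocess_data(df):
    """Miniature illustrative pipeline: drop sparse columns, binarize UDS."""
    df = df.copy()
    # Drop highly sparse columns (illustrative rule: > 50% missing).
    df = df.loc[:, df.isna().mean() <= 0.5]
    # Convert UDS drug-test results to binary format.
    for col in df.columns:
        if col.startswith("UDS"):
            df[col] = (df[col] > 0).astype(int)
    return df

raw = pd.DataFrame({
    "UDS_opioid": [0, 3, 1],
    "mostly_missing": [None, None, 1.0],
    "age": [25, 35, 45],
})
clean = preprocess_data(raw)
```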
model_training.py
This module provides the primary interface for training and evaluating outcome models in the pipeline. Depending on the selected outcome type (logical, integer, or survival), it dynamically loads the appropriate model class (Logistic Regression, Negative Binomial Regression, Cox Proportional Hazards, or Beta Regression). Each model is trained and evaluated on one or more data subsets and held-out validation data.
- train_and_evaluate_models(merged_subsets, selected_outcome, processed_data_heldout)
Train and evaluate models on each demographic or data subset and return evaluation results.
This function dynamically selects the correct model type based on the endpointType of the selected outcome. It then loops through each data subset, trains the selected model, and evaluates performance on both the subset and a held-out dataset.
- Parameters:
merged_subsets (list of pandas.DataFrame) – A list of DataFrames representing stratified or demographically-split training datasets.
selected_outcome (dict) – A dictionary containing the outcome column name(s) and the type of model to use, with keys columnsToUse (list of str — target variable columns) and endpointType (Enum — one of EndpointType.LOGICAL, EndpointType.SURVIVAL, or EndpointType.INTEGER).
processed_data_heldout (pandas.DataFrame) – The held-out dataset used for validation.
- Returns:
A multi-indexed pandas DataFrame with predictions and evaluation metrics for both the held-out and subset data.
- Return type:
pandas.DataFrame
Note
Logging is extensively used to track training and evaluation progress for each subset. Evaluation metrics vary depending on the model type (e.g., accuracy and ROC for classification, RMSE and McFadden R² for regression).
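The dynamic selection by endpointType can be sketched as a simple dispatch table, assuming EndpointType is a plain Enum (the class stubs below only mirror the documented model names):

```python
from enum import Enum, auto

class EndpointType(Enum):
    LOGICAL = auto()
    INTEGER = auto()
    SURVIVAL = auto()

# Stubs standing in for the documented model classes.
class LogisticModel: ...
class NegativeBinomialModel: ...
class CoxProportionalHazard: ...

MODEL_BY_ENDPOINT = {
    EndpointType.LOGICAL: LogisticModel,
    EndpointType.INTEGER: NegativeBinomialModel,
    EndpointType.SURVIVAL: CoxProportionalHazard,
}

selected_outcome = {"columnsToUse": ["relapse"],
                    "endpointType": EndpointType.LOGICAL}
model_cls = MODEL_BY_ENDPOINT[selected_outcome["endpointType"]]
```

A dict keyed on the enum keeps the selection logic in one place instead of an if/elif chain per model type.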
run_pipelineV2.py
This is the main pipeline orchestrator script for training, evaluating, and profiling statistical and machine learning models across demographic subsets using the CTN-0094 dataset. It supports multiple model types including logistic regression, negative binomial regression, survival analysis (Cox), and beta regression.
The script handles argument parsing, data loading, preprocessing, subset generation, model training, evaluation, and CSV logging of all results.
Functions
- main()
Entry point for the pipeline. Parses arguments, initializes outcome and seed configurations, and runs profiling or standard pipeline execution for each outcome and seed.
- argument_handler()
Parse command-line arguments including seed range, outcome name, output directory, and profiling method.
- Returns:
A tuple of (loop range, outcomes, output directory, profiling flag).
- Return type:
Tuple
- initialize_pipeline(selected_outcome)
Load and merge the demographic and outcome datasets, apply preprocessing, and prepare the data for modeling.
- Parameters:
selected_outcome (dict) – A dictionary defining the outcome variable and endpoint type.
- Returns:
Preprocessed dataset ready for modeling.
- Return type:
pandas.DataFrame
- run_pipeline(processed_data, seed, selected_outcome, directory)
Execute the core pipeline logic for one run: split the data, perform matching, create subsets, train and evaluate models, and write predictions and evaluations to CSV.
- Parameters:
processed_data (pandas.DataFrame) – Cleaned and merged input dataset.
seed (int) – Random seed for reproducibility.
selected_outcome (dict) – Dictionary describing the outcome and model type.
directory (str) – Output path for saving logs and results.
- save_evaluations_to_csv(results, seed, selected_outcome, directory, name)
Save evaluation metrics for all subsets and held-out predictions into a CSV file. Automatically adjusts headers based on model type.
- Parameters:
results (dict) – Dictionary of evaluation results from each subset.
seed (int) – The random seed used for training.
selected_outcome (dict) – The outcome configuration dict.
directory (str) – Directory to save output CSVs.
name (str) – Subfolder name for organizing evaluation files.
- save_predictions_to_csv(data, seed, selected_outcome, directory, name)
Save prediction scores for each individual across subsets and held-out data.
- Parameters:
data (list) – Prediction tuples (id, score) across subsets.
seed (int) – The random seed used for training.
selected_outcome (dict) – The outcome configuration dict.
directory (str) – Output directory for saving results.
name (str) – Folder name under which to store predictions.
Globals
- AVAILABLE_OUTCOMES
A predefined list of outcomes from the CTN-0094 dataset, each with its name, outcome column(s), and associated EndpointType.
Used for automatic selection of outcomes when not specified via command-line arguments.
CLI Usage Example
Run the pipeline for a specific outcome and seed range:
python run_pipelineV2.py --loop 42 45 --outcome Ab_ling_1998 --dir logs/run_test --prof simple
Or run all outcomes with profiling off:
python run_pipelineV2.py
validate.py
This module validates a dataset before passing it to a model, checking that required columns exist, data types are correct, and outcome values match the expected format for the chosen model type.
- validate_dataset_for_model(df, model_type, outcome_col, time_col=None)
Validate a dataset against the requirements of the specified model type.
Accepts either a string or an
EndpointType enum value for model_type. Raises ValueError with a descriptive message if any check fails.
- Parameters:
df (pandas.DataFrame) – The dataset to validate.
model_type (str or EndpointType) – Model type — one of
'logical', 'integer', 'survival', or the corresponding EndpointType enum member.
outcome_col (str) – Name of the outcome column to validate.
time_col (str, optional) – Name of the time column. Required for survival models. Default None.
- Raises:
ValueError – If the outcome column is missing, values are of the wrong type, or required survival columns are absent or non-numeric.
Validation rules by model type:
logical — outcome must contain only 0 and 1.
integer — outcome must have an integer dtype.
survival — requires a numeric time_col and a binary event column.
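The validation rules above can be sketched with pandas dtype checks (a hedged re-implementation for string-valued model_type only; the real function also accepts EndpointType members):

```python
import pandas as pd

def validate_dataset_for_model(df, model_type, outcome_col, time_col=None):
    """Raise ValueError if df fails the checks for the given model type."""
    if outcome_col not in df.columns:
        raise ValueError(f"missing outcome column: {outcome_col}")
    outcome = df[outcome_col]
    if model_type == "logical":
        if not set(outcome.dropna().unique()) <= {0, 1}:
            raise ValueError("logical outcome must contain only 0 and 1")
    elif model_type == "integer":
        if not pd.api.types.is_integer_dtype(outcome):
            raise ValueError("integer outcome must have an integer dtype")
    elif model_type == "survival":
        if time_col is None or time_col not in df.columns:
            raise ValueError("survival models require a time column")
        if not pd.api.types.is_numeric_dtype(df[time_col]):
            raise ValueError("time column must be numeric")
        if not set(outcome.dropna().unique()) <= {0, 1}:
            raise ValueError("event column must be binary")
    else:
        raise ValueError(f"unknown model type: {model_type}")
    return True

df = pd.DataFrame({"relapse": [0, 1, 0], "weeks": [4.0, 2.5, 6.0]})
ok = validate_dataset_for_model(df, "survival", "relapse", time_col="weeks")
```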
constants.py
This module defines the EndpointType enum used throughout the pipeline to identify
which type of statistical model should be applied to a given outcome.