Python Modules
train_model.py
This module defines classes for training models on various types of outcomes (binary and count-based) using logistic regression and negative binomial regression. These classes are integrated into the pipeline for training and evaluation.
OutcomeModel
- class OutcomeModel(data, target_column, seed=None)
Base class for modeling an outcome from a dataset.
- Parameters:
data (pandas.DataFrame) – The input dataset.
target_column (str) – The name of the column to predict.
seed (int, optional) – Random seed for reproducibility.
- train()
Placeholder method to train a model.
- evaluate()
Placeholder method to evaluate a model.
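The base-class pattern above can be sketched as follows. This is a hypothetical simplification (the real OutcomeModel works with a pandas DataFrame; the MajorityClassModel subclass here exists only to illustrate how subclasses override the placeholder methods):

```python
import random

class OutcomeModel:
    """Base class for modeling an outcome from a dataset."""

    def __init__(self, data, target_column, seed=None):
        self.data = data
        self.target_column = target_column
        self.seed = seed
        if seed is not None:
            random.seed(seed)  # reproducibility hook used by subclasses

    def train(self):
        # Placeholder: concrete subclasses override this.
        raise NotImplementedError

    def evaluate(self):
        # Placeholder: concrete subclasses override this.
        raise NotImplementedError

class MajorityClassModel(OutcomeModel):
    """Trivial illustrative subclass: always predicts the majority class."""

    def train(self):
        values = [row[self.target_column] for row in self.data]
        self.prediction = max(set(values), key=values.count)
        return self.prediction

rows = [{"y": 1}, {"y": 1}, {"y": 0}]
model = MajorityClassModel(rows, "y", seed=42)
majority = model.train()
```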
LogisticModel
- class LogisticModel(data, target_column, Cs=[1.0], seed=None)
Logistic regression model with L1 regularization for feature selection.
- Parameters:
data (pandas.DataFrame) – The input dataset.
target_column (str) – The name of the target column.
Cs (list) – List of inverse regularization strengths.
seed (int, optional) – Random seed.
- feature_selection_and_model_fitting()
Perform feature selection using L1 regularization and fit logistic regression model.
- find_best_threshold()
Determine the optimal classification threshold based on the proportion of positive outcomes.
- train()
Train the logistic regression model.
- evaluateOverallTest()
Evaluate the model on the overall test set.
- evaluate()
Evaluate the logistic model using chosen performance metrics.
- _evaluateOnValidation()
Internal method to evaluate performance on a validation split.
- _countDemographic()
Count demographic group membership for fairness evaluation.
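One plausible reading of find_best_threshold — choosing the cutoff so that the predicted positive rate matches the observed proportion of positive outcomes — can be sketched in plain Python (the function name and quantile rule here are assumptions, not the confirmed implementation):

```python
def find_best_threshold(scores, labels):
    """Pick a score cutoff so the predicted positive rate matches the
    observed positive rate (prevalence matching)."""
    positive_rate = sum(labels) / len(labels)
    ranked = sorted(scores)
    # Threshold at the (1 - positive_rate) quantile of the predicted scores.
    cut_index = int(round((1 - positive_rate) * (len(ranked) - 1)))
    return ranked[cut_index]

scores = [0.1, 0.2, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 1, 0]
threshold = find_best_threshold(scores, labels)
predicted_positive = [s > threshold for s in scores]
```

With two positives among five labels, the cutoff lands so that exactly two scores fall above it.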
NegativeBinomialModel
Negative binomial regression model for count-based outcomes.
CoxProportionalHazard
- class CoxProportionalHazard(data, target_column, seed=None)
Cox Proportional Hazards model for time-to-event (survival) analysis using the lifelines package.
- Parameters:
data (pandas.DataFrame) – Input dataset containing features and event/time columns.
target_column (list of str) – List with two elements: [duration_column, event_column].
seed (int, optional) – Random seed for reproducibility.
- train()
Fit a Cox Proportional Hazards model using lifelines’ CoxPHFitter.
- predict()
Return model predictions (placeholder — not implemented in full).
- _evaluateOnValidation(X, y, id)
Evaluate model on the validation set using Concordance Index.
- selectFeatures()
Use Lasso-based feature selection for survival outcomes.
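The Concordance Index used in _evaluateOnValidation can be sketched directly: it is the fraction of comparable pairs where the model assigns higher risk to the subject with the earlier event. This simplified version skips tied times entirely (lifelines handles ties more carefully):

```python
from itertools import combinations

def concordance_index(event_times, predicted_risks, event_observed):
    """Fraction of comparable pairs ranked correctly by predicted risk."""
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(event_times)), 2):
        # Order the pair so i is the earlier time.
        if event_times[i] > event_times[j]:
            i, j = j, i
        # Comparable only if the earlier subject's event was observed.
        if event_times[i] == event_times[j] or not event_observed[i]:
            continue
        comparable += 1
        if predicted_risks[i] > predicted_risks[j]:
            concordant += 1       # earlier event, higher predicted risk
        elif predicted_risks[i] == predicted_risks[j]:
            concordant += 0.5     # ties in risk count as half
    return concordant / comparable

times = [2, 4, 6, 8]
risks = [0.9, 0.7, 0.5, 0.1]  # higher risk should mean earlier event
events = [1, 1, 1, 1]
cindex = concordance_index(times, risks, events)
```

Perfectly ordered risks give a C-index of 1.0; perfectly inverted risks give 0.0.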
BetaRegression
- class BetaRegression(data, target_column, seed=None)
Beta regression model for modeling outcomes constrained between 0 and 1, using statsmodels.
- Parameters:
data (pandas.DataFrame) – The input dataset with features and the beta-distributed target.
target_column (str) – The name of the outcome column to predict.
seed (int, optional) – Random seed.
- train()
Fit a Beta regression model using statsmodels.othermod.betareg.BetaModel.
- predict()
Return model predictions (placeholder — not implemented in full).
- _evaluateOnValidation(X, y, id)
Evaluate model performance using MSE, MAE, RMSE, Pearson R, and McFadden R².
- selectFeatures()
Perform Lasso-based feature selection for beta regression.
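The validation metrics listed for _evaluateOnValidation can be computed as below (a plain-Python sketch of MSE, MAE, RMSE, and Pearson R; the pipeline presumably uses numpy/scipy, and McFadden R² additionally requires the fitted and null model log-likelihoods, which are omitted here):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, RMSE, and Pearson correlation."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(mse)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in y_true))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in y_pred))
    pearson_r = cov / (sd_t * sd_p)
    return {"MSE": mse, "MAE": mae, "RMSE": rmse, "PearsonR": pearson_r}

metrics = regression_metrics([0.2, 0.4, 0.6], [0.25, 0.45, 0.55])
```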
create_demodf_knn.py
This module provides tools for creating balanced demographic datasets using propensity score matching and data splitting techniques. It supports both Python-based (PsmPy) and R-based (MatchIt) methods for matching.
- holdOutTestData(df, id_column, testCount=100, columnToSplit='RaceEth', majorityValue=1, percentMajority=58, seed=42)
Hold out a fixed, demographically stratified test set from the full dataset.
- Parameters:
df (pandas.DataFrame) – Full dataset.
id_column (str) – Name of the unique identifier column.
testCount (int) – Total number of holdout samples. Default 100.
columnToSplit (str) – Column used to define majority/minority groups. Default 'RaceEth'.
majorityValue (int) – Value in columnToSplit that identifies the majority group. Default 1.
percentMajority (int) – Percentage of holdout that should be majority group (0–100). Default 58.
seed (int) – Random seed for reproducibility. Default 42.
- Returns:
Tuple of (train_df, holdout_df).
- Return type:
Tuple[pandas.DataFrame, pandas.DataFrame]
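A minimal sketch of the stratified holdout described above, assuming the majority/minority column is the only stratification variable (a hypothetical simplification of holdOutTestData):

```python
import pandas as pd

def hold_out_test_data(df, id_column, testCount=100, columnToSplit="RaceEth",
                       majorityValue=1, percentMajority=58, seed=42):
    """Sample a demographically stratified holdout; return (train, holdout)."""
    n_major = round(testCount * percentMajority / 100)
    majority = df[df[columnToSplit] == majorityValue]
    minority = df[df[columnToSplit] != majorityValue]
    holdout = pd.concat([
        majority.sample(n=n_major, random_state=seed),
        minority.sample(n=testCount - n_major, random_state=seed),
    ])
    # Training set is everything not held out, keyed on the id column.
    train = df[~df[id_column].isin(holdout[id_column])]
    return train, holdout

df = pd.DataFrame({"who": range(200), "RaceEth": [1] * 140 + [2] * 60})
train_df, holdout_df = hold_out_test_data(df, "who", testCount=50)
```

With percentMajority=58 and testCount=50, the holdout contains 29 majority-group and 21 minority-group rows.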
- propensityScoreMatch(df, idColumn, columnToSplit='RaceEth', majorityValue=1, columnsToMatch=['age', 'is_female'], sampleSize=500)
Perform propensity score matching to create demographically balanced paired subsets.
- Parameters:
df (pandas.DataFrame) – Full dataset.
idColumn (str) – Name of the unique identifier column.
columnToSplit (str) – Column used to define majority/minority groups.
majorityValue (int) – Value identifying the majority group.
columnsToMatch (list of str) – Covariates used for matching.
sampleSize (int) – Number of treated samples to match against.
- Returns:
List of matched DataFrames (treated, control_0, control_1).
- Return type:
list of pandas.DataFrame
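The matching idea can be sketched with a greedy nearest-neighbor match on estimated propensity scores. This is an assumption about the general technique only — the module's actual implementations delegate to PsmPy or R's MatchIt, and the function below is illustrative:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def propensity_score_match(df, idColumn, columnToSplit, majorityValue, columnsToMatch):
    """Greedy 1:1 propensity match of minority ('treated') to majority rows."""
    X = df[columnsToMatch]
    treated_mask = df[columnToSplit] != majorityValue
    # Propensity = P(minority membership | covariates).
    scores = LogisticRegression().fit(X, treated_mask).predict_proba(X)[:, 1]
    df = df.assign(pscore=scores)
    treated = df[treated_mask]
    controls = df[~treated_mask].copy()
    matches = []
    for _, row in treated.iterrows():
        # Closest propensity score, matched without replacement.
        idx = (controls["pscore"] - row["pscore"]).abs().idxmin()
        matches.append(controls.loc[idx])
        controls = controls.drop(index=idx)
    return treated, pd.DataFrame(matches)

df = pd.DataFrame({
    "who": range(8),
    "RaceEth": [1, 1, 1, 1, 1, 2, 2, 2],
    "age": [30, 31, 50, 51, 70, 30, 50, 70],
    "is_female": [0, 1, 0, 1, 0, 0, 1, 0],
})
treated, matched_controls = propensity_score_match(
    df, "who", "RaceEth", majorityValue=1, columnsToMatch=["age", "is_female"])
```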
- create_subsets(dfs, splits=11, sampleSize=500)
Create a sequence of training subsets at incrementally varying majority/minority ratios.
Each subset shifts the proportion of treated (minority) vs matched control (majority) samples across splits steps, enabling evaluation of model performance across demographic compositions.
- Parameters:
dfs (list of pandas.DataFrame) – List of three DataFrames — [treated, control_0, control_1].
splits (int) – Number of ratio steps to generate. Default 11.
sampleSize (int) – Total sample size per subset. Default 500.
- Returns:
List of DataFrames, one per demographic ratio split.
- Return type:
list of pandas.DataFrame
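The ratio schedule behind create_subsets can be sketched as follows: with the defaults splits=11 and sampleSize=500, the treated share steps from 0% to 100% in 10% increments. The actual sampling from the three matched DataFrames is omitted; only the count schedule is shown:

```python
def subset_ratios(splits=11, sampleSize=500):
    """Return (n_treated, n_control) counts for each ratio step."""
    schedule = []
    for step in range(splits):
        n_treated = round(sampleSize * step / (splits - 1))
        schedule.append((n_treated, sampleSize - n_treated))
    return schedule

schedule = subset_ratios()
```

Each pair sums to sampleSize, so every subset has the same total size while the demographic composition shifts.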
- PropensityScoreMatchPsmPy(df)
Apply Propensity Score Matching using the PsmPy library.
- Parameters:
df (pandas.DataFrame) – Full dataset.
- Returns:
Matched dataset.
- Return type:
pandas.DataFrame
- PropensityScoreMatchRMatchit(df)
Apply Propensity Score Matching using the R MatchIt package via rpy2.
- Parameters:
df (pandas.DataFrame) – Full dataset.
- Returns:
Matched dataset using R’s MatchIt.
- Return type:
pandas.DataFrame
preprocess.py
This module includes a DataPreprocessor class and various helper functions for transforming, cleaning, and preparing clinical and behavioral data for analysis.
DataPreprocessor
- class DataPreprocessor(dataframe)
A class to preprocess pandas DataFrames by handling column drops and validation checks.
- Parameters:
dataframe (pandas.DataFrame) – The pandas DataFrame to preprocess.
- drop_columns_and_return(columns_to_drop)
Drop specified columns from the DataFrame in place. Silently skips any column names that are not present.
- Parameters:
columns_to_drop (list of str) – List of column names to drop.
- convert_yes_no_to_binary()
Convert all columns whose values are exclusively
'Yes'/'No' (and NaN) to binary 1/0. Other columns are left untouched.
- process_tlfb_columns(specified_tlfb_columns)
Aggregate all TLFB columns not in
specified_tlfb_columnsinto a newTLFB_Othercolumn, then drop those unspecified columns.- Parameters:
specified_tlfb_columns (list of str) – TLFB columns to retain individually.
- calculate_behavioral_columns()
Derive
Homosexual_Behavior (based on msm_npt and Sex) and Non_monogamous_Relationships (based on txx_prt) and append them to the DataFrame.
- move_column_to_end(column_names)
Reorder the DataFrame so the specified columns appear last. Ignores any column names not present in the DataFrame.
- Parameters:
column_names (list of str) – Column(s) to move to the end.
- rename_columns()
Apply the hardcoded rename mapping in place:
Sex → is_female, job → unemployed, is_living_stable → unstableliving.
- transform_nan_to_zero_for_binary_columns()
For every column that contains NaN values and has unique non-NaN values of exactly
[0, 1], fill NaN with 0.
- transform_and_rename_column(original_column_name, new_column_name)
Convert a column to binary (
1 where non-null, 0 where null) and rename it in place, preserving its position.
- Parameters:
original_column_name (str) – Name of the column to transform.
new_column_name (str) – Replacement column name.
- fill_nan_with_zero(column_name)
Fill NaN values in the specified column with
0. If the column does not exist the call is a no-op.
- Parameters:
column_name (str) – Name of the column to fill.
- transform_data_with_nan_handling()
Apply categorical-to-numeric mappings for
Sex, education, marital, job, is_living_stable, race, XTRT, RaceEth, and pain. Columns absent from the DataFrame are skipped without error.
- convert_uds_to_binary()
Convert all
UDS-prefixed columns to binary: values > 0 become 1, values == 0 stay 0.
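Two of the methods above can be sketched in straightforward pandas (a hedged, partial re-implementation for illustration; the real class holds more methods and state):

```python
import pandas as pd

class DataPreprocessor:
    def __init__(self, dataframe):
        self.df = dataframe

    def convert_yes_no_to_binary(self):
        """Map columns containing only 'Yes'/'No' (and NaN) to 1/0."""
        for col in self.df.columns:
            values = set(self.df[col].dropna().unique())
            if values and values <= {"Yes", "No"}:
                self.df[col] = self.df[col].map({"Yes": 1, "No": 0})

    def transform_nan_to_zero_for_binary_columns(self):
        """Fill NaN with 0 in columns whose non-NaN values are exactly {0, 1}."""
        for col in self.df.columns:
            if not self.df[col].isna().any():
                continue
            if set(self.df[col].dropna().unique()) == {0, 1}:
                self.df[col] = self.df[col].fillna(0)

pre = DataPreprocessor(pd.DataFrame({
    "smokes": ["Yes", "No", None],
    "age": [30, 40, 50],
}))
pre.convert_yes_no_to_binary()
pre.transform_nan_to_zero_for_binary_columns()
```

After both calls, the 'smokes' column is fully binary with the missing value filled as 0, while 'age' is untouched.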
preprocess_pipeline.py
This module provides a single entry point for preprocessing data within the modeling pipeline.
- preprocess_data(df)
Preprocess a dataset by cleaning, transforming, and formatting features for modeling.
This function performs operations such as:
- Dropping irrelevant or highly sparse columns
- Converting categorical values to binary
- Normalizing behavioral features
- Handling missing values
- Renaming columns for consistency
- Converting drug test results to binary format
- Parameters:
df (pandas.DataFrame) – The raw input DataFrame from the master dataset.
- Returns:
Preprocessed DataFrame ready for modeling.
- Return type:
pandas.DataFrame
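The entry-point shape can be illustrated with a miniature version chaining two of the listed operations. The sparsity threshold and column names below are illustrative assumptions; the real preprocess_data applies the full DataPreprocessor sequence:

```python
import pandas as pd

def preprocess_data(df):
    """Miniature illustrative pipeline: drop sparse columns, binarize UDS."""
    df = df.copy()
    # Drop highly sparse columns (illustrative rule: > 50% missing).
    df = df.loc[:, df.isna().mean() <= 0.5]
    # Convert UDS drug-test results to binary format.
    for col in df.columns:
        if col.startswith("UDS"):
            df[col] = (df[col] > 0).astype(int)
    return df

raw = pd.DataFrame({
    "UDS_opioid": [0, 3, 1],
    "mostly_missing": [None, None, 1.0],
    "age": [25, 35, 45],
})
clean = preprocess_data(raw)
```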
model_training.py
This module provides the primary interface for training and evaluating outcome models in the pipeline. Depending on the selected outcome type (logical, integer, or survival), it dynamically loads the appropriate model class (Logistic Regression, Negative Binomial Regression, Cox Proportional Hazards, or Beta Regression). Each model is trained and evaluated on one or more data subsets and held-out validation data.
- train_and_evaluate_models(merged_subsets, selected_outcome, processed_data_heldout)
Train and evaluate models on each demographic or data subset and return evaluation results.
This function dynamically selects the correct model type based on the endpointType of the selected outcome. It then loops through each data subset, trains the selected model, and evaluates performance on both the subset and a held-out dataset.
- Parameters:
merged_subsets (list of pandas.DataFrame) – A list of DataFrames representing stratified or demographically-split training datasets.
selected_outcome (dict) – A dictionary containing the outcome column name(s) and the type of model to use, with keys columnsToUse (list of str — target variable columns) and endpointType (Enum — one of EndpointType.LOGICAL, EndpointType.SURVIVAL, or EndpointType.INTEGER).
processed_data_heldout (pandas.DataFrame) – The held-out dataset used for validation.
- Returns:
A multi-indexed pandas DataFrame with predictions and evaluation metrics for both the held-out and subset data.
- Return type:
pandas.DataFrame
Note
Logging is extensively used to track training and evaluation progress for each subset. Evaluation metrics vary depending on the model type (e.g., accuracy and ROC for classification, RMSE and McFadden R² for regression).
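The dynamic selection by endpointType can be sketched as a simple dispatch table, assuming EndpointType is a plain Enum (the class stubs below only mirror the documented model names):

```python
from enum import Enum, auto

class EndpointType(Enum):
    LOGICAL = auto()
    INTEGER = auto()
    SURVIVAL = auto()

# Stubs standing in for the documented model classes.
class LogisticModel: ...
class NegativeBinomialModel: ...
class CoxProportionalHazard: ...

MODEL_BY_ENDPOINT = {
    EndpointType.LOGICAL: LogisticModel,
    EndpointType.INTEGER: NegativeBinomialModel,
    EndpointType.SURVIVAL: CoxProportionalHazard,
}

selected_outcome = {"columnsToUse": ["relapse"],
                    "endpointType": EndpointType.LOGICAL}
model_cls = MODEL_BY_ENDPOINT[selected_outcome["endpointType"]]
```

A dict keyed on the enum keeps the selection logic in one place instead of an if/elif chain per model type.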
run_pipelineV2.py
This is the main pipeline orchestrator script for training, evaluating, and profiling statistical and machine learning models across demographic subsets using the CTN-0094 dataset. It supports multiple model types including logistic regression, negative binomial regression, survival analysis (Cox), and beta regression.
The script handles argument parsing, data loading, preprocessing, subset generation, model training, evaluation, and CSV logging of all results.
Functions
- main()
Entry point for the pipeline. Parses arguments, initializes outcome and seed configurations, and runs profiling or standard pipeline execution for each outcome and seed.
- argument_handler()
Parse command-line arguments including seed range, outcome name, output directory, and profiling method.
- Returns:
A tuple of (loop range, outcomes, output directory, profiling flag).
- Return type:
Tuple
- initialize_pipeline(selected_outcome)
Load and merge the demographic and outcome datasets, apply preprocessing, and prepare the data for modeling.
- Parameters:
selected_outcome (dict) – A dictionary defining the outcome variable and endpoint type.
- Returns:
Preprocessed dataset ready for modeling.
- Return type:
pandas.DataFrame
- run_pipeline(processed_data, seed, selected_outcome, directory)
Execute the core pipeline logic for one run: split the data, perform matching, create subsets, train and evaluate models, and write predictions and evaluations to CSV.
- Parameters:
processed_data (pandas.DataFrame) – Cleaned and merged input dataset.
seed (int) – Random seed for reproducibility.
selected_outcome (dict) – Dictionary describing the outcome and model type.
directory (str) – Output path for saving logs and results.
- save_evaluations_to_csv(results, seed, selected_outcome, directory, name)
Save evaluation metrics for all subsets and held-out predictions into a CSV file. Automatically adjusts headers based on model type.
- Parameters:
results (dict) – Dictionary of evaluation results from each subset.
seed (int) – The random seed used for training.
selected_outcome (dict) – The outcome configuration dict.
directory (str) – Directory to save output CSVs.
name (str) – Subfolder name for organizing evaluation files.
- save_predictions_to_csv(data, seed, selected_outcome, directory, name)
Save prediction scores for each individual across subsets and held-out data.
- Parameters:
data (list) – Prediction tuples (id, score) across subsets.
seed (int) – The random seed used for training.
selected_outcome (dict) – The outcome configuration dict.
directory (str) – Output directory for saving results.
name (str) – Folder name under which to store predictions.
Globals
- AVAILABLE_OUTCOMES
A predefined list of outcomes from the CTN-0094 dataset, each with its name, outcome column(s), and associated EndpointType.
Used for automatic selection of outcomes when not specified via command-line arguments.
CLI Usage Example
Run the pipeline for a specific outcome and seed range:
python run_pipelineV2.py --loop 42 45 --outcome Ab_ling_1998 --dir logs/run_test --prof simple
Or run all outcomes with profiling off:
python run_pipelineV2.py
validate.py
This module validates a dataset before passing it to a model, checking that required columns exist, data types are correct, and outcome values match the expected format for the chosen model type.
- validate_dataset_for_model(df, model_type, outcome_col, time_col=None)
Validate a dataset against the requirements of the specified model type.
Accepts either a string or an
EndpointType enum value for model_type. Raises ValueError with a descriptive message if any check fails.
- Parameters:
df (pandas.DataFrame) – The dataset to validate.
model_type (str or EndpointType) – Model type — one of
'logical', 'integer', 'survival', or the corresponding EndpointType enum member.
outcome_col (str) – Name of the outcome column to validate.
time_col (str, optional) – Name of the time column. Required for survival models. Default None.
- Raises:
ValueError – If the outcome column is missing, values are of the wrong type, or required survival columns are absent or non-numeric.
Validation rules by model type:
logical — outcome must contain only 0 and 1.
integer — outcome must have an integer dtype.
survival — requires a numeric time_col and a binary event column.
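The validation rules above can be sketched with pandas dtype checks (a hedged re-implementation for string-valued model_type only; the real function also accepts EndpointType members):

```python
import pandas as pd

def validate_dataset_for_model(df, model_type, outcome_col, time_col=None):
    """Raise ValueError if df fails the checks for the given model type."""
    if outcome_col not in df.columns:
        raise ValueError(f"missing outcome column: {outcome_col}")
    outcome = df[outcome_col]
    if model_type == "logical":
        if not set(outcome.dropna().unique()) <= {0, 1}:
            raise ValueError("logical outcome must contain only 0 and 1")
    elif model_type == "integer":
        if not pd.api.types.is_integer_dtype(outcome):
            raise ValueError("integer outcome must have an integer dtype")
    elif model_type == "survival":
        if time_col is None or time_col not in df.columns:
            raise ValueError("survival models require a time column")
        if not pd.api.types.is_numeric_dtype(df[time_col]):
            raise ValueError("time column must be numeric")
        if not set(outcome.dropna().unique()) <= {0, 1}:
            raise ValueError("event column must be binary")
    else:
        raise ValueError(f"unknown model type: {model_type}")
    return True

df = pd.DataFrame({"relapse": [0, 1, 0], "weeks": [4.0, 2.5, 6.0]})
ok = validate_dataset_for_model(df, "survival", "relapse", time_col="weeks")
```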
constants.py
This module defines the EndpointType enum used throughout the pipeline to identify
which type of statistical model should be applied to a given outcome.