README
Overview
Purpose: This project establishes a modular, scalable data pipeline for statistical modeling and machine learning on the CTN-0094 database. The pipeline will support multiple modeling strategies and evaluation metrics in order to examine the relationship between demographics and different target outcomes.
Team: This work is led by Prof. Laura Brandt (clinical arm) and Prof. Gabriel Odom (computational arm); Mr. Ganesh Jainarain is the primary data scientist and statistical programmer.
Funding: This work supports the completion of “Towards a framework for algorithmic bias and fairness in predicting treatment outcome for opioid use disorder” (NIH AIM-AHEAD 1OT2OD032581-02-267), with contact PI Laura Brandt, City College of New York.
Quick Start
To see the list of arguments available, run:
python3 run_pipelineV2.py --help
The predictions, logs, and evaluations folders are created in the directory specified by the -d flag. To run multiple tests, loop through a range of seeds with the -l flag. All outcomes are considered by default; use the -o flag to specify particular outcomes.
Example Usage:
python3 run_pipelineV2.py -d "C:\\Users\\John\\Desktop\\Results" -l 5 10
This command loops through all integer seeds between 5 and 10 and saves the results in the specified directory.
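As a rough illustration of how the flags described above could be parsed, here is a minimal sketch using argparse. It is not the actual code of run_pipelineV2.py: the long option names (--directory, --loop, --outcomes), the helper seeds_from_loop, and the choice to include both end points of the seed range are assumptions made for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the -d, -l, and -o flags described above."""
    parser = argparse.ArgumentParser(description="Run the modeling pipeline (sketch).")
    parser.add_argument("-d", "--directory", required=True,
                        help="Directory that receives predictions, logs, and evaluations.")
    parser.add_argument("-l", "--loop", nargs=2, type=int, metavar=("START", "STOP"),
                        help="Range of integer seeds to loop through.")
    parser.add_argument("-o", "--outcomes", nargs="+", default=None,
                        help="Outcomes to model; all outcomes when omitted.")
    return parser


def seeds_from_loop(loop):
    """Expand the two -l values into a list of seeds.

    Assumption: both end points are included; the real script may differ.
    """
    if loop is None:
        return []
    start, stop = loop
    return list(range(start, stop + 1))


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["-d", "Results", "-l", "5", "10"])
print(seeds_from_loop(args.loop))
```

Run python3 run_pipelineV2.py --help to see the arguments the script actually accepts.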
Step-by-Step Guide
Step 0: Master Dataset
Task: Maintain the “master” dataset created by joining tables from the CTN-0094 database.
Note: This dataset remains unchanged throughout the pipeline.
Step 1: Sampling
Task: Generate a dataset of 1000 samples with a specified demographic distribution.
Methods:
- Random sampling
- Partial matching
- Sophisticated matching (future implementation)
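The random sampling method could look roughly like the sketch below, which draws a fixed number of records so that each demographic group fills a requested share. The function name, the "group" key, and the decision to sample with replacement are assumptions for illustration; the pipeline's own sampling code may work differently.

```python
import random


def sample_by_distribution(records, group_key, proportions, n=1000, seed=0):
    """Draw about n records so that each group fills its requested share.

    Assumptions: records are dicts, proportions sum to 1, and sampling is
    done with replacement so that small groups can still fill their quota.
    Quotas are rounded, so the total can differ slightly from n for
    proportions that do not divide n evenly.
    """
    rng = random.Random(seed)

    # Bucket the master records by demographic group.
    by_group = {}
    for rec in records:
        by_group.setdefault(rec[group_key], []).append(rec)

    sample = []
    for group, share in proportions.items():
        quota = round(n * share)
        sample.extend(rng.choices(by_group[group], k=quota))

    rng.shuffle(sample)
    return sample


# Toy data: 200 records split evenly between two hypothetical groups.
toy = [{"id": i, "group": "A" if i % 2 else "B"} for i in range(200)]
drawn = sample_by_distribution(toy, "group", {"A": 0.7, "B": 0.3}, n=1000, seed=5)
print(len(drawn), sum(r["group"] == "A" for r in drawn))
```

Passing the seed explicitly keeps each draw reproducible, which matters when the same design point is rerun across a range of seeds.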
Step 2: Data Pre-Processing
Task: Apply standard feature engineering and data preparation.
Note: The pre-processing script will likely remain stable.
Step 3: Join Dependent Variables
Task: Merge a chosen target variable (dependent variable) with the processed independent variables.
Selection: Choose from 11 predefined target variables.
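A minimal sketch of this join, assuming both tables are lists of dicts that share an identifier column. The column name participant_id, the target name in the toy data, and the function join_outcome are hypothetical; the real pipeline may use a different key and a data-frame library.

```python
def join_outcome(features, outcomes, target, id_key="participant_id"):
    """Attach one target column to each feature row (inner join on id_key)."""
    outcome_by_id = {row[id_key]: row[target] for row in outcomes}

    joined = []
    for row in features:
        pid = row[id_key]
        if pid in outcome_by_id:
            # Build a new dict so the processed feature table is left unchanged.
            joined.append({**row, target: outcome_by_id[pid]})
    return joined


# Toy tables; "example_target" stands in for one of the predefined targets.
features = [{"participant_id": 1, "age": 34}, {"participant_id": 2, "age": 51}]
outcomes = [{"participant_id": 1, "example_target": 0},
            {"participant_id": 3, "example_target": 1}]
print(join_outcome(features, outcomes, "example_target"))
```

An inner join drops participants that lack a value for the chosen target, so the joined table can be smaller than the sampled one.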
Step 4: Machine Learning Model Selection
Task: Select an appropriate machine learning model and use it to predict target values.
Step 5: Model Evaluation
Task: Evaluate model performance using the following metrics:
- AUC
- F1 score
- RMSE
- Fairness (currently undefined)
Output: Return a tuple containing demographic composition, target variable, machine learning model, and metrics.
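To make the metrics and the output tuple concrete, the sketch below computes AUC, F1, and RMSE in plain Python and packs them into a named tuple. The names EvaluationResult, f1_score, rmse, and auc are hypothetical, as are the values in the usage example, and the pipeline itself may rely on a library for these metrics. Fairness is left out because its definition is still open.

```python
import math
from typing import NamedTuple


class EvaluationResult(NamedTuple):
    """Hypothetical container for the tuple returned by Step 5."""
    demographics: dict
    target: str
    model: str
    metrics: dict


def f1_score(y_true, y_pred):
    """F1 for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def auc(y_true, scores):
    """Share of positive/negative pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: four cases, predicted probabilities thresholded at 0.5.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(s >= 0.5) for s in scores]

result = EvaluationResult(
    demographics={"A": 0.7, "B": 0.3},
    target="example_target",
    model="logistic_lasso",
    metrics={"auc": auc(y_true, scores), "f1": f1_score(y_true, y_pred),
             "rmse": rmse(y_true, scores)},
)
print(result)
```

The pairwise AUC is quadratic in the number of cases, which is acceptable for samples of about 1000 but would be replaced by a rank-based formula for larger data.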
Step 6: Iterative Design Points
Task: Repeat Steps 1–5 across various design points.
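One way to enumerate design points is a full factorial grid over the design factors, sketched below with itertools.product. Whether the project uses a full grid is an assumption, and the factor names and example values are hypothetical.

```python
from itertools import product


def design_points(compositions, targets, models, seeds):
    """Yield one dict per combination of the design factors (full grid)."""
    for composition, target, model, seed in product(compositions, targets, models, seeds):
        yield {"composition": composition, "target": target,
               "model": model, "seed": seed}


# Toy grid: 2 compositions x 1 target x 1 model x 3 seeds = 6 design points.
grid = list(design_points(
    compositions=[{"A": 0.5, "B": 0.5}, {"A": 0.7, "B": 0.3}],
    targets=["example_target"],
    models=["logistic_lasso"],
    seeds=[5, 6, 7],
))
print(len(grid))
```

Each dict would then be passed through Steps 1 to 5, and the resulting tuples collected for comparison across design points.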
Target Outcome Buckets
Binary
Count (with a fixed max)
Proportion
Model Prioritization
The models will be implemented in the following order:
Logistic LASSO (Binary Outcomes)
- Port existing code into the pipeline.
- Save the .job file trained on the full cohort.
Negative Binomial Regression (Count Outcomes)
Sigmoidal Regression (Proportion Outcomes)
Beta Regression (Proportion Outcomes)
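The pairing of outcome buckets with models listed above can be captured in a small lookup table, sketched here. The dictionary, the string identifiers, and the helper models_for are hypothetical names chosen for this example.

```python
# Outcome buckets mapped to the model families listed above, in implementation order.
MODELS_BY_BUCKET = {
    "binary": ["logistic_lasso"],
    "count": ["negative_binomial"],
    "proportion": ["sigmoidal_regression", "beta_regression"],
}


def models_for(bucket: str) -> list:
    """Return the candidate models for an outcome bucket (hypothetical helper)."""
    try:
        return MODELS_BY_BUCKET[bucket.lower()]
    except KeyError:
        raise ValueError(f"Unknown outcome bucket: {bucket!r}") from None


print(models_for("Proportion"))
```

Keeping the mapping in one place means a newly added model only needs one new entry in the table.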
Future Direction
The immediate goal is to develop a proof-of-concept pipeline using logistic LASSO, followed by an expansion to random forests. Additional models will be integrated as needed.
References
Luo SX, Feaster DJ, Liu Y et al. _Individual‑Level Risk Prediction of Return to Use During Opioid Use Disorder Treatment_. JAMA Psychiatry. 2024;81(1):45–56. doi:10.1001/jamapsychiatry.2023.3596
Multicenter decision‑analytic prediction model using CTN trial data.