README

Overview

Purpose: This project establishes a modular, scalable data pipeline for statistical modeling and machine learning on the CTN-0094 database. The pipeline will support multiple modeling strategies and evaluation metrics in order to examine how demographics relate to different target outcomes.
Team: This work is led by Prof. Laura Brandt (clinical arm) and Prof. Gabriel Odom (computational arm); Mr. Ganesh Jainarain is the primary data scientist and statistical programmer.
Funding: This work contributes to the project “Towards a framework for algorithmic bias and fairness in predicting treatment outcome for opioid use disorder” (NIH AIM-AHEAD 1OT2OD032581-02-267; contact PI: Laura Brandt, City College of New York).

Quick Start

To see the list of arguments available, run:

python3 run_pipelineV2.py --help

The predictions, logs, and evaluations folders are created in the directory specified by the -d flag. To run multiple tests, loop over a range of seeds with the -l flag. By default, all outcomes are run; use the -o flag to select specific outcomes.

Example Usage:

python3 run_pipelineV2.py -d "C:\\Users\\John\\Desktop\\Results" -l 5 10

This command loops through all integer seeds between 5 and 10 and saves the results in the specified directory.
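
For orientation, the flags described above could be wired up with argparse roughly as follows. This is a minimal sketch, not the actual parser in run_pipelineV2.py: the long option names (--directory, --loop, --outcomes), the defaults, and the choice to include both endpoints of the seed range are assumptions made for illustration. Run the script with --help for the real interface.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the -d, -l, and -o flags described above."""
    parser = argparse.ArgumentParser(description="Run the modeling pipeline (illustrative sketch).")
    parser.add_argument("-d", "--directory", required=True,
                        help="Directory where the predictions, logs, and evaluations folders are created.")
    parser.add_argument("-l", "--loop", nargs=2, type=int, metavar=("START", "END"),
                        help="Loop over integer seeds from START to END.")
    parser.add_argument("-o", "--outcomes", nargs="+", default=None,
                        help="Outcomes to run; all outcomes when omitted.")
    return parser


def seeds_from_args(args):
    """Expand the -l range into a list of seeds (both endpoints included: an assumption)."""
    if args.loop is None:
        return []
    start, end = args.loop
    return list(range(start, end + 1))


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["-d", "Results", "-l", "5", "10"])
print(args.directory, seeds_from_args(args))
```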

Step-by-Step Guide

Step 0: Master Dataset

Task: Maintain the “master” dataset created by joining tables from the CTN-0094 database.
Note: This dataset remains unchanged throughout the pipeline.

Step 1: Sampling

Task: Generate a dataset of 1000 samples with a specified demographic distribution.
Methods:
- Random sampling
- Partial matching
- Sophisticated matching (future implementation)
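
As an illustration of the random sampling method only, the sketch below draws a fixed-size sample whose groups follow a requested demographic distribution. It is not the pipeline's sampling code: the function name, the "group" field, the group labels, and the synthetic rows are all hypothetical, and sampling without replacement within each group is an assumption.

```python
import random
from collections import defaultdict


def sample_by_demographics(rows, proportions, n, seed, key="group"):
    """Draw n rows so that each group in `proportions` gets its requested share."""
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    rng = random.Random(seed)  # seeded so a design point can be reproduced

    # Index the source rows by demographic group; the source list is not modified.
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[key]].append(row)

    # Allocate whole-number counts, then hand leftover slots to the largest remainders.
    exact = {g: p * n for g, p in proportions.items()}
    counts = {g: int(x) for g, x in exact.items()}
    leftover = n - sum(counts.values())
    by_remainder = sorted(exact, key=lambda g: exact[g] - counts[g], reverse=True)
    for g in by_remainder[:leftover]:
        counts[g] += 1

    sample = []
    for g, k in counts.items():
        if k > len(by_group[g]):
            raise ValueError(f"Not enough rows in group {g!r}: need {k}, have {len(by_group[g])}")
        sample.extend(rng.sample(by_group[g], k))  # without replacement
    rng.shuffle(sample)
    return sample


# Synthetic stand-in for the master dataset (hypothetical fields).
master = [{"id": i, "group": "A" if i % 2 == 0 else "B"} for i in range(5000)]
subset = sample_by_demographics(master, {"A": 0.7, "B": 0.3}, n=1000, seed=5)
print(len(subset))
```

Seeding the generator per call keeps each sample reproducible, which fits the seed loop described in Quick Start.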

Step 2: Data Pre-Processing

Task: Apply standard feature engineering and data preparation.
Note: The pre-processing script will likely remain stable.

Step 3: Join Dependent Variables

Task: Merge a chosen target variable (dependent variable) with the processed independent variables.
Selection: Choose from 11 predefined target variables.
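
A minimal sketch of this merge, using plain dictionaries so it runs without extra dependencies. The field names (participant_id, age, outcome_a, outcome_b) are hypothetical placeholders, not the actual CTN-0094 column names or the 11 predefined targets, and dropping participants without a recorded outcome (an inner join) is an assumption.

```python
def join_target(features, targets, outcome, id_key="participant_id"):
    """Attach one outcome column to each feature row (inner join on id_key)."""
    target_by_id = {row[id_key]: row[outcome] for row in targets}
    joined = []
    for row in features:
        pid = row[id_key]
        if pid in target_by_id:  # skip participants with no recorded outcome
            # Build a new dict so the processed feature rows are left untouched.
            joined.append({**row, "target": target_by_id[pid]})
    return joined


features = [
    {"participant_id": 1, "age": 34},
    {"participant_id": 2, "age": 51},
    {"participant_id": 3, "age": 28},
]
targets = [
    {"participant_id": 1, "outcome_a": 1, "outcome_b": 0.25},
    {"participant_id": 2, "outcome_a": 0, "outcome_b": 0.80},
]

modeling_rows = join_target(features, targets, outcome="outcome_a")
print(modeling_rows)
```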

Step 4: Machine Learning Model Selection

Task: Select an appropriate machine learning model and use it to predict target values.

Step 5: Model Evaluation

Task: Evaluate model performance using the following metrics:
- AUC
- F1 score
- RMSE
- Fairness (currently undefined)
Output: Return a tuple containing demographic composition, target variable, machine learning model, and metrics.
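
The sketch below shows one way the three defined metrics and the output tuple could fit together. It is an illustration under assumptions, not the pipeline's evaluation code: the function names, the 0.5 classification threshold, the dictionary used for the metrics, and the example values are all hypothetical. Fairness is left out because it is not yet defined.

```python
import math


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for 0/1 labels."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def evaluate(demographics, outcome, model_name, y_true, scores, threshold=0.5):
    """Return (demographic composition, target variable, model, metrics)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    metrics = {
        "auc": auc(y_true, scores),
        "f1": f1_score(y_true, y_pred),
        "rmse": rmse(y_true, scores),
    }
    return (demographics, outcome, model_name, metrics)


result = evaluate({"A": 0.7, "B": 0.3}, "outcome_a", "Logistic LASSO",
                  y_true=[1, 0, 1, 0], scores=[0.9, 0.4, 0.35, 0.2])
print(result)
```

For a binary outcome, RMSE is computed here on the predicted probabilities; for count or proportion outcomes it would be computed on the predicted values directly.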

Step 6: Iterative Design Points

Task: Repeat Steps 1–5 across various design points.
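
One way to enumerate design points is as the Cartesian product of the factors being varied. The sketch assumes a design point is a combination of seed, demographic mix, and target outcome; that definition, the placeholder function, and the example values are hypothetical.

```python
from itertools import product


def run_design_point(seed, proportions, outcome):
    """Placeholder for Steps 1-5; returns a label instead of real results."""
    return (seed, tuple(sorted(proportions.items())), outcome)


seeds = range(5, 11)  # seeds 5 through 10
demographic_mixes = [{"A": 0.5, "B": 0.5}, {"A": 0.7, "B": 0.3}]
outcomes = ["outcome_a", "outcome_b"]

# Every combination of seed, demographic mix, and outcome is one design point.
results = [
    run_design_point(seed, mix, outcome)
    for seed, mix, outcome in product(seeds, demographic_mixes, outcomes)
]
print(len(results))  # 6 seeds x 2 mixes x 2 outcomes = 24
```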

Target Outcome Buckets

Binary
Count (with a fixed max)
Proportion

Model Prioritization

The models will be implemented in the following order:

1. Logistic LASSO (Binary Outcomes)
   - Port existing code into the pipeline.
   - Save the .job file trained on the full cohort.
2. Negative Binomial Regression (Count Outcomes)
3. Sigmoidal Regression (Proportion Outcomes)
4. Beta Regression (Proportion Outcomes)
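
The pairing of outcome buckets with models can be expressed as a small lookup that preserves the implementation order. This is only a sketch: the names MODEL_PRIORITY and models_for and the lowercase bucket labels are hypothetical, and the entries are labels rather than fitted models.

```python
# (model, outcome bucket) pairs in the implementation order listed above.
MODEL_PRIORITY = [
    ("Logistic LASSO", "binary"),
    ("Negative Binomial Regression", "count"),
    ("Sigmoidal Regression", "proportion"),
    ("Beta Regression", "proportion"),
]


def models_for(bucket):
    """Return the models planned for an outcome bucket, in priority order."""
    matches = [name for name, model_bucket in MODEL_PRIORITY if model_bucket == bucket]
    if not matches:
        raise ValueError(f"Unknown outcome bucket: {bucket!r}")
    return matches


print(models_for("proportion"))
```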

Future Direction

The immediate goal is to develop a proof-of-concept pipeline using logistic LASSO, followed by an expansion to random forests. Additional models will be integrated as needed.

References

Luo SX, Feaster DJ, Liu Y et al. _Individual‑Level Risk Prediction of Return to Use During Opioid Use Disorder Treatment_. JAMA Psychiatry. 2024;81(1):45–56. doi:10.1001/jamapsychiatry.2023.3596

Multicenter decision‑analytic prediction model using CTN trial data.