README

Overview

  • Purpose: Establish a modular, scalable data pipeline for statistical modeling and machine learning on the CTN-0094 database. The pipeline supports multiple modeling strategies and evaluation metrics to deliver insights into the relationship between demographics and different target outcomes.

  • Team: This work is led by Prof. Laura Brandt (clinical arm) and Prof. Gabriel Odom (computational arm); Mr. Ganesh Jainarain is the primary data scientist and statistical programmer.

  • Funding: This work is conducted toward the successful completion of “Towards a framework for algorithmic bias and fairness in predicting treatment outcome for opioid use disorder” (NIH AIM-AHEAD 1OT2OD032581-02-267), with contact PI Laura Brandt, City College of New York.

Quick Start

To see the list of arguments available, run:

python3 run_pipelineV2.py --help

Predictions, logs, and evaluations folders will be created in the directory specified by the -d flag. When running multiple tests, you can loop through a range of seeds using the -l flag. By default, all outcomes will be considered unless specified using the -o flag.

Example Usage:

python3 run_pipelineV2.py -d "C:\\Users\\John\\Desktop\\Results" -l 5 10

This command loops through all integer seeds between 5 and 10 and saves the results in the specified directory.
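
To restrict a run to a single outcome, add the -o flag. The value below is a placeholder; replace <outcome_name> with one of the 11 predefined target variables (see --help for the exact argument syntax).

python3 run_pipelineV2.py -d "C:\\Users\\John\\Desktop\\Results" -l 5 10 -o <outcome_name>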

Step-by-Step Guide

Step 0: Master Dataset

  • Task: Maintain the “master” dataset created by joining tables from the CTN-0094 database.

  • Note: This dataset remains unchanged throughout the pipeline.

Step 1: Sampling

  • Task: Generate a dataset of 1000 samples with a specified demographic distribution.

  • Methods:
    - Random sampling
    - Partial matching
    - Sophisticated matching (future implementation)
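
As a rough illustration of the random-sampling method, the sketch below draws 1000 records whose demographic mix follows a requested distribution. The function, column, and group names are placeholders, not the actual CTN-0094 schema or pipeline code.

    import pandas as pd

    def sample_by_demographic(master, column, distribution, n=1000, seed=42):
        """Draw n rows so that `column` roughly follows `distribution` (proportions sum to 1)."""
        parts = []
        for group, proportion in distribution.items():
            k = round(n * proportion)
            pool = master[master[column] == group]
            # Sample with replacement only if the group has fewer rows than its quota.
            parts.append(pool.sample(n=k, random_state=seed, replace=len(pool) < k))
        # Shuffle so rows are not ordered by group.
        return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

    # e.g. a 50/30/20 mix across three placeholder groups:
    # sampled = sample_by_demographic(master_df, "race", {"A": 0.5, "B": 0.3, "C": 0.2})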

Step 2: Data Pre-Processing

  • Task: Apply standard feature engineering and data preparation.

  • Note: The pre-processing script will likely remain stable.
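
For orientation, here is a minimal pre-processing sketch using scikit-learn (impute and scale numeric features, impute and one-hot encode categorical ones). The column names are placeholders, and the actual script's feature engineering may differ.

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    numeric_cols = ["age"]              # placeholder column names
    categorical_cols = ["sex", "race"]

    preprocess = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric_cols),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
    ])

    # X_processed = preprocess.fit_transform(sampled_df)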

Step 3: Join Dependent Variables

  • Task: Merge a chosen target variable (dependent variable) with the processed independent variables.

  • Selection: Choose from 11 predefined target variables.
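
A minimal sketch of the join, assuming the feature table and the outcome table are pandas DataFrames sharing a participant identifier; the key and outcome names are placeholders rather than the real CTN-0094 column names.

    def join_outcome(features, outcomes, target, key="patient_id"):
        """Inner-join a single target column onto the processed feature table."""
        return features.merge(outcomes[[key, target]], on=key, how="inner")

    # model_df = join_outcome(features_df, outcomes_df, target="abstinence")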

Step 4: Machine Learning Model Selection

  • Task: Select an appropriate machine learning model and use it to predict target values.
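
One way to organize this step is a small registry keyed by the outcome buckets listed below, following the model prioritization further down. The identifiers here are placeholders for the pipeline's real selection logic.

    # Placeholder registry: first-priority model family for each outcome bucket.
    MODEL_BY_BUCKET = {
        "binary": "logistic_lasso",
        "count": "negative_binomial",
        "proportion": "sigmoidal_regression",
    }

    def select_model(bucket):
        """Return the model identifier to fit for a given outcome bucket."""
        return MODEL_BY_BUCKET[bucket]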

Step 5: Model Evaluation

  • Task: Evaluate model performance using the following metrics:
    - AUC
    - F1 score
    - RMSE
    - Fairness (currently undefined)

  • Output: Return a tuple containing demographic composition, target variable, machine learning model, and metrics.
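
A minimal evaluation sketch using scikit-learn: AUC and F1 apply to binary outcomes, RMSE to count or proportion outcomes, and the fairness metric is omitted because it is not yet defined. The function and argument names are illustrative only.

    import numpy as np
    from sklearn.metrics import f1_score, mean_squared_error, roc_auc_score

    def evaluate(y_true, y_label, y_score, demographics, target, model_name):
        """Return the (demographics, target, model, metrics) tuple for one run."""
        metrics = {
            "auc": roc_auc_score(y_true, y_score),
            "f1": f1_score(y_true, y_label),
            "rmse": float(np.sqrt(mean_squared_error(y_true, y_score))),
        }
        return (demographics, target, model_name, metrics)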

Step 6: Iterative Design Points

  • Task: Repeat Steps 1–5 across various design points.
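
A sketch of the iteration, assuming each design point is a (seed, demographic mix, target, model) combination; run_design_point stands in for Steps 1–5 chained together, and every name here is a placeholder.

    from itertools import product

    def run_design_point(seed, mix, target, model):
        """Placeholder for Steps 1-5 run end to end; returns the Step 5 tuple."""
        ...

    seeds = range(5, 11)
    mixes = [{"A": 0.5, "B": 0.5}, {"A": 0.8, "B": 0.2}]
    targets = ["abstinence"]          # placeholder for one of the 11 outcomes
    models = ["logistic_lasso"]

    results = [run_design_point(s, m, t, mdl)
               for s, m, t, mdl in product(seeds, mixes, targets, models)]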

Target Outcome Buckets

  1. Binary

  2. Count (with a fixed max)

  3. Proportion

Model Prioritization

The models will be implemented in the following order:

  1. Logistic LASSO (Binary Outcomes)
     - Port existing code into the pipeline (see the sketch after this list).
     - Save the .job file trained on the full cohort.

  2. Negative Binomial Regression (Count Outcomes)

  3. Sigmoidal Regression (Proportion Outcomes)

  4. Beta Regression (Proportion Outcomes)
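
A minimal sketch of item 1 above, assuming scikit-learn and joblib; the hyperparameters, variable names, and output filename are placeholders, and the ported code may look different.

    import joblib
    from sklearn.linear_model import LogisticRegressionCV

    # L1-penalized ("LASSO") logistic regression with a cross-validated penalty strength.
    lasso = LogisticRegressionCV(
        penalty="l1",
        solver="liblinear",   # liblinear supports the L1 penalty
        Cs=10,                # grid of candidate regularization strengths
        cv=5,
        scoring="roc_auc",
    )

    # lasso.fit(X_full_cohort, y_full_cohort)
    # joblib.dump(lasso, "logistic_lasso_full_cohort.job")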

Future Direction

The immediate goal is to develop a proof-of-concept pipeline using logistic LASSO, followed by an expansion to random forests. Additional models will be integrated as needed.

References

Luo SX, Feaster DJ, Liu Y, et al. Individual-Level Risk Prediction of Return to Use During Opioid Use Disorder Treatment. JAMA Psychiatry. 2024;81(1):45–56. doi:10.1001/jamapsychiatry.2023.3596

Multicenter decision‑analytic prediction model using CTN trial data.
