README
======

Overview
--------

- **Purpose**: The purpose of this project is to establish a modular, scalable data pipeline for statistical modeling and machine learning on the CTN-0094 database. The pipeline will support various modeling strategies and evaluation metrics to deliver insights into the relationship between demographics and different target outcomes.
- **Team**: This work is led by Prof. Laura Brandt (clinical arm) and Prof. Gabriel Odom (computational arm); Mr. Ganesh Jainarain is the primary data scientist and statistical programmer.
- **Funding**: This work is towards the successful completion of *"Towards a framework for algorithmic bias and fairness in predicting treatment outcome for opioid use disorder"* (NIH AIM-AHEAD 1OT2OD032581-02-267) with contact PI Laura Brandt, City College of New York.


Quick Start
-----------

To see the list of arguments available, run:

.. code-block:: python

   python3 run_pipelineV2.py --help

Predictions, logs, and evaluations folders will be created in the directory specified by the `-d` flag. When running multiple tests, you can loop through a range of seeds using the `-l` flag. By default, all outcomes will be considered unless specified using the `-o` flag.

**Example Usage**:

.. code-block:: python

   python3 run_pipelineV2.py -d "C:\\Users\\John\\Desktop\\Results" -l 5 10

This command loops through all integer seeds between 5 and 10 and saves the results in the specified directory.

Step-by-Step Guide
------------------

Step 0: Master Dataset
^^^^^^^^^^^^^^^^^^^^^^

- **Task**: Maintain the "master" dataset created by joining tables from the CTN-0094 database.
- **Note**: This dataset remains unchanged throughout the pipeline.

Step 1: Sampling
^^^^^^^^^^^^^^^^

- **Task**: Generate a dataset of 1000 samples with a specified demographic distribution.
- **Methods**:
  - Random sampling
  - Partial matching
  - Sophisticated matching (future implementation)

Step 2: Data Pre-Processing
^^^^^^^^^^^^^^^^^^^^^^^^^^^

- **Task**: Apply standard feature engineering and data preparation.
- **Note**: The pre-processing script will likely remain stable.

Step 3: Join Dependent Variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- **Task**: Merge a chosen target variable (dependent variable) with the processed independent variables.
- **Selection**: Choose from 11 predefined target variables.

Step 4: Machine Learning Model Selection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- **Task**: Select an appropriate machine learning model and use it to predict target values.

Step 5: Model Evaluation
^^^^^^^^^^^^^^^^^^^^^^^^

- **Task**: Evaluate model performance using the following metrics:
  - AUC
  - F1 score
  - RMSE
  - Fairness (currently undefined)

- **Output**: Return a tuple containing demographic composition, target variable, machine learning model, and metrics.

Step 6: Iterative Design Points
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- **Task**: Repeat Steps 1–5 across various design points.

Target Outcome Buckets
----------------------

1. **Binary**  
2. **Count (with a fixed max)**  
3. **Proportion**  

Model Prioritization
--------------------

The models will be implemented in the following order:

1. **Logistic LASSO (Binary Outcomes)**
   - Port existing code into the pipeline.
   - Save the `.job` file trained on the full cohort.

2. **Negative Binomial Regression (Count Outcomes)**

3. **Sigmoidal Regression (Proportion Outcomes)**

4. **Beta Regression (Proportion Outcomes)**

Future Direction
----------------

The immediate goal is to develop a proof-of-concept pipeline using logistic LASSO, followed by an expansion to random forests. Additional models will be integrated as needed.

References
==========

**Luo SX, Feaster DJ, Liu Y et al. _Individual‑Level Risk Prediction of Return to Use During Opioid Use Disorder Treatment_. JAMA Psychiatry. 2024;81(1):45–56. doi:10.1001/jamapsychiatry.2023.3596**

Multicenter decision‑analytic prediction model using CTN trial data.

.. image:: https://img.shields.io/badge/View–JAMA%20Psychiatry-blue
   :target: https://jamanetwork.com/journals/jamapsychiatry/fullarticle/2810311
   :alt: View full article on JAMA Psychiatry