Using the Pipeline: A Step-by-Step Walkthrough

This page walks through a full end-to-end run of the CTN-0094 Pipeline, from setup to interpreting results, in the style of a Colab notebook. You can follow along locally or run the cells in Google Colab.

Setup

Cell 1 — Clone the repository

import os

if not os.path.exists("/content/Pipeline"):
    !git clone https://github.com/CTN-0094/Pipeline.git

os.chdir("/content/Pipeline")

 Cloning into 'Pipeline'...
 remote: Enumerating objects: 412, done.
 remote: Counting objects: 100% (412/412), done.
 Resolving deltas: 100% (231/231), done.

Cell 2 — Install dependencies

This installs all Python packages required by the pipeline.

!pip install -r requirements.txt -q

 Successfully installed lifelines-0.29.0 psmpy-0.3.13 statsmodels-0.14.1 ...

Cell 3 — Verify the install

!python -c "from src.train_model import LogisticModel; print('Import OK')"

 Import OK

Running the Pipeline

Cell 4 — Run with a single outcome

The minimum required arguments are --data, which points to your cleaned CSV file, and --outcome, which names the endpoint to model. The -d flag sets the directory where results are written.

!python run_pipelineV2.py \
    --data /content/my_data.csv \
    --outcome ctn0094_relapse_event \
    -d /content/results

 Processing subset 1 ____________________________________________________________
 Lasso feature selection completed. Selected 8 out of 24 features.
 Features are:  ['age', 'is_female', 'RaceEth', 'UDS_Opioid_Count', ...]

 Processing subset 2 ____________________________________________________________
 Lasso feature selection completed. Selected 6 out of 24 features.
 ...
 ________________________________________________________________________

 Elapsed time: 12.43 seconds
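
The `!` prefix only works in Colab or IPython. For a local run from a plain Python script, the same call can be assembled as an argument list. This is a minimal sketch that uses only the script name and flags shown above; the helper `build_command` and the relative paths are illustrative and not part of the pipeline.

```python
import sys

def build_command(data_path, outcome, results_dir):
    """Assemble the argument list for a single-outcome run (hypothetical helper)."""
    return [
        sys.executable, "run_pipelineV2.py",
        "--data", data_path,
        "--outcome", outcome,
        "-d", results_dir,
    ]

# Example paths are placeholders; replace them with your own
cmd = build_command("my_data.csv", "ctn0094_relapse_event", "results")
print(" ".join(cmd[1:]))

# To execute it from the repository root:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when a path contains spaces.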

Cell 5 — Run across multiple seeds

Use -l <min> <max> to loop over a range of random seeds. The upper bound is exclusive, so -l 1 10 runs seeds 1 through 9. Each seed produces a different PSM cohort, which lets you measure how stable the results are under sampling variation.

!python run_pipelineV2.py \
    --data /content/my_data.csv \
    --outcome ctn0094_relapse_event \
    -l 1 10 \
    -d /content/results

Note

This will run the full pipeline 9 times (seeds 1 through 9) and save a separate results folder for each seed under /content/results.
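
The count in this note follows from a half-open range, the same convention Python's `range` uses. The sketch below assumes that `-l <min> <max>` maps directly onto `range(min, max)`; that is inferred from the seeds listed on this page, not from the pipeline source, and `seeds_for` is a hypothetical helper.

```python
def seeds_for(min_seed, max_seed):
    """Seeds implied by -l <min> <max>, assuming the upper bound is excluded."""
    return list(range(min_seed, max_seed))

seeds = seeds_for(1, 10)
print(seeds)       # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(len(seeds))  # 9
```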

Cell 6 — Run multiple outcomes at once

Combine -l and -o to sweep over several outcomes and several seeds in one call.

!python run_pipelineV2.py \
    --data /content/my_data.csv \
    -l 0 5 \
    -o ctn0094_relapse_event Ab_ling_1998 Rs_johnson_1992 \
    -d /content/results

 Running outcome: ctn0094_relapse_event | seed 0
 Running outcome: ctn0094_relapse_event | seed 1
 ...
 Running outcome: Rs_johnson_1992 | seed 4
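
To estimate how many runs a sweep will launch before starting it, the outcome and seed combinations can be enumerated up front. This sketch assumes the outcome-major ordering suggested by the log above and an exclusive upper seed bound; it does not call the pipeline.

```python
from itertools import product

outcomes = ["ctn0094_relapse_event", "Ab_ling_1998", "Rs_johnson_1992"]
seeds = range(0, 5)  # -l 0 5, upper bound assumed exclusive

# Every (outcome, seed) pair, with the outcome varying slowest
runs = list(product(outcomes, seeds))

print(len(runs))  # 15
print(runs[0])    # ('ctn0094_relapse_event', 0)
print(runs[-1])   # ('Rs_johnson_1992', 4)
```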

Cell 7 — Preprocess and match only (no model training)

Use --data_only to stop after propensity score matching and save the ML-ready datasets without training any models. This is useful for inspecting cohort balance before committing to a full run.

!python run_pipelineV2.py \
    --data /content/my_data.csv \
    --outcome ctn0094_relapse_event \
    --data_only \
    -d /content/psm_output

 Preprocessing complete. PSM subsets saved to /content/psm_output/
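
One common way to inspect cohort balance is the standardized mean difference (SMD) of each matching covariate between the two groups. The layout of the saved PSM files is not described on this page, so the sketch below works on made-up in-memory values instead of reading pipeline output; the function name and the data are illustrative.

```python
from math import sqrt
from statistics import mean, variance

def standardized_mean_difference(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(group_a) + variance(group_b)) / 2)
    return (mean(group_a) - mean(group_b)) / pooled_sd

# Toy ages for two matched groups (made-up numbers, not pipeline output)
ages_group_a = [34, 41, 29, 38, 45, 31]
ages_group_b = [35, 40, 30, 37, 44, 33]

smd = standardized_mean_difference(ages_group_a, ages_group_b)
print(f"SMD for age: {smd:.3f}")
```

An absolute SMD below about 0.1 is a widely used rule of thumb for acceptable balance, although the right threshold depends on the study.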

Customising the Cohort

Cell 8 — Change cohort size and held-out composition

The pipeline defaults to groups of 500 matched participants and a held-out set of 100 (58% majority). You can override both.

!python run_pipelineV2.py \
    --data /content/my_data.csv \
    --outcome ctn0094_relapse_event \
    --group_size 300 \
    --heldout_size 80 \
    --heldout_set_percent_majority 50 \
    -d /content/results
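
The held-out composition is simple arithmetic on the size and the majority percentage. The sketch below shows the counts implied by the defaults and by the override above. How the pipeline rounds fractional counts is not stated here, so the use of `round` is an assumption, and `heldout_counts` is a hypothetical helper.

```python
def heldout_counts(heldout_size, percent_majority):
    """Split a held-out set into (majority, minority) counts; rounding is an assumption."""
    majority = round(heldout_size * percent_majority / 100)
    return majority, heldout_size - majority

print(heldout_counts(100, 58))  # (58, 42)  defaults
print(heldout_counts(80, 50))   # (40, 40)  override from the cell above
```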

Cell 9 — Change the split column or matching covariates

By default the pipeline splits on RaceEth and matches on age and is_female. Change these with --split and --match.

!python run_pipelineV2.py \
    --data /content/my_data.csv \
    --outcome ctn0094_relapse_event \
    --split RaceEth \
    --match "age is_female UDS_Opioid_Count" \
    -d /content/results

Using a Custom Outcome

Cell 10 — Bring your own outcome column

If your dataset contains a column that is not in the built-in outcome list, pass its name with --outcome (short form -o, as in Cell 6) and specify the endpoint type with --type.

`--type`	When to use
`logical`	Binary outcome (0/1)
`integer`	Count outcome (0, 1, 2, …)
`survival`	Time-to-event; expects two columns: `<name>_time` and `<name>_event`

# Binary custom outcome
!python run_pipelineV2.py \
    --data /content/my_data.csv \
    --outcome my_custom_outcome \
    --type logical \
    -d /content/results

# Survival custom outcome
# Expects columns: my_event_time  and  my_event_event
!python run_pipelineV2.py \
    --data /content/my_data.csv \
    --outcome my_event \
    --type survival \
    -d /content/results
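
Before launching a survival run, it can help to confirm that both expected columns are present in the CSV header. This is a minimal sketch using only the standard library; the helper name and the in-memory sample data are made up for illustration, and the check is not something the pipeline is known to perform itself.

```python
import csv
import io

def missing_survival_columns(csv_file, outcome_name):
    """Return the survival columns for outcome_name that are absent from the CSV header."""
    header = next(csv.reader(csv_file))
    required = [f"{outcome_name}_time", f"{outcome_name}_event"]
    return [col for col in required if col not in header]

# In-memory stand-in for a data file (made-up columns and values)
sample = io.StringIO("age,is_female,my_event_time,my_event_event\n34,1,12,0\n")
print(missing_survival_columns(sample, "my_event"))  # []

sample.seek(0)
print(missing_survival_columns(sample, "other"))     # ['other_time', 'other_event']
```

With a real file, pass an open file handle such as `open("/content/my_data.csv", newline="")` in place of the `StringIO` object.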

Viewing the Results

Cell 11 — List output files

import os

for root, dirs, files in os.walk("/content/results"):
    level = root.replace("/content/results", "").count(os.sep)
    indent = "  " * level
    print(f"{indent}{os.path.basename(root)}/")
    for f in files:
        print(f"  {indent}{f}")

 results/
   ctn0094_relapse_event_seed0/
     evaluations/
       evaluations_2025-01-15.csv
     predictions/
       predictions_2025-01-15.csv
     logs/
       pipeline_2025-01-15.log
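
When many runs share one results directory, the outcome and seed can be recovered from each folder name. The sketch below assumes the `<outcome>_seed<N>` naming seen in the listing above holds for every run; the pattern and the helper are illustrative and not taken from the pipeline source.

```python
import re

# Assumed folder naming: <outcome>_seed<N>
FOLDER_PATTERN = re.compile(r"^(?P<outcome>.+)_seed(?P<seed>\d+)$")

def parse_run_folder(name):
    """Split a folder name such as 'ctn0094_relapse_event_seed0' into (outcome, seed)."""
    match = FOLDER_PATTERN.match(name)
    if match is None:
        return None  # not a run folder, e.g. 'logs'
    return match.group("outcome"), int(match.group("seed"))

print(parse_run_folder("ctn0094_relapse_event_seed0"))  # ('ctn0094_relapse_event', 0)
print(parse_run_folder("logs"))                          # None
```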

Cell 12 — Load and inspect evaluations

import glob

import pandas as pd

# Collect every evaluations CSV under the results tree, in a stable order
files = sorted(glob.glob("/content/results/**/evaluations/*.csv", recursive=True))
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
df.head()

 ╔══════════════╦══════════╦══════════╦══════════╦═══════════════════════╗
 ║  Data Type   ║  roc     ║ precision║  recall  ║  training_demographics║
 ╠══════════════╬══════════╬══════════╬══════════╬═══════════════════════╣
 ║  heldout     ║  0.71    ║  0.68    ║  0.74    ║  250 White, 250 Black ║
 ║  subset      ║  0.69    ║  0.65    ║  0.71    ║  250 White, 250 Black ║
 ╚══════════════╩══════════╩══════════╩══════════╩═══════════════════════╝

Cell 13 — Plot ROC-AUC across seeds

import matplotlib.pyplot as plt

# Select rows by label rather than by position, so the split does not depend on row order
heldout = df[df["Data Type"] == "heldout"]["roc"].values
subset  = df[df["Data Type"] == "subset"]["roc"].values

plt.figure(figsize=(8, 4))
plt.plot(heldout, marker="o", label="Held-out")
plt.plot(subset,  marker="s", label="Subset")
plt.axhline(0.5, linestyle="--", color="gray", label="Chance")
plt.xlabel("Run index (one per seed)")
plt.ylabel("ROC-AUC")
plt.title("Model performance across PSM seeds")
plt.legend()
plt.tight_layout()
plt.show()

Interpreting Results

The pipeline is designed to detect algorithmic bias — a disparity in model performance when trained on cohorts with different demographic compositions.

Result	Interpretation
ROC-AUC stable across seeds	The outcome is measurement invariant — model performance does not depend on cohort demographics.
ROC-AUC varies across seeds	The outcome is measurement variant — model performance is sensitive to demographic composition, which may indicate bias.
Subset AUC >> Held-out AUC	Possible over-fitting to the matched cohort; consider increasing `--group_size` or reducing features.
No features selected (Lasso error)	Regularisation may be too strong; the alpha defaults are tuned for the CTN-0094 dataset and may need adjustment for other data.
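
To put a number on "stable" versus "varies", the spread of ROC-AUC across seeds can be summarised with a mean, a standard deviation and a range. This page does not define a cut-off, so the sketch below only computes the summary and leaves the threshold to you. The values are made up for illustration and are not real pipeline results.

```python
from statistics import mean, stdev

def summarize_auc(values):
    """Mean, sample standard deviation and range of ROC-AUC values across seeds."""
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "range": max(values) - min(values),
    }

# Made-up held-out ROC-AUC values for five seeds
heldout_auc = [0.71, 0.70, 0.72, 0.69, 0.71]

summary = summarize_auc(heldout_auc)
print({key: round(value, 3) for key, value in summary.items()})
```

With the evaluations DataFrame from Cell 12, the same function can be applied to the `roc` values of the held-out rows and of the subset rows separately.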

Available Outcomes Reference

Outcome	Type	Description
`ctn0094_relapse_event`	Logical	Any positive urine drug screen
`Ab_krupitskyA_2011`	Logical	Confirmed abstinence (weeks 5–24)
`Ab_ling_1998`	Logical	13 consecutive negative UDS
`Rs_johnson_1992`	Logical	2 consecutive positive UDS after 4-week treatment
`Rs_krupitsky_2004`	Logical	3 consecutive positive UDS
`Rd_kostenB_1993`	Logical	3 consecutive negative UDS
`Ab_schottenfeldB_2008`	Integer	Count-based abstinence measure
`Ab_mokri_2016`	Survival	Time-to-event abstinence outcome