Using the Pipeline: A Step-by-Step Walkthrough
This page walks through a full end-to-end run of the CTN-0094 Pipeline, from setup to interpreting results, in a Colab notebook style. You can follow along locally or run the cells in Google Colab.
Setup
Cell 1 — Clone the repository
import os
if not os.path.exists("/content/Pipeline"):
    !git clone https://github.com/CTN-0094/Pipeline.git
os.chdir("/content/Pipeline")
Cloning into 'Pipeline'...
remote: Enumerating objects: 412, done.
remote: Counting objects: 100% (412/412), done.
Resolving deltas: 100% (231/231), done.
Cell 2 — Install dependencies
This installs all Python packages required by the pipeline.
!pip install -r requirements.txt -q
Successfully installed lifelines-0.29.0 psmpy-0.3.13 statsmodels-0.14.1 ...
Cell 3 — Verify the install
!python -c "from src.train_model import LogisticModel; print('Import OK')"
Import OK
Running the Pipeline
Cell 4 — Run with a single outcome
The minimum required arguments are --data, pointing to your cleaned CSV file, and --outcome, specifying which endpoint to model. The -d option used here sets the directory where results are written.
!python run_pipelineV2.py \
--data /content/my_data.csv \
--outcome ctn0094_relapse_event \
-d /content/results
Processing subset 1 ____________________________________________________________
Lasso feature selection completed. Selected 8 out of 24 features.
Features are: ['age', 'is_female', 'RaceEth', 'UDS_Opioid_Count', ...]
Processing subset 2 ____________________________________________________________
Lasso feature selection completed. Selected 6 out of 24 features.
...
________________________________________________________________________
Elapsed time: 12.43 seconds
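When following along locally, the ! prefix is not available outside a notebook. A minimal sketch, assuming you run from the repository root, builds the same Cell 4 command as an argument list for subprocess.run. The helper name build_command is hypothetical and not part of the pipeline; it uses only the flags shown on this page.

```python
import sys

def build_command(data, outcome, out_dir, script="run_pipelineV2.py"):
    """Return the Cell 4 command as an argv list (hypothetical helper)."""
    return [
        sys.executable, script,
        "--data", str(data),
        "--outcome", str(outcome),
        "-d", str(out_dir),
    ]

cmd = build_command("/content/my_data.csv", "ctn0094_relapse_event", "/content/results")
print(" ".join(cmd[1:]))

# To execute locally (not run here):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list instead of a single string avoids shell quoting problems when paths contain spaces.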
Cell 5 — Run across multiple seeds
Use -l <min> <max> to loop through a range of random seeds. Judging by the sample runs on this page, <max> is exclusive, so -l 1 10 covers seeds 1 through 9. Each seed produces a different PSM cohort, letting you measure stability across sampling variation.
!python run_pipelineV2.py \
--data /content/my_data.csv \
--outcome ctn0094_relapse_event \
-l 1 10 \
-d /content/results
Note
This will run the full pipeline 9 times (seeds 1 through 9) and save a separate results folder for each seed under /content/results.
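The count in the note follows if -l <min> <max> behaves like Python's half-open range(min, max). That is an inference from the note and the sample output on this page, not documented behaviour, and seeds_for is a hypothetical helper used only for illustration.

```python
# Assumption: -l <min> <max> iterates like range(min, max), so <max> is excluded.
def seeds_for(min_seed, max_seed):
    """List the seeds a -l <min> <max> call would cover under that assumption."""
    return list(range(min_seed, max_seed))

seeds = seeds_for(1, 10)
print(len(seeds), seeds[0], seeds[-1])  # prints: 9 1 9
```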
Cell 6 — Run multiple outcomes at once
Combine -l and -o to sweep over several outcomes and several seeds in one call. List the outcome names after -o, separated by spaces.
!python run_pipelineV2.py \
--data /content/my_data.csv \
-l 0 5 \
-o ctn0094_relapse_event Ab_ling_1998 Rs_johnson_1992 \
-d /content/results
Running outcome: ctn0094_relapse_event | seed 0
Running outcome: ctn0094_relapse_event | seed 1
...
Running outcome: Rs_johnson_1992 | seed 4
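To estimate how many runs a sweep will launch before starting it, you can enumerate the outcome and seed combinations yourself. This sketch assumes the seed range is half-open (seeds 0 to 4 for -l 0 5) and that runs are ordered outcome by outcome, as the log lines above suggest.

```python
from itertools import product

outcomes = ["ctn0094_relapse_event", "Ab_ling_1998", "Rs_johnson_1992"]
seeds = range(0, 5)  # assumption: -l 0 5 covers seeds 0..4, as in the log above

# Outcome-major order: every seed for the first outcome, then the next outcome.
runs = list(product(outcomes, seeds))
for outcome, seed in runs[:2]:
    print(f"Running outcome: {outcome} | seed {seed}")
print(f"{len(runs)} runs in total")
```

Multiplying this count by the elapsed time of a single run gives a rough estimate of the total runtime.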
Cell 7 — Preprocess and match only (no model training)
Use --data_only to stop after propensity score matching and save the ML-ready datasets without training any models. This is useful for inspecting cohort balance before committing to a full run.
!python run_pipelineV2.py \
--data /content/my_data.csv \
--outcome ctn0094_relapse_event \
--data_only \
-d /content/psm_output
Preprocessing complete. PSM subsets saved to /content/psm_output/
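One common way to inspect cohort balance is the standardized mean difference (SMD) of each matching covariate between the two groups. The sketch below is not part of the pipeline. Because the layout of the saved PSM files is not described here, it uses a toy DataFrame with the default column names from this page (RaceEth, age, is_female) and placeholder group labels.

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(df, group_col, covariate):
    """SMD between the two levels of group_col for one numeric covariate."""
    levels = sorted(df[group_col].unique())
    if len(levels) != 2:
        raise ValueError("expected exactly two groups")
    a = df.loc[df[group_col] == levels[0], covariate].astype(float)
    b = df.loc[df[group_col] == levels[1], covariate].astype(float)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    if pooled_sd == 0:
        return float("nan")  # undefined when neither group has any variance
    return float((a.mean() - b.mean()) / pooled_sd)

# Toy data standing in for one matched subset (not real pipeline output).
toy = pd.DataFrame({
    "RaceEth": ["A", "A", "A", "B", "B", "B"],
    "age": [30, 40, 50, 31, 40, 49],
    "is_female": [1, 0, 1, 1, 0, 1],
})
for cov in ["age", "is_female"]:
    print(cov, round(standardized_mean_difference(toy, "RaceEth", cov), 3))
```

A frequently used rule of thumb treats an absolute SMD below 0.1 as adequately balanced, though the appropriate threshold depends on the study.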
Customising the Cohort
Cell 8 — Change cohort size and held-out composition
The pipeline defaults to groups of 500 matched participants and a held-out set of 100 (58% majority). You can override both.
!python run_pipelineV2.py \
--data /content/my_data.csv \
--outcome ctn0094_relapse_event \
--group_size 300 \
--heldout_size 80 \
--heldout_set_percent_majority 50 \
-d /content/results
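To see what a given held-out configuration means in participants, the percentage can be converted to counts. This sketch assumes the percentage is applied to the held-out size and rounded to the nearest whole participant; the pipeline's own rounding may differ, and heldout_counts is a hypothetical helper.

```python
def heldout_counts(heldout_size, percent_majority):
    """Split a held-out set into (majority, minority) counts.

    Assumption: the percentage applies to the held-out size and is rounded
    to the nearest whole participant.
    """
    majority = round(heldout_size * percent_majority / 100)
    return majority, heldout_size - majority

print(heldout_counts(100, 58))  # defaults described above -> (58, 42)
print(heldout_counts(80, 50))   # overrides from Cell 8   -> (40, 40)
```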
Cell 9 — Change the split column or matching covariates
By default the pipeline splits on RaceEth and matches on age and is_female. Change these with --split and --match.
!python run_pipelineV2.py \
--data /content/my_data.csv \
--outcome ctn0094_relapse_event \
--split RaceEth \
--match "age is_female UDS_Opioid_Count" \
-d /content/results
Using a Custom Outcome
Cell 10 — Bring your own outcome column
If your dataset contains a column not in the built-in outcome list, pass its name with -o and specify the endpoint type with --type.
| --type | When to use |
|---|---|
| logical | Binary outcome (0/1) |
| | Count outcome (0, 1, 2, …) |
| survival | Time-to-event; expects two columns: <outcome>_time and <outcome>_event |
# Binary custom outcome
!python run_pipelineV2.py \
--data /content/my_data.csv \
--outcome my_custom_outcome \
--type logical \
-d /content/results
# Survival custom outcome
# Expects columns: my_event_time and my_event_event
!python run_pipelineV2.py \
--data /content/my_data.csv \
--outcome my_event \
--type survival \
-d /content/results
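Before launching a survival run, it can help to confirm that both expected columns exist in your CSV. The sketch below generalises the naming rule from the my_event example above (<outcome>_time and <outcome>_event); that generalisation and the helper name missing_survival_columns are assumptions.

```python
def missing_survival_columns(columns, outcome):
    """Return the survival columns absent from `columns` for this outcome.

    Assumption: the naming rule is <outcome>_time and <outcome>_event,
    generalised from the my_event example above.
    """
    expected = [f"{outcome}_time", f"{outcome}_event"]
    present = set(columns)
    return [name for name in expected if name not in present]

header = ["age", "is_female", "RaceEth", "my_event_time"]  # toy CSV header
print(missing_survival_columns(header, "my_event"))  # prints: ['my_event_event']
```

With pandas, pd.read_csv(path, nrows=0).columns reads only the header row, so the check stays cheap on large files.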
Viewing the Results
Cell 11 — List output files
import os
for root, dirs, files in os.walk("/content/results"):
    level = root.replace("/content/results", "").count(os.sep)
    indent = "    " * level
    print(f"{indent}{os.path.basename(root)}/")
    for f in files:
        print(f"    {indent}{f}")
results/
ctn0094_relapse_event_seed0/
evaluations/
evaluations_2025-01-15.csv
predictions/
predictions_2025-01-15.csv
logs/
pipeline_2025-01-15.log
Cell 12 — Load and inspect evaluations
import pandas as pd, glob
files = glob.glob("/content/results/**/evaluations/*.csv", recursive=True)
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
df.head()
╔══════════════╦══════════╦══════════╦══════════╦═══════════════════════╗
║ Data Type ║ roc ║ precision║ recall ║ training_demographics║
╠══════════════╬══════════╬══════════╬══════════╬═══════════════════════╣
║ heldout ║ 0.71 ║ 0.68 ║ 0.74 ║ 250 White, 250 Black ║
║ subset ║ 0.69 ║ 0.65 ║ 0.71 ║ 250 White, 250 Black ║
╚══════════════╩══════════╩══════════╩══════════╩═══════════════════════╝
Cell 13 — Plot ROC-AUC across seeds
import matplotlib.pyplot as plt
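# Select rows by the "Data Type" column instead of assuming rows alternate
heldout = df.loc[df["Data Type"] == "heldout", "roc"].values  # held-out rows
subset = df.loc[df["Data Type"] == "subset", "roc"].values  # subset rows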
plt.figure(figsize=(8, 4))
plt.plot(heldout, marker="o", label="Held-out")
plt.plot(subset, marker="s", label="Subset")
plt.axhline(0.5, linestyle="--", color="gray", label="Chance")
plt.xlabel("Seed")
plt.ylabel("ROC-AUC")
plt.title("Model performance across PSM seeds")
plt.legend()
plt.tight_layout()
plt.show()
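The plot above places points in the order in which glob returned the files, which is not guaranteed to be seed order. A small sketch, assuming result folders are named <outcome>_seed<N> as in the Cell 11 listing, recovers the outcome and seed from each path so that rows can be sorted and labelled. The helper outcome_and_seed is hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumption: result folders are named <outcome>_seed<N>, as in the Cell 11 listing.
SEED_DIR = re.compile(r"^(?P<outcome>.+)_seed(?P<seed>\d+)$")

def outcome_and_seed(csv_path):
    """Return (outcome, seed) parsed from any folder in the path, or None."""
    for part in PurePosixPath(csv_path).parts:
        match = SEED_DIR.match(part)
        if match:
            return match.group("outcome"), int(match.group("seed"))
    return None

path = "/content/results/ctn0094_relapse_event_seed0/evaluations/evaluations_2025-01-15.csv"
print(outcome_and_seed(path))  # prints: ('ctn0094_relapse_event', 0)
```

Storing the parsed seed as a column when each CSV is loaded lets you sort by seed and use the real seed values on the x-axis.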
Interpreting Results
The pipeline is designed to detect algorithmic bias — a disparity in model performance when trained on cohorts with different demographic compositions.
| Result | Interpretation |
|---|---|
| ROC-AUC stable across seeds | The outcome is measurement invariant: model performance does not depend on cohort demographics. |
| ROC-AUC varies across seeds | The outcome is measurement variant: model performance is sensitive to demographic composition, which may indicate bias. |
| Held-out AUC >> Subset AUC | Possible over-fitting to the matched cohort; consider increasing |
| No features selected (Lasso error) | Regularisation may be too strong; the alpha defaults are tuned for the CTN-0094 dataset and may need adjustment for other data. |
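Whether ROC-AUC is "stable" or "varies" is easier to judge with numbers. The sketch below summarises the spread of the roc column per Data Type, using the column names from the Cell 12 output. The toy values and the TOLERANCE cut-off are illustrative assumptions, not pipeline defaults.

```python
import pandas as pd

def auc_spread(evals, group_col="Data Type", metric="roc"):
    """Per-group mean, standard deviation, and range of a metric across runs."""
    summary = evals.groupby(group_col)[metric].agg(["mean", "std", "min", "max"])
    summary["range"] = summary["max"] - summary["min"]
    return summary

# Toy evaluations for three seeds (illustrative numbers, not pipeline output).
toy = pd.DataFrame({
    "Data Type": ["heldout", "subset"] * 3,
    "roc": [0.71, 0.69, 0.70, 0.68, 0.72, 0.70],
})
summary = auc_spread(toy)
print(summary.round(3))

TOLERANCE = 0.05  # hypothetical cut-off; choose one suited to your study
print((summary["range"] <= TOLERANCE).to_dict())
```

Applied to the df built in Cell 12, auc_spread(df) gives the same summary for real runs.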
Available Outcomes Reference
| Outcome | Type | Description |
|---|---|---|
| | Logical | Any positive urine drug screen |
| | Logical | Confirmed abstinence (weeks 5–24) |
| | Logical | 13 consecutive negative UDS |
| | Logical | 2 consecutive positive UDS after 4-week treatment |
| | Logical | 3 consecutive positive UDS |
| | Logical | 3 consecutive negative UDS |
| | Integer | Count-based abstinence measure |
| | Survival | Time-to-event abstinence outcome |