Using the Pipeline: A Step-by-Step Walkthrough =============================================== This page walks through a full end-to-end run of the CTN-0094 Pipeline — from setup to interpreting results — in a Colab notebook style. You can follow along locally or click the badge below to run it live in Google Colab. .. image:: https://colab.research.google.com/assets/colab-badge.svg :target: https://colab.research.google.com/drive/1Agj-EXE9WhLjMulA1mIVzW4YWNN7fiqX :alt: Open In Colab ---- Setup ----- **Cell 1 — Clone the repository** .. code-block:: python import os if not os.path.exists("/content/Pipeline"): !git clone https://github.com/CTN-0094/Pipeline.git os.chdir("/content/Pipeline") .. code-block:: none :class: cell-output Cloning into 'Pipeline'... remote: Enumerating objects: 412, done. remote: Counting objects: 100% (412/412), done. Resolving deltas: 100% (231/231), done. ---- **Cell 2 — Install dependencies** This installs all Python packages required by the pipeline. .. code-block:: python !pip install -r requirements.txt -q .. code-block:: none :class: cell-output Successfully installed lifelines-0.29.0 psmpy-0.3.13 statsmodels-0.14.1 ... ---- **Cell 3 — Verify the install** .. code-block:: python !python -c "from src.train_model import LogisticModel; print('Import OK')" .. code-block:: none :class: cell-output Import OK ---- Running the Pipeline -------------------- **Cell 4 — Run with a single outcome** The minimum required argument is ``--data``, pointing to your cleaned CSV file, and ``--outcome`` specifying which endpoint to model. .. code-block:: python !python run_pipelineV2.py \ --data /content/my_data.csv \ --outcome ctn0094_relapse_event \ -d /content/results .. code-block:: none :class: cell-output Processing subset 1 ____________________________________________________________ Lasso feature selection completed. Selected 8 out of 24 features. Features are: ['age', 'is_female', 'RaceEth', 'UDS_Opioid_Count', ...] Processing subset 2 ____________________________________________________________ Lasso feature selection completed. Selected 6 out of 24 features. ... ________________________________________________________________________ Elapsed time: 12.43 seconds ---- **Cell 5 — Run across multiple seeds** Use ``-l `` to loop through a range of random seeds. Each seed produces a different PSM cohort, letting you measure stability across sampling variation. .. code-block:: python !python run_pipelineV2.py \ --data /content/my_data.csv \ --outcome ctn0094_relapse_event \ -l 1 10 \ -d /content/results .. note:: This will run the full pipeline 9 times (seeds 1 through 9) and save a separate results folder for each seed under ``/content/results``. ---- **Cell 6 — Run multiple outcomes at once** Combine ``-l`` and ``-o`` to sweep over several outcomes and several seeds in one call. .. code-block:: python !python run_pipelineV2.py \ --data /content/my_data.csv \ -l 0 5 \ -o ctn0094_relapse_event Ab_ling_1998 Rs_johnson_1992 \ -d /content/results .. code-block:: none :class: cell-output Running outcome: ctn0094_relapse_event | seed 0 Running outcome: ctn0094_relapse_event | seed 1 ... Running outcome: Rs_johnson_1992 | seed 4 ---- **Cell 7 — Preprocess and match only (no model training)** Use ``--data_only`` to stop after propensity score matching and save the ML-ready datasets without training any models. Useful for inspecting cohort balance before committing to a full run. .. code-block:: python !python run_pipelineV2.py \ --data /content/my_data.csv \ --outcome ctn0094_relapse_event \ --data_only \ -d /content/psm_output .. code-block:: none :class: cell-output Preprocessing complete. PSM subsets saved to /content/psm_output/ ---- Customising the Cohort ---------------------- **Cell 8 — Change cohort size and held-out composition** The pipeline defaults to groups of 500 matched participants and a held-out set of 100 (58% majority). You can override both. .. code-block:: python !python run_pipelineV2.py \ --data /content/my_data.csv \ --outcome ctn0094_relapse_event \ --group_size 300 \ --heldout_size 80 \ --heldout_set_percent_majority 50 \ -d /content/results ---- **Cell 9 — Change the matching column or matching covariates** By default the pipeline splits on ``RaceEth`` and matches on ``age`` and ``is_female``. Change these with ``--split`` and ``--match``. .. code-block:: python !python run_pipelineV2.py \ --data /content/my_data.csv \ --outcome ctn0094_relapse_event \ --split RaceEth \ --match "age is_female UDS_Opioid_Count" \ -d /content/results ---- Using a Custom Outcome ---------------------- **Cell 10 — Bring your own outcome column** If your dataset contains a column not in the built-in outcome list, pass its name with ``-o`` and specify the endpoint type with ``--type``. .. list-table:: :header-rows: 1 :widths: 20 80 * - ``--type`` - When to use * - ``logical`` - Binary outcome (0/1) * - ``integer`` - Count outcome (0, 1, 2, …) * - ``survival`` - Time-to-event; expects two columns: ``_time`` and ``_event`` .. code-block:: python # Binary custom outcome !python run_pipelineV2.py \ --data /content/my_data.csv \ --outcome my_custom_outcome \ --type logical \ -d /content/results .. code-block:: python # Survival custom outcome # Expects columns: my_event_time and my_event_event !python run_pipelineV2.py \ --data /content/my_data.csv \ --outcome my_event \ --type survival \ -d /content/results ---- Viewing the Results ------------------- **Cell 11 — List output files** .. code-block:: python import os for root, dirs, files in os.walk("/content/results"): level = root.replace("/content/results", "").count(os.sep) indent = " " * level print(f"{indent}{os.path.basename(root)}/") for f in files: print(f" {indent}{f}") .. code-block:: none :class: cell-output results/ ctn0094_relapse_event_seed0/ evaluations/ evaluations_2025-01-15.csv predictions/ predictions_2025-01-15.csv logs/ pipeline_2025-01-15.log ---- **Cell 12 — Load and inspect evaluations** .. code-block:: python import pandas as pd, glob files = glob.glob("/content/results/**/evaluations/*.csv", recursive=True) df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True) df.head() .. code-block:: none :class: cell-output ╔══════════════╦══════════╦══════════╦══════════╦═══════════════════════╗ ║ Data Type ║ roc ║ precision║ recall ║ training_demographics║ ╠══════════════╬══════════╬══════════╬══════════╬═══════════════════════╣ ║ heldout ║ 0.71 ║ 0.68 ║ 0.74 ║ 250 White, 250 Black ║ ║ subset ║ 0.69 ║ 0.65 ║ 0.71 ║ 250 White, 250 Black ║ ╚══════════════╩══════════╩══════════╩══════════╩═══════════════════════╝ ---- **Cell 13 — Plot ROC-AUC across seeds** .. code-block:: python import matplotlib.pyplot as plt heldout = df[df.index % 2 == 0]["roc"].values # heldout rows subset = df[df.index % 2 == 1]["roc"].values # subset rows plt.figure(figsize=(8, 4)) plt.plot(heldout, marker="o", label="Held-out") plt.plot(subset, marker="s", label="Subset") plt.axhline(0.5, linestyle="--", color="gray", label="Chance") plt.xlabel("Seed") plt.ylabel("ROC-AUC") plt.title("Model performance across PSM seeds") plt.legend() plt.tight_layout() plt.show() ---- Interpreting Results -------------------- The pipeline is designed to detect **algorithmic bias** — a disparity in model performance when trained on cohorts with different demographic compositions. .. list-table:: :header-rows: 1 :widths: 10 90 * - Result - Interpretation * - ROC-AUC stable across seeds - The outcome is **measurement invariant** — model performance does not depend on cohort demographics. * - ROC-AUC varies across seeds - The outcome is **measurement variant** — model performance is sensitive to demographic composition, which may indicate bias. * - Held-out AUC >> Subset AUC - Possible over-fitting to the matched cohort; consider increasing ``--group_size`` or reducing features. * - No features selected (Lasso error) - Regularisation may be too strong; the alpha defaults are tuned for the CTN-0094 dataset and may need adjustment for other data. ---- Available Outcomes Reference ----------------------------- .. list-table:: :header-rows: 1 :widths: 40 20 40 * - Outcome - Type - Description * - ``ctn0094_relapse_event`` - Logical - Any positive urine drug screen * - ``Ab_krupitskyA_2011`` - Logical - Confirmed abstinence (weeks 5–24) * - ``Ab_ling_1998`` - Logical - 13 consecutive negative UDS * - ``Rs_johnson_1992`` - Logical - 2 consecutive positive UDS after 4-week treatment * - ``Rs_krupitsky_2004`` - Logical - 3 consecutive positive UDS * - ``Rd_kostenB_1993`` - Logical - 3 consecutive negative UDS * - ``Ab_schottenfeldB_2008`` - Integer - Count-based abstinence measure * - ``Ab_mokri_2016`` - Survival - Time-to-event abstinence outcome