Getting Started

Prerequisites

Python 3.12+
R (required for Propensity Score Matching via MatchIt)
Git

Installation

LocalGoogle Colab

git clone https://github.com/CTN-0094/Pipeline.git
cd Pipeline
pip install -r requirements.txt

import os
if not os.path.exists("/content/Pipeline"):
    !git clone https://github.com/CTN-0094/Pipeline.git
    os.chdir("/content/Pipeline")

!pip install -r requirements.txt

Running the Pipeline

Basic Usage

python3 run_pipelineV2.py --data <path/to/data.csv> --outcome <outcome_name> -d ./results

Examples

Single outcomeMulti-seed loopSubset of outcomesData only (no training)

python3 run_pipelineV2.py \
  --data data.csv \
  --outcome ctn0094_relapse_event \
  -d ./results

# Loop through seeds 5–10 across all outcomes
python3 run_pipelineV2.py \
  --data data.csv \
  -l 5 10 \
  -d ./results

python3 run_pipelineV2.py \
  --data data.csv \
  -l 5 10 \
  -o ctn0094_relapse_event Ab_ling_1998 \
  -d ./results

# Run preprocessing + PSM only, skip model training
python3 run_pipelineV2.py \
  --data data.csv \
  --outcome ctn0094_relapse_event \
  --data_only \
  -d ./results

CLI Reference

Argument	Default	Description
`--data`	(required)	Path to cleaned input dataset (CSV)
`-o / --outcome`	all	Outcome(s) to run
`-l / --loop`	—	Min and max seed for multi-seed runs
`-d / --dir`	`""`	Output directory for results
`--type`	—	Endpoint type for custom outcomes: `logical`, `integer`, `survival`
`--majority`	`1`	Value representing the majority group in the PSM column
`--split`	`RaceEth`	Column used to define majority/minority groups
`--match`	`age is_female`	Columns to match on during PSM
`--group_size`	`500`	Size of each PSM group
`--heldout_size`	`100`	Size of the held-out evaluation set
`--heldout_set_percent_majority`	`58`	Percent majority in the held-out set
`--data_only`	`False`	Skip model training, save preprocessed data only
`-p / --prof`	—	Profiling mode: `simple` or `complex`

Viewing Results

After a run, your output directory contains:

results/
├── logs/                     # Execution logs per run
├── heldout_predictions/      # Held-out set predictions (CSV)
├── subset_predictions/       # Subset predictions by ratio (CSV)
├── heldout_evaluations/      # Metrics on the held-out set (CSV)
├── subset_evaluations/       # Metrics per demographic ratio (CSV)
└── experiments/              # Experiment metadata (JSON)

Reading the evaluation CSVs

Each row in an evaluation CSV corresponds to a different majority/minority demographic ratio. The rightmost columns are the evaluation metrics. Compare rows to see how model performance changes as the training cohort composition shifts.

R Package Installation

When you run PSM for the first time, you will be prompted to install R packages:

Would you like to install MatchIt?
1: Yes
2: No

Selection: 1

Select 1 for both MatchIt and optmatch. This only happens once.