How It Works

The DAB pipeline runs five sequential steps for each outcome and seed combination.


Pipeline Overview

                Input CSV
                   │
┌─────────────────────────────────────┐
│  Step 1 · Validation & Preprocessing│
│                                     │
│  • Schema and column type checks    │
│  • Binary encoding                  │
│  • TLFB feature engineering         │
└──────────────────┬──────────────────┘
                   │
┌─────────────────────────────────────┐
│  Step 2 · Propensity Score Matching │
│                                     │
│  • Majority/minority group split    │
│  • Balanced cohort construction     │
│  • Stratified held-out set          │
└──────────────────┬──────────────────┘
                   │
┌─────────────────────────────────────┐
│  Step 3 · Feature Selection         │
│                                     │
│  • L1 (Lasso) regularization        │
│  • Removes zero-coefficient features│
└──────────────────┬──────────────────┘
                   │
┌─────────────────────────────────────┐
│  Step 4 · Model Training            │
│                                     │
│  • Auto-selected by endpoint type   │
│  • Trained on each PSM subset       │
└──────────────────┬──────────────────┘
                   │
┌─────────────────────────────────────┐
│  Step 5 · Evaluation                │
│                                     │
│  • Subset internal test split       │
│  • Held-out set evaluation          │
│  • Demographic breakdown logged     │
└──────────────────┬──────────────────┘
                   │
                   ▼
        Repeat across seeds & outcomes
            Results Directory

Step 1 — Validation & Preprocessing

The pipeline validates the input CSV against expected schemas before any modeling occurs. It checks:

  • Required columns are present
  • Target column values match the endpoint type (binary 0/1, non-negative integer, positive duration)
  • No duplicate patient IDs

Feature engineering includes binary encoding of categorical variables and construction of Timeline Followback (TLFB) features from weekly urine drug screening data.
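The Step 1 checks can be sketched in a few lines of pandas. This is a minimal illustration, not the pipeline's actual code: the column names (`patient_id`, `age`, `sex`), the function name, and the `endpoint` labels are assumptions based on the schema described above.

```python
import pandas as pd

def validate_input(df: pd.DataFrame, target: str, endpoint: str) -> None:
    """Illustrative schema checks; column names are hypothetical."""
    # Required columns must be present before anything else runs.
    required = {"patient_id", "age", "sex", target}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")

    # No duplicate patient IDs.
    if df["patient_id"].duplicated().any():
        raise ValueError("duplicate patient IDs found")

    # Target values must match the endpoint type.
    y = df[target]
    if endpoint == "logical" and not y.isin([0, 1]).all():
        raise ValueError("binary target must contain only 0/1")
    if endpoint == "integer" and ((y < 0) | (y % 1 != 0)).any():
        raise ValueError("count target must be non-negative integers")
    if endpoint == "survival" and (y <= 0).any():
        raise ValueError("survival target must be positive durations")
```

Failing fast here means later steps never see malformed targets or duplicated subjects.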


Step 2 — Propensity Score Matching (PSM)

PSM constructs a series of training cohorts with varying majority/minority demographic ratios. The pipeline uses R's MatchIt package (via rpy2) to perform optimal matching on age and sex.

Eleven cohorts are constructed, ranging from 100% majority to 100% minority composition, in 10% increments. This allows the evaluation step to measure how model performance shifts as demographic composition changes.

A stratified held-out evaluation set is constructed separately with a fixed majority/minority ratio (default: 58/42) to reflect the real-world distribution of the dataset.
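The cohort sweep can be sketched as follows. This is a simplification: the real pipeline performs optimal matching on age and sex via R's MatchIt, while this sketch stands in random sampling for the matching step; `group_col`, the cohort size `n`, and the function name are hypothetical.

```python
import pandas as pd

def build_cohorts(df: pd.DataFrame, group_col: str = "minority",
                  n: int = 100, seed: int = 0) -> dict:
    """Build 11 cohorts from 0% to 100% minority in 10% steps.
    Random sampling stands in for MatchIt's optimal matching."""
    majority = df[df[group_col] == 0]
    minority = df[df[group_col] == 1]
    cohorts = {}
    for pct in range(0, 101, 10):
        k_min = round(n * pct / 100)
        cohorts[pct] = pd.concat([
            majority.sample(n - k_min, random_state=seed, replace=True),
            minority.sample(k_min, random_state=seed, replace=True),
        ])
    return cohorts
```

Keying each cohort by its minority percentage makes the later performance-vs-composition comparison a simple lookup.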


Step 3 — Feature Selection

L1 (Lasso) regularization is applied to automatically select predictive features before training. Features with zero coefficients after regularization are dropped. This reduces dimensionality and prevents overfitting on small cohorts.

Note

If Lasso drops all features (over-regularization), the pipeline raises an error rather than silently training on no signal.
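For a binary endpoint, the selection step can be sketched with scikit-learn's L1-penalized logistic regression. The function name and the `C` value are illustrative; the pipeline's actual regularization strength may differ.

```python
from sklearn.linear_model import LogisticRegression

def select_features(X, y, names, C=1.0):
    """Keep only features with nonzero L1 coefficients; error out
    if over-regularization drops everything."""
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X, y)
    keep = [name for name, coef in zip(names, model.coef_[0]) if coef != 0.0]
    if not keep:
        raise RuntimeError("Lasso dropped all features (over-regularization)")
    return keep
```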


Step 4 — Model Training

The model class is automatically selected based on the endpoint type of the chosen outcome:

  Endpoint Type             Model                         Library
  logical — binary          Logistic Regression (L1)      scikit-learn
  integer — count           Negative Binomial Regression  statsmodels
  survival — time-to-event  Cox Proportional Hazards      lifelines

Each model is trained independently on each PSM cohort.
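The endpoint-to-model dispatch can be sketched as a small registry. The module paths below point at real classes in each library, but the registry and `get_model_class` are hypothetical names, not the pipeline's API.

```python
import importlib

# Maps endpoint type -> (module path, class name) in the library named above.
MODEL_REGISTRY = {
    "logical": ("sklearn.linear_model", "LogisticRegression"),
    "integer": ("statsmodels.discrete.discrete_model", "NegativeBinomial"),
    "survival": ("lifelines", "CoxPHFitter"),
}

def get_model_class(endpoint: str):
    """Resolve the model class for an endpoint type, importing lazily."""
    try:
        module, cls = MODEL_REGISTRY[endpoint]
    except KeyError:
        raise ValueError(f"unknown endpoint type: {endpoint}")
    return getattr(importlib.import_module(module), cls)
```

Lazy importing means a run that only needs one endpoint type does not require the other two libraries to be installed.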


Step 5 — Evaluation

Each trained model is evaluated on two sets:

  • Internal test split — 25% of the PSM cohort held out during training
  • Held-out set — the fixed held-out set constructed in Step 2

Both evaluations record the demographic makeup of the training cohort alongside the metrics.

Metrics by Endpoint Type

Binary (logical):

  Metric            Description
  ROC-AUC           Area under the receiver operating characteristic curve
  Precision         True positives / (true positives + false positives)
  Recall            True positives / (true positives + false negatives)
  Confusion Matrix  Full 2×2 breakdown

Count (integer):

  Metric       Description
  MSE          Mean squared error
  RMSE         Root mean squared error
  MAE          Mean absolute error
  Pearson r    Linear correlation between predicted and actual
  McFadden R²  Goodness-of-fit relative to the null model

Survival (time-to-event):

  Metric             Description
  Concordance Index  C-statistic: probability the model ranks a random pair correctly
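For the binary endpoint, the dual evaluation can be sketched as below. `evaluate_binary` and the `splits` layout are illustrative; the real pipeline computes the analogous metrics for count and survival endpoints as well.

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

def evaluate_binary(model, splits, cohort_demographics):
    """Score a fitted classifier on each named evaluation set,
    recording the training cohort's demographic makeup alongside
    the metrics. `splits` maps a set name to an (X, y) pair."""
    results = {}
    for name, (X, y) in splits.items():
        pred = model.predict(X)
        prob = model.predict_proba(X)[:, 1]
        results[name] = {
            "roc_auc": roc_auc_score(y, prob),
            "precision": precision_score(y, pred),
            "recall": recall_score(y, pred),
            "confusion": confusion_matrix(y, pred).tolist(),
            "train_demographics": cohort_demographics,
        }
    return results
```

Attaching the cohort demographics to every metric record is what makes the per-ratio comparison in the next section possible.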

Interpreting Results

The key question is: does model performance change as the demographic composition of the training cohort changes?

  • No change across ratios → the outcome measure is measurement invariant — the model generalizes across groups equally
  • Performance drops as minority proportion increases → the outcome is measurement variant — the model has learned patterns specific to the majority group

This framing is based on the measurement invariance framework from Odom et al. (2025).
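One simple way to quantify the trend, offered as a sketch rather than anything the pipeline prescribes, is the least-squares slope of a metric against minority proportion:

```python
import numpy as np

def invariance_slope(ratios, scores):
    """Least-squares slope of performance vs. minority proportion.
    A slope near zero is consistent with measurement invariance;
    a strongly negative slope suggests majority-specific patterns.
    The cutoff for 'near zero' is a judgment call."""
    slope, _intercept = np.polyfit(ratios, scores, 1)
    return slope
```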