Input Format
The pipeline accepts a single cleaned CSV passed via --data. This page documents every column the preprocessing and PSM stages expect.
Minimum Required Columns
These columns must always be present regardless of outcome type.
| Column | Type | Description |
|---|---|---|
| who | integer | Unique patient identifier; hardcoded as the ID column throughout the pipeline |
| RaceEth | string | Demographic group label used for PSM splitting (see RaceEth Values below) |
| age | integer | Patient age in years; used as a PSM matching covariate by default |
| Sex | string | "male" or "female"; renamed to is_female (0/1) during preprocessing |
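A pre-flight check for the required columns can be sketched as follows. This is a minimal illustration using only the standard library, not the pipeline's own validation; the helper name `missing_required_columns` and the sample data are hypothetical.

```python
import csv
import io

# Required columns as listed in the table above.
REQUIRED_COLUMNS = ("who", "RaceEth", "age", "Sex")


def missing_required_columns(csv_text):
    """Return the required columns that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return [col for col in REQUIRED_COLUMNS if col not in present]


# Hypothetical sample that lacks the Sex column.
sample = "who,RaceEth,age\n1,NHW,34\n2,NHB,41\n"
print(missing_required_columns(sample))
```

Running a check like this before invoking the pipeline surfaces a missing column early rather than partway through preprocessing.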
Outcome Columns
At least one outcome column is required. Its type must match the --type argument or the built-in outcome configuration.
Binary outcome:
| Column | Type | Valid values |
|---|---|---|
| <outcome_name> | integer | 0 or 1 only |
Count outcome:
| Column | Type | Valid values |
|---|---|---|
| <outcome_name> | integer | Non-negative whole numbers |
Time-to-event outcome: two columns are required. By convention the pipeline expects columnsToUse = [time_col, event_col].
| Column | Type | Valid values |
|---|---|---|
| <outcome_name>_time | float | Positive duration (days, weeks, etc.) |
| <outcome_name>_event | integer | 0 (censored) or 1 (event observed) |
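The value constraints in the outcome tables can be expressed as small predicate functions. This is a sketch under the stated constraints only; the function names are hypothetical and not part of the pipeline.

```python
def _is_whole_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def valid_binary(value):
    """Binary outcome: 0 or 1 only."""
    return _is_whole_number(value) and value in (0, 1)


def valid_count(value):
    """Count outcome: non-negative whole numbers."""
    return _is_whole_number(value) and value >= 0


def valid_survival(time, event):
    """Time-to-event outcome: positive duration plus a 0/1 event flag."""
    is_number = isinstance(time, (int, float)) and not isinstance(time, bool)
    # NaN fails the `> 0` comparison, so missing durations are rejected.
    return is_number and time > 0 and valid_binary(event)


print(valid_binary(1), valid_count(-1), valid_survival(12.5, 0))
```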
Demographic & Socioeconomic Columns
Used in feature engineering and model training. Missing columns are silently skipped by the preprocessor.
| Column | Type | Valid values | Notes |
|---|---|---|---|
| education | string | "Less than HS", "HS/GED", "More than HS" | Encoded 1–3; NaN → 0 |
| marital | string | "Never married", "Married or Partnered", "Separated/Divorced/Widowed" | Encoded 2–4; NaN → 1 |
| job | string | "Full Time", "Part Time", or any other value | Anything other than Full/Part Time → unemployed = 1 |
| is_living_stable | integer | 0 or 1 | Inverted to unstableliving (1 = unstable) |
| pain | string | "No Pain", "Severe Pain", "Very mild to Moderate Pain", "Missing" | Encoded to binary (0 = no pain) |
| race | string | "White", "Black", "Other", "Refused/missing" | Encoded 1–3; dropped after encoding |
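Some of the encodings in the table can be illustrated with plain Python. The table gives only the code ranges, so assigning codes in the order the labels are listed is an assumption, as is treating an unrecognised label like a missing value. The function names are hypothetical, and pain and race are omitted because their exact mappings are not specified here.

```python
# Assumed label-to-code maps: codes follow the order the labels are listed.
EDUCATION_CODES = {"Less than HS": 1, "HS/GED": 2, "More than HS": 3}
MARITAL_CODES = {
    "Never married": 2,
    "Married or Partnered": 3,
    "Separated/Divorced/Widowed": 4,
}


def encode_education(value):
    # Missing (None here) maps to 0; unknown labels fall back to 0 as well (assumption).
    return EDUCATION_CODES.get(value, 0)


def encode_marital(value):
    # Missing (None here) maps to 1; unknown labels fall back to 1 as well (assumption).
    return MARITAL_CODES.get(value, 1)


def encode_unemployed(job):
    # Anything other than Full Time or Part Time counts as unemployed.
    return 0 if job in ("Full Time", "Part Time") else 1


def encode_unstableliving(is_living_stable):
    # Invert the 0/1 flag so that 1 means unstable.
    return 1 - is_living_stable


print(encode_education("HS/GED"), encode_marital(None),
      encode_unemployed("Student"), encode_unstableliving(1))
```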
Treatment Column
| Column | Type | Valid values |
|---|---|---|
| XTRT | string | "CTN30BUP", "CTN51BUP", "CTN51NTX", "CTN27BUP", "CTN27MET" |
Encoded as integers 1–5. Rows with unrecognised values are set to -1.
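The treatment encoding can be sketched as a dictionary lookup with a fallback of -1. Which label receives which integer is not stated above; numbering the labels in the order they are listed is an assumption, and `encode_xtrt` is a hypothetical name.

```python
# Labels in the order they appear in the table; the 1-5 assignment is assumed.
XTRT_LABELS = ["CTN30BUP", "CTN51BUP", "CTN51NTX", "CTN27BUP", "CTN27MET"]
XTRT_CODES = {label: code for code, label in enumerate(XTRT_LABELS, start=1)}


def encode_xtrt(value):
    """Map a treatment label to its integer code, or -1 if unrecognised."""
    return XTRT_CODES.get(value, -1)


print([encode_xtrt(v) for v in ("CTN30BUP", "CTN27MET", "PLACEBO")])
```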
TLFB Columns
Timeline Follow-Back (TLFB) columns record the number of days each substance was used in a recall period.
The nine columns below are kept individually. Any other TLFB_* column present in the data is summed into a single TLFB_Other feature.
| Column | Substance |
|---|---|
| TLFB_Alcohol_Count | Alcohol |
| TLFB_Amphetamine_Count | Amphetamine |
| TLFB_Cocaine_Count | Cocaine |
| TLFB_Heroin_Count | Heroin |
| TLFB_Benzodiazepine_Count | Benzodiazepine |
| TLFB_Opioid_Count | Opioid (non-heroin) |
| TLFB_THC_Count | Cannabis |
| TLFB_Methadone_Count | Methadone |
| TLFB_Buprenorphine_Count | Buprenorphine |
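The rule for TLFB columns (keep the nine listed, sum the rest into TLFB_Other) can be sketched on a single row represented as a dict. The extra column names in the example are hypothetical, and treating a missing value as 0 in the sum is an assumption.

```python
KEPT_TLFB = {
    "TLFB_Alcohol_Count", "TLFB_Amphetamine_Count", "TLFB_Cocaine_Count",
    "TLFB_Heroin_Count", "TLFB_Benzodiazepine_Count", "TLFB_Opioid_Count",
    "TLFB_THC_Count", "TLFB_Methadone_Count", "TLFB_Buprenorphine_Count",
}


def collapse_tlfb(row):
    """Keep the nine named TLFB columns and sum all other TLFB_* columns."""
    out = {}
    other = 0
    for name, value in row.items():
        if name.startswith("TLFB_") and name not in KEPT_TLFB:
            other += value or 0  # None counts as 0 (assumption)
        else:
            out[name] = value
    out["TLFB_Other"] = other
    return out


# TLFB_Sedative_Count and TLFB_Inhalant_Count are made-up column names.
row = {"who": 1, "TLFB_Alcohol_Count": 3,
       "TLFB_Sedative_Count": 2, "TLFB_Inhalant_Count": 4}
print(collapse_tlfb(row))
```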
UDS Columns
Urine Drug Screening (UDS) count columns are binarised during preprocessing (any count > 0 → 1). Any column whose name starts with UDS_ is processed this way.
Note: UDS_Alcohol_Count and UDS_Mdma/Hallucinogen_Count are dropped before binarisation if present.
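The UDS handling described above (drop two columns, then turn every remaining UDS_ count into a 0/1 flag) might look like this on a single row. The example column names other than the two dropped ones are hypothetical, and treating a missing count as 0 is an assumption.

```python
UDS_DROPPED = {"UDS_Alcohol_Count", "UDS_Mdma/Hallucinogen_Count"}


def binarise_uds(row):
    """Drop the excluded UDS columns, then map any count > 0 to 1."""
    out = {}
    for name, value in row.items():
        if name in UDS_DROPPED:
            continue  # removed before binarisation
        if name.startswith("UDS_"):
            out[name] = 1 if (value or 0) > 0 else 0  # None counts as 0 (assumption)
        else:
            out[name] = value
    return out


# UDS_Opioid_Count and UDS_Cocaine_Count are illustrative column names.
row = {"UDS_Opioid_Count": 3, "UDS_Cocaine_Count": 0,
       "UDS_Alcohol_Count": 2, "age": 30}
print(binarise_uds(row))
```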
Behavioural Columns
These two columns are used to compute derived binary features and are then dropped.
| Column | Type | Used to compute |
|---|---|---|
| msm_npt | float | Homosexual_Behavior: 1 if msm_npt > 0 and Sex == "male" |
| txx_prt | float | Non_monogamous_Relationships: 1 if txx_prt > 1 |
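The two derived features can be sketched directly from the conditions in the table. The function name is hypothetical, and treating a missing source value as 0 is an assumption.

```python
def derive_behavioural(row):
    """Compute the two derived flags, then drop the source columns."""
    out = dict(row)
    msm = out.pop("msm_npt", None)
    txx = out.pop("txx_prt", None)
    # A missing source value yields 0 here (assumption).
    out["Homosexual_Behavior"] = int(
        msm is not None and msm > 0 and row.get("Sex") == "male"
    )
    out["Non_monogamous_Relationships"] = int(txx is not None and txx > 1)
    return out


row = {"who": 7, "Sex": "male", "msm_npt": 2.0, "txx_prt": 1.0}
print(derive_behavioural(row))
```

Note that the second condition is strictly greater than 1, so a value of exactly 1 yields 0.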
Other Columns
| Column | Type | Notes |
|---|---|---|
| heroin_inject_days | float | Converted to binary rbsivheroin (1 if non-null, 0 if null) |
| ftnd | float | Fagerström score; NaN filled with 0, otherwise used as-is |
| is_hispanic | any | Dropped unconditionally before PSM |
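These three rules can be illustrated on a single row. This is a sketch, not the pipeline's code: the function names are hypothetical, and removing heroin_inject_days once rbsivheroin has been derived is an assumption.

```python
import math


def _is_missing(value):
    # Treat both None and float NaN as missing.
    return value is None or (isinstance(value, float) and math.isnan(value))


def encode_other(row):
    out = dict(row)
    # Any non-null value, including 0 days, maps to 1.
    days = out.pop("heroin_inject_days", None)  # source column removed (assumption)
    out["rbsivheroin"] = 0 if _is_missing(days) else 1
    # Fill a missing Fagerstrom score with 0; leave present scores untouched.
    if "ftnd" in out and _is_missing(out["ftnd"]):
        out["ftnd"] = 0
    # is_hispanic is dropped unconditionally.
    out.pop("is_hispanic", None)
    return out


row = {"heroin_inject_days": 0.0, "ftnd": float("nan"), "is_hispanic": 1}
print(encode_other(row))
```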
Columns Dropped During Preprocessing
The following columns are removed if present. The pipeline does not raise an error if any of them are absent.
pain_when, is_smoker, per_day, max, amount, depression, anxiety, schizophrenia,
cocaine_inject_days, speedball_inject_days, opioid_inject_days, speed_inject_days,
UDS_Alcohol_Count, UDS_Mdma/Hallucinogen_Count, msm_frq, msm_npt, txx_prt,
rbs_iv_days, race, RBS_cocaine_Days, RBS_heroin_Days, RBS_opioid_Days,
RBS_speed_Days, RBS_speedball_Days
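Dropping columns only when they exist can be sketched with a dict comprehension, which never fails on an absent key. The function name is hypothetical; in pandas the comparable call would be `DataFrame.drop(columns=..., errors="ignore")`.

```python
# Column names copied from the list above.
DROPPED_COLUMNS = {
    "pain_when", "is_smoker", "per_day", "max", "amount", "depression",
    "anxiety", "schizophrenia", "cocaine_inject_days", "speedball_inject_days",
    "opioid_inject_days", "speed_inject_days", "UDS_Alcohol_Count",
    "UDS_Mdma/Hallucinogen_Count", "msm_frq", "msm_npt", "txx_prt",
    "rbs_iv_days", "race", "RBS_cocaine_Days", "RBS_heroin_Days",
    "RBS_opioid_Days", "RBS_speed_Days", "RBS_speedball_Days",
}


def drop_columns(row):
    """Remove listed columns that are present; absent ones are ignored."""
    return {name: value for name, value in row.items()
            if name not in DROPPED_COLUMNS}


row = {"who": 3, "age": 40, "depression": 1, "race": "White"}
print(drop_columns(row))
```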
RaceEth Values
RaceEth is the column used for PSM splitting. It must contain string labels that map to the following numeric codes during preprocessing:
| Label | Code | Group |
|---|---|---|
| "NHW" | 1 | Non-Hispanic White (default majority group) |
| "NHB" | 2 | Non-Hispanic Black |
| "Hisp" | 3 | Hispanic |
| "Other" | 4 | Other |
| "Refused/Missing" | 0 | Refused or missing |
The default --majority 1 targets NHW as the majority group. To use a different majority, pass --majority <code> with the numeric code from the table above.
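The label-to-code mapping in the table, and how a --majority code selects a group, can be sketched as follows. The mapping is taken from the table; the helper names are hypothetical, and raising on an unlisted label is an illustrative choice rather than documented pipeline behaviour.

```python
RACEETH_CODES = {
    "NHW": 1,
    "NHB": 2,
    "Hisp": 3,
    "Other": 4,
    "Refused/Missing": 0,
}


def encode_raceeth(label):
    """Return the numeric code for a RaceEth label (KeyError if unlisted)."""
    return RACEETH_CODES[label]


def is_majority(label, majority=1):
    """True if the label belongs to the group selected by the majority code."""
    return encode_raceeth(label) == majority


print(encode_raceeth("Hisp"), is_majority("NHW"), is_majority("NHB", majority=2))
```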
Using Custom Data
The pipeline is not restricted to CTN-0094 data. To run it on your own dataset:
- Ensure your CSV has at minimum who, RaceEth, age, and Sex columns.
- Add at least one outcome column.
- Pass --type if your outcome is not one of the pre-defined outcomes.
python3 run_pipelineV2.py \
--data my_data.csv \
--outcome my_outcome \
--type logical \
--split RaceEth \
--match "age is_female" \
-d ./results
Minimal example
For a working minimal CSV, see the smoke test data generator used during development, which constructs a 500-row synthetic dataset with all required columns.
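For readers without access to that generator, a stand-in can be sketched with the standard library. This is not the project's generator: the function name, the outcome column my_outcome, the age range, and the uniform sampling are all assumptions made for illustration.

```python
import csv
import io
import random


def make_minimal_csv(n_rows=500, seed=0):
    """Build a synthetic CSV with the four required columns and a binary outcome."""
    rng = random.Random(seed)  # seeded for reproducibility
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["who", "RaceEth", "age", "Sex", "my_outcome"])
    for who in range(1, n_rows + 1):
        writer.writerow([
            who,                                          # unique integer ID
            rng.choice(["NHW", "NHB", "Hisp", "Other"]),  # RaceEth label
            rng.randint(18, 65),                          # age range is assumed
            rng.choice(["male", "female"]),
            rng.randint(0, 1),                            # binary outcome
        ])
    return buffer.getvalue()


text = make_minimal_csv()
rows = list(csv.DictReader(io.StringIO(text)))
print(len(rows), sorted(rows[0].keys()))
```

Writing `text` to a file such as my_data.csv gives an input that matches the column requirements on this page.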