Data Overview
This project relies on several key datasets derived from the CTN-0094 database and various transformation stages in the pipeline. Below is a description of each dataset used throughout the pipeline, including purpose and example columns.
Master Dataset
Filename: master_data.csv
Purpose: The foundational dataset containing all raw independent variables and features extracted from CTN-0094.
Example Columns: - age, race, education, pain, depression, ftnd, TLFB_Alcohol_Count
Merged Data
Filename: merged_data.csv
Purpose: A version of the master dataset where the ctn0094_relapse_event binary outcome is appended.
Notes: Used for binary classification models on relapse.
Example Columns: - All columns from master_data.csv + ctn0094_relapse_event
Outcomes Merged Dataset
Filename: outcomes_merged_dataset.csv
Purpose: Contains both predictors and a wide variety of outcome variables used across various modeling tasks.
Example Outcome Columns: - Ab_krupitskyA_2011, Ab_ling_1998, Rs_johnson_1992, Rd_kostenB_1993, AbT_mokri_2016
CTN-0094 Outcomes
Filename: outcomesCTN0094.csv
Purpose: A standalone file that contains all known outcome measures from the CTN-0094 dataset.
Types of Outcomes: Abstinence (Ab), Retention (Rs), Dropout Rate (Rd), among others.
Example Columns: - Ab_ctnNinetyFour_2023, AbT_shufman_1994, Rd_strang_2019, RsE_ctnFiftyOne_2018
All Outcome Selections
Filename: all_outcome_selections.csv
Purpose: Subset of outcomes selected for model testing, spanning multiple outcome categories.
Example Columns: - Ab_ctnNinetyFour_2023, Rd_kostenB_1993, RsT_lee_2016
Master Outcome Selections
Filename: master_outcome_selections.csv
Purpose: Combines all outcome columns with the cleaned features used in modeling.
Example Columns: - Includes demographic and drug usage variables + outcome variables like RsT_ctnFiftyOne_2018
Binary Outcome Selections
Filename: binary_outcome_selections.csv
Purpose: Outcomes that are binary (e.g., yes/no, true/false).
Example Columns: - Ab_krupitskyA_2011, Rd_kostenB_1993, Rs_johnson_1992
All Binary Selected Outcomes
Filename: all_binary_selected_outcomes.csv
Purpose: Subset of the merged dataset that includes only binary outcomes.
Example Columns: - ctn0094_relapse_event, Ab_ling_1998, Rs_krupitsky_2004
Other Outcome Selections
Filename: other_outcome_selections.csv
Purpose: Collection of outcome variables that are count-based or numeric (e.g., session count, time until dropout).
Example Columns: - AbT_mokri_2016, RsT_ctnFiftyOne_2018
Dataset Usage Notes
All datasets use who as a unique identifier for participants.
Some datasets may have NaN values, especially in derived outcomes or less common drug use metrics.
The pipeline joins or filters these datasets at various stages depending on the outcome type and modeling goal.