Data Overview

This project relies on several datasets derived from the CTN-0094 database and from transformation stages in the pipeline. Each dataset is described below with its filename, purpose, and example columns.

Master Dataset

Filename: master_data.csv

  • Purpose: The foundational dataset containing all raw independent variables and features extracted from CTN-0094.

  • Example Columns: age, race, education, pain, depression, ftnd, TLFB_Alcohol_Count

Merged Data

Filename: merged_data.csv

  • Purpose: The master dataset with the binary ctn0094_relapse_event outcome appended as an additional column.

  • Notes: Used for binary classification models on relapse.

  • Example Columns: All columns from master_data.csv, plus ctn0094_relapse_event
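
As a minimal sketch of how an outcome column could be appended to the master data, the snippet below left-joins a one-column outcome table onto a feature table using the who identifier. The toy values, the helper name append_outcome, and the choice of a left join are assumptions for illustration; the pipeline's actual merge step may differ.

```python
import pandas as pd

# Toy stand-ins for master_data.csv and a relapse outcome table.
# All values here are made up for illustration.
master = pd.DataFrame({"who": [1, 2, 3], "age": [34, 51, 27]})
relapse = pd.DataFrame({"who": [1, 2, 3], "ctn0094_relapse_event": [1, 0, 1]})


def append_outcome(features, outcome, key="who"):
    """Left-join an outcome table onto the features, one row per participant.

    validate="one_to_one" makes pandas raise if `key` is duplicated
    on either side, which guards against silently multiplying rows.
    """
    return features.merge(outcome, on=key, how="left", validate="one_to_one")


merged = append_outcome(master, relapse)
print(merged.columns.tolist())
```

A left join keeps every participant from the feature table, so participants without a recorded outcome would appear with NaN rather than being dropped.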

Outcomes Merged Dataset

Filename: outcomes_merged_dataset.csv

  • Purpose: Contains both predictors and a wide variety of outcome variables used across various modeling tasks.

  • Example Outcome Columns: Ab_krupitskyA_2011, Ab_ling_1998, Rs_johnson_1992, Rd_kostenB_1993, AbT_mokri_2016

CTN-0094 Outcomes

Filename: outcomesCTN0094.csv

  • Purpose: A standalone file that contains all known outcome measures from the CTN-0094 dataset.

  • Types of Outcomes: Abstinence (Ab), Retention (Rs), Dropout Rate (Rd), among others.

  • Example Columns: Ab_ctnNinetyFour_2023, AbT_shufman_1994, Rd_strang_2019, RsE_ctnFiftyOne_2018
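
The example column names appear to follow a prefix_source_year pattern. Assuming that pattern holds for every outcome column, a short sketch like the following could group columns by their prefix. The helper names split_outcome_name and group_by_prefix are hypothetical, and the sketch does not interpret what suffixes such as T or E mean.

```python
from collections import defaultdict


def split_outcome_name(column):
    """Split a name like 'AbT_shufman_1994' into ('AbT', 'shufman', 1994).

    Assumes exactly three underscore-separated parts; a column that does
    not match will raise ValueError rather than be silently mis-parsed.
    """
    prefix, source, year = column.split("_")
    return prefix, source, int(year)


def group_by_prefix(columns):
    """Map each prefix (e.g. 'Ab', 'AbT', 'Rd') to the columns that use it."""
    groups = defaultdict(list)
    for column in columns:
        prefix, _, _ = split_outcome_name(column)
        groups[prefix].append(column)
    return dict(groups)


columns = [
    "Ab_ctnNinetyFour_2023",
    "AbT_shufman_1994",
    "Rd_strang_2019",
    "RsE_ctnFiftyOne_2018",
]
groups = group_by_prefix(columns)
print(groups)
```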

All Outcome Selections

Filename: all_outcome_selections.csv

  • Purpose: Subset of outcomes selected for model testing, spanning multiple outcome categories.

  • Example Columns: Ab_ctnNinetyFour_2023, Rd_kostenB_1993, RsT_lee_2016

Master Outcome Selections

Filename: master_outcome_selections.csv

  • Purpose: Combines all outcome columns with the cleaned features used in modeling.

  • Example Columns: Demographic and drug usage variables, plus outcome variables such as RsT_ctnFiftyOne_2018

Binary Outcome Selections

Filename: binary_outcome_selections.csv

  • Purpose: Outcomes that are binary (e.g., yes/no, true/false).

  • Example Columns: Ab_krupitskyA_2011, Rd_kostenB_1993, Rs_johnson_1992

All Binary Selected Outcomes

Filename: all_binary_selected_outcomes.csv

  • Purpose: Subset of the merged dataset that includes only binary outcomes.

  • Example Columns: ctn0094_relapse_event, Ab_ling_1998, Rs_krupitsky_2004

Other Outcome Selections

Filename: other_outcome_selections.csv

  • Purpose: Collection of outcome variables that are count-based or numeric (e.g., session count, time until dropout).

  • Example Columns: AbT_mokri_2016, RsT_ctnFiftyOne_2018
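
One way to separate binary outcomes from count-based or numeric ones is to inspect the distinct non-missing values in each column. The sketch below is only an illustration under that assumption: the is_binary helper and the toy values are hypothetical, and the pipeline may instead rely on a fixed list of column names.

```python
import pandas as pd

# Toy outcome table; values are invented for illustration only.
outcomes = pd.DataFrame({
    "who": [1, 2, 3],
    "Ab_krupitskyA_2011": [1, 0, None],  # 0/1 with one missing value
    "AbT_mokri_2016": [3, 12, 0],        # count-like values
})


def is_binary(series):
    """True when the non-missing values are all 0/1 (or False/True)."""
    values = set(series.dropna().tolist())
    return len(values) > 0 and values <= {0, 1}


outcome_cols = [c for c in outcomes.columns if c != "who"]
binary_cols = [c for c in outcome_cols if is_binary(outcomes[c])]
other_cols = [c for c in outcome_cols if c not in binary_cols]
print(binary_cols, other_cols)
```

Note that a numeric column which happens to contain only 0 and 1 would also be classified as binary by this check, so the result is worth reviewing by hand.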

Dataset Usage Notes

  • All datasets use the who column as the unique identifier for participants.

  • Some datasets contain missing values (NaN), especially in derived outcomes or less common drug use metrics.

  • The pipeline joins or filters these datasets at various stages depending on the outcome type and modeling goal.
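
Before joining or modeling, it can help to confirm the two properties noted above: that who is unique and how much of each column is missing. The following is a minimal sketch with made-up values; the function name missing_share is hypothetical and not part of the pipeline.

```python
import pandas as pd

# Toy dataset with one missing outcome value; values are invented.
df = pd.DataFrame({
    "who": [1, 2, 3, 4],
    "Ab_ling_1998": [1, None, 0, 1],
})


def missing_share(frame, key="who"):
    """Return the fraction of NaN per column, after checking `key` is unique."""
    if frame[key].duplicated().any():
        raise ValueError(f"duplicate values found in identifier column '{key}'")
    return frame.drop(columns=key).isna().mean()


shares = missing_share(df)
print(shares)

# Keep only participants with an observed value for the chosen outcome.
model_rows = df.dropna(subset=["Ab_ling_1998"])
```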