PSM & Cohort Construction

File: src/create_demodf_knn.py

Handles stratified held-out set creation and propensity score matching (PSM) to construct balanced training cohorts. PSM is performed using R's MatchIt package via rpy2.

`holdOutTestData`

holdOutTestData(df, id_column, testCount=100, columnToSplit='RaceEth',
                majorityValue=1, percentMajority=58, seed=42)

Stratifies and separates a held-out evaluation set from the main dataset before PSM runs. The held-out set is fixed for the duration of the experiment and is never used in training.

Parameters

Name	Type	Default	Description
`df`	`pd.DataFrame`	—	Full input dataset
`id_column`	`str`	—	Name of the unique patient ID column
`testCount`	`int`	`100`	Total number of held-out samples
`columnToSplit`	`str`	`"RaceEth"`	Column defining majority/minority groups
`majorityValue`	`int`	`1`	Value in `columnToSplit` representing the majority group
`percentMajority`	`int`	`58`	Percentage of held-out set drawn from the majority group
`seed`	`int`	`42`	Random seed for reproducibility

Returns: (train_df, test_df) — training pool and held-out set as DataFrames.
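
The split described above can be sketched in pandas as follows. This is a minimal illustration, not the code in `src/create_demodf_knn.py`: the function name `hold_out_sketch`, the rounding of the majority count, and the toy `patient_id` column are all assumptions made for the example.

```python
import pandas as pd


def hold_out_sketch(df, id_column, test_count=100, column_to_split="RaceEth",
                    majority_value=1, percent_majority=58, seed=42):
    """Hypothetical stand-in for holdOutTestData (assumed behaviour)."""
    # Assumption: the majority share is rounded to the nearest whole sample.
    n_majority = round(test_count * percent_majority / 100)
    n_minority = test_count - n_majority

    is_majority = df[column_to_split] == majority_value

    # Draw each stratum separately so the held-out set has the requested mix.
    test_df = pd.concat([
        df[is_majority].sample(n=n_majority, random_state=seed),
        df[~is_majority].sample(n=n_minority, random_state=seed),
    ])

    # Everything not held out stays in the training pool.
    train_df = df[~df[id_column].isin(test_df[id_column])]
    return train_df, test_df


# Toy data: 600 majority rows (RaceEth == 1) and 400 minority rows.
toy = pd.DataFrame({
    "patient_id": range(1000),
    "RaceEth": [1] * 600 + [2] * 400,
})
train_df, test_df = hold_out_sketch(toy, "patient_id")
```

With the defaults, the held-out set in this sketch contains 58 majority and 42 minority rows, and no patient ID appears in both outputs.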

`propensityScoreMatch`

propensityScoreMatch(df, idColumn, columnToSplit='RaceEth', majorityValue=1,
                     columnsToMatch=['age', 'is_female'], sampleSize=500)

Runs PSM on the training pool to produce a matched set of minority and majority participants. Each minority participant is matched to two majority participants on the features listed in `columnsToMatch`.

Parameters

Name	Type	Default	Description
`df`	`pd.DataFrame`	—	Training pool (after held-out removal)
`idColumn`	`str`	—	Unique patient ID column
`columnToSplit`	`str`	`"RaceEth"`	Column defining majority/minority groups
`majorityValue`	`int`	`1`	Value representing the majority group
`columnsToMatch`	`list`	`["age", "is_female"]`	Features to match on during PSM
`sampleSize`	`int`	`500`	Number of minority participants to match

Returns: List of 3 DataFrames — [minority_df, majority_df_1, majority_df_2] — one minority group and two matched majority groups.
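
To make the 1:2 matching idea concrete, here is a small pure-Python sketch of greedy nearest-neighbour matching without replacement on precomputed propensity scores. It is not what MatchIt does internally and not the code in this module: the function name `match_one_to_two`, the greedy strategy, and the toy IDs and scores are assumptions chosen only for illustration.

```python
def match_one_to_two(minority_scores, majority_scores):
    """Greedy 1:2 nearest-neighbour matching without replacement.

    Both arguments map a participant ID to a precomputed propensity score.
    Returns {minority_id: [first_match_id, second_match_id]}.
    Assumes there are at least twice as many majority as minority entries.
    """
    available = dict(majority_scores)  # majority participants not yet used
    matches = {}
    for min_id, score in minority_scores.items():
        picked = []
        for _ in range(2):
            # Closest remaining majority participant by score distance.
            nearest = min(available, key=lambda maj_id: abs(available[maj_id] - score))
            picked.append(nearest)
            del available[nearest]  # without replacement
        matches[min_id] = picked
    return matches


# Toy scores (hypothetical values).
minority = {"m1": 0.30, "m2": 0.70}
majority = {"a": 0.28, "b": 0.35, "c": 0.66, "d": 0.71, "e": 0.90}
pairs = match_one_to_two(minority, majority)
```

In this sketch, collecting every first match and every second match would give two majority groups of the same size as the minority group, which mirrors the shape of the three-DataFrame return value. Whether `propensityScoreMatch` forms its two majority DataFrames this way is an assumption.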

R dependency

This function calls R's MatchIt package through rpy2, so a working R installation is required. If MatchIt is not installed, the pipeline prompts you on first run and can install it for you.

`create_subsets`

create_subsets(dfs, splits=11, sampleSize=500)

Combines the PSM-matched DataFrames into a series of training cohorts at varying majority/minority ratios. Produces `splits` cohorts of `sampleSize` participants each, spanning from 100% majority (subset 1) to 100% minority (the last subset).

Parameters

Name	Type	Default	Description
`dfs`	`list[pd.DataFrame]`	—	Output of `propensityScoreMatch` — 3 DataFrames
`splits`	`int`	`11`	Number of cohorts to construct
`sampleSize`	`int`	`500`	Size of each cohort

Returns: List of `splits` DataFrames, one per demographic ratio.

Ratio progression (default 11 splits)

Subset	Minority %	Majority %
1	0%	100%
2	10%	90%
3	20%	80%
...	...	...
6	50%	50%
...	...	...
11	100%	0%
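
The progression in the table can be turned into per-cohort head counts with a few lines of arithmetic. The helper below is a hypothetical sketch (the name `ratio_counts` and the evenly spaced, rounded steps are assumptions); it only computes how many minority and majority participants each cohort would hold, not the cohorts themselves.

```python
def ratio_counts(splits=11, sample_size=500):
    """Return (n_minority, n_majority) for each cohort, in subset order.

    Assumes the minority share rises in equal steps from 0% to 100%
    and that splits is at least 2.
    """
    counts = []
    for i in range(splits):
        # Minority share for subset i + 1 is i / (splits - 1).
        n_minority = round(sample_size * i / (splits - 1))
        counts.append((n_minority, sample_size - n_minority))
    return counts


counts = ratio_counts()
for subset, (n_min, n_maj) in enumerate(counts, start=1):
    print(f"subset {subset}: {n_min} minority, {n_maj} majority")
```

With the defaults, subset 1 holds 0 minority and 500 majority participants, subset 6 holds 250 of each, and subset 11 holds 500 minority and 0 majority.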