This vignette presents two typical use cases of MCBoost with data from a health survey. The goal is to post-process two initial prediction models for multi-accuracy using different flavors of MCBoost, and to eventually compare the naive and post-processed predictors overall and for subpopulations. The first scenario starts with a neural net and, as an example, evaluates the initial and post-processed predictors with a focus on subgroup accuracy after running MCBoost. The second scenario uses a random forest and evaluates the initial and post-processed predictors with respect to subgroup calibration.
We use data derived from the National Health Interview Survey (NHIS 2003), which includes demographic and health-related variables for 21,588 individuals. This data can directly be included from the
We can obtain more information using:
In the following, our outcome of interest is whether an individual is covered by any type of health insurance (
notcov, 1 = not covered, 0 = covered). We additionally prepare two sets of variables:
The second set of variables will not be used for training the initial prediction models, but will be our focus when it comes to evaluating prediction performance for subgroups.
Before we training an initial model, we preprocess the data:
NAs in categorical features to a dedicated factor level
NAs in the outcome variable
notcovas a factor variable instead of a dummy variable (1 =
notcov, 0 =
inv_wtas the inverse of survey weights
categorical <- c("age.grp", "parents", "educ", "inc.grp", "doing.lw", "limited", "sex", "hisp", "race") nhis <- nhis.large %>% mutate_at(categorical, as.factor) %>% mutate_at(categorical, fct_explicit_na) %>% drop_na(notcov) %>% select(all_of(categorical), notcov, svywt, ID) nhis$notcov <- factor(ifelse(nhis$notcov == 1, "notcov", "cov")) nhis_enc <- data.frame(model.matrix(notcov ~ ., data = nhis)[,-1]) nhis_enc$notcov <- nhis$notcov nhis_enc$sex <- nhis$sex nhis_enc$hisp <- nhis$hisp nhis_enc$race <- nhis$race nhis_enc$inv_wt <- (1 / nhis$svywt)
The pre-processed NHIS data will be split into three datasets:
trainfor training the initial prediction models (55 % of data)
postfor post-processing the initial models with MCBoost (20 %)
testfor model evaluation (25 %)
To increase the difficulty of the prediction task, we sample from the NHIS data such that the prevalence of demographic subgroups in the test data differs from their prevalence in the training and auditing data. This is achieved by employing weighted sampling from NHIS (variable
inv_wt from above).
set.seed(2953) test <- nhis_enc %>% slice_sample(prop = 0.25, weight_by = inv_wt) nontest_g <- nhis_enc %>% anti_join(test, by = "ID") train_g <- nontest_g %>% slice_sample(prop = 0.75) post <- nontest_g %>% anti_join(train_g, by = "ID") %>% select(-ID, -svywt, -inv_wt, -c(sex:race)) train <- train_g %>% select(-ID, -svywt, -inv_wt, -c(sex:race), -c(sex2:race3))
As a result, non-hispanic white individuals (
hisp2) are overrepresented and hispanic individuals are underrepresented in both the training and auditing set, compared to their prevalence in the test set.
We train an initial model for predicting healthcare coverage with the training set. Here, we use a neural network with one hidden layer, rather naively with little tweaking.
We prepare a function that allows us to pass the predictions of the model to MCBoost for post-processing.
To showcase different use cases of MCBoost, we prepare two post-processing data sets based on the auditing set. The first set includes only the predictor variables that were used by the initial models, whereas the second set will allow post-processing based on our demographic subgroups of interest (sex, hispanic ethnicity, race).
We initialize two custom auditors for MCBoost: Ridge regression with a small penalty on model complexity, and a
SubpopAuditorFitter with a fixed set of subpopulations.
The ridge regression will only be given access to the initial predictor variables when post-processing the neural net predictions with the auditing data. In contrast, we guide the subpop-fitter to audit the initial predictions explicitly on the outlined subpopulations (sex, hispanic ethnicity, race). In summary, we have:
nnet: Initial neural net
nnet_mc_ridge: Neural net, post-processed with ridge regression and the initial set of predictor variables
nnet_mc_subpop: Neural net, post-processed with a fixed set of subpopulations
nnet_mc_ridge = MCBoost$new(init_predictor = init_nnet, auditor_fitter = ridge, multiplicative = TRUE, partition = TRUE, max_iter = 15) nnet_mc_ridge$multicalibrate(d1, l) nnet_mc_subpop = MCBoost$new(init_predictor = init_nnet, auditor_fitter = pops, partition = TRUE, max_iter = 15) nnet_mc_subpop$multicalibrate(d2, l)
Next, we use the initial and post-processed models to predict the outcome in the test data. We compute predicted probabilities and class predictions.
test$nnet <- predict(nnet, newdata = test)[, 2] test$nnet_mc_ridge <- nnet_mc_ridge$predict_probs(test) test$nnet_mc_subpop <- nnet_mc_subpop$predict_probs(test) test$c_nnet <- round(test$nnet) test$c_nnet_mc_ridge <- round(test$nnet_mc_ridge) test$c_nnet_mc_subpop <- round(test$nnet_mc_subpop) test$label <- 1 - one_hot(test$notcov)
Here we compare the overall accuracy of the initial and post-processed models. Overall, we observe little differences in performance.
However, we might be concerned with model performance for smaller subpopulations. In the following, we focus on subgroups defined by 2-way conjunctions of sex, hispanic ethnicity, and race.
test <- test %>% group_by(sex, hisp) %>% mutate(sex_hisp = cur_group_id()) %>% group_by(sex, race) %>% mutate(sex_race = cur_group_id()) %>% group_by(hisp, race) %>% mutate(hisp_race = cur_group_id()) %>% ungroup() grouping_vars <- c("sex", "hisp", "race", "sex_hisp", "sex_race", "hisp_race") eval <- map(grouping_vars, group_by_at, .tbl = test) %>% map(summarise, 'accuracy_nnet' = mean(c_nnet == label), 'accuracy_nnet_mc_ridge' = mean(c_nnet_mc_ridge == label), 'accuracy_nnet_mc_subpop' = mean(c_nnet_mc_subpop == label), 'size' = n()) %>% bind_rows()
We evaluate classification accuracy on these subpopulations, and order the results according to the size of the selected subgroups (
size). Subgroup accuracy varies between methods, with MCBoost-Ridge (
nnet_mc_ridge) and MCBoost-Subpop (
nnet_mc_subpop) stabilizing subgroup performance when compared to the initial model, respectively.
In this scenario, we use a random forest with the default settings of the ranger package as the initial predictor.
We again prepare a function to pass the predictions to MCBoost for post-processing.
We use two custom auditors for MCBoost, i.e., ridge and lasso regression with different penalties on model complexity.
The ridge regression will only be given access to the initial predictor variables when post-processing the random forest predictions. In contrast, we allow the lasso regression to audit the initial predictions both with the initial predictors and the subpopulations (sex, hispanic ethnicity, race). In summary, we have:
rf: Initial random forest
rf_mc_ridge: Random forest, post-processed with ridge regression and the initial set of predictor variables
rf_mc_lasso: Random forest, post-processed with lasso regression and the extended set of predictors
rf_mc_ridge = MCBoost$new(init_predictor = init_rf, auditor_fitter = ridge, multiplicative = TRUE, partition = TRUE, max_iter = 15) rf_mc_ridge$multicalibrate(d1, l) rf_mc_lasso = MCBoost$new(init_predictor = init_rf, auditor_fitter = lasso, multiplicative = TRUE, partition = TRUE, max_iter = 15) rf_mc_lasso$multicalibrate(d2, l)
We again compute predicted probabilities and class predictions using the initial and post-processed models.
Here we compare the overall accuracy of the initial and post-processed models. As before, we observe small differences in overall performance.
However, we might be concerned with calibration in subpopulations. In the following we focus on subgroups defined by 2-way conjunctions of sex, hispanic ethnicity, and race.
This evaluation focuses on the difference between the average predicted risk of healthcare non-coverage and the observed proportion of non-coverage in the test data for subgroups. Considering the MCBoost-Ridge (
rf_mc_ridge) and MCBoost-Lasso (
rf_mc_lasso) results, post-processing with MCBoost reduces bias for many subpopulations.