I’m trying to calculate 95% confidence intervals for the sensitivity and specificity of a decision model that I’m building.
I’ve split my dataset into 90/10 train and test sets. I used the 90% train set to perform hyperparameter tuning, and then used the optimal decision model selected from within the 90% train dataset to evaluate the 10% holdout dataset, which is fully independent and was not used in the hyperparameter tuning process.
My problem is: what’s the best approach to obtain 95% confidence intervals for performance on the test dataset? Should I bootstrap multiple resamples of the test data, evaluate each with the optimal model identified via hyperparameter tuning, and use those for the calculation? In the example uses of bootstrapping that I found, different resamples of both the train and test datasets are used for bootstrapping. However, I don’t want to do that, because I want my evaluation to be on a truly held-out (aka validation) dataset.
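For what it’s worth, here is a minimal sketch of the approach I have in mind: bootstrap only the holdout predictions (the model is fixed, so no refitting), and take percentile intervals of sensitivity and specificity across resamples. The `y_true`/`y_pred` arrays below are synthetic stand-ins for my real holdout labels and predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the 10% holdout set: true labels and
# predictions from the already-tuned model (replace with real data).
y_true = rng.integers(0, 2, size=200)
y_pred = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)  # ~80%-accurate toy predictions

def sens_spec(y_t, y_p):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp = np.sum((y_t == 1) & (y_p == 1))
    fn = np.sum((y_t == 1) & (y_p == 0))
    tn = np.sum((y_t == 0) & (y_p == 0))
    fp = np.sum((y_t == 0) & (y_p == 1))
    return tp / (tp + fn), tn / (tn + fp)

n = len(y_true)
boot_sens, boot_spec = [], []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)  # resample holdout cases with replacement
    se, sp = sens_spec(y_true[idx], y_pred[idx])
    boot_sens.append(se)
    boot_spec.append(sp)

# 95% percentile confidence intervals
sens_lo, sens_hi = np.percentile(boot_sens, [2.5, 97.5])
spec_lo, spec_hi = np.percentile(boot_spec, [2.5, 97.5])
print(f"sensitivity 95% CI: [{sens_lo:.3f}, {sens_hi:.3f}]")
print(f"specificity 95% CI: [{spec_lo:.3f}, {spec_hi:.3f}]")
```

The model never sees the resampled data during fitting; resampling only quantifies the sampling variability of the holdout estimates, so the test set stays fully held out.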