
Perform repeated stratified k-fold cross-validation for model training and tuning
Source: R/machine_learning.R, compute_k_fold_CV.Rd
This function performs repeated stratified k-fold cross-validation on a dataset to train models and tune their hyperparameters for 13 machine learning methods. Optionally, it can also perform model stacking and Boruta-based feature selection. Performance is evaluated using a user-specified metric such as Accuracy, AUROC, or AUPRC.
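To make the splitting scheme concrete, the following base-R sketch shows one way repeated stratified k-fold training indices could be built. It is not the package's implementation: make_stratified_folds() and its seed argument are hypothetical names introduced here for illustration only.

```r
# Sketch only: base-R illustration of repeated stratified k-fold splitting.
# make_stratified_folds() is a hypothetical helper, not part of the package.
make_stratified_folds <- function(target, k_folds, n_rep, seed = 1) {
  set.seed(seed)
  target <- as.factor(target)
  lapply(seq_len(n_rep), function(r) {
    fold_id <- integer(length(target))
    # Assign fold labels within each class so class proportions are preserved.
    for (cls in levels(target)) {
      idx <- which(target == cls)
      fold_id[idx] <- sample(rep(seq_len(k_folds), length.out = length(idx)))
    }
    # Training indices for fold f are all rows not held out in fold f.
    lapply(seq_len(k_folds), function(f) which(fold_id != f))
  })
}

# Toy response with 20 "yes" and 30 "no" observations.
target <- rep(c("yes", "no"), times = c(20, 30))
folds <- make_stratified_folds(target, k_folds = 5, n_rep = 3)
```

Here folds[[r]][[f]] holds the training row indices for fold f of repetition r, and each held-out fold keeps the 2:3 class ratio of the toy response.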
Usage
compute_k_fold_CV(
  model,
  k_folds,
  n_rep,
  stacking = FALSE,
  metric = "Accuracy",
  boruta,
  boruta_iterations = NULL,
  fix_boruta = NULL,
  tentative = FALSE,
  boruta_threshold = NULL,
  file_name = NULL,
  LODO = FALSE,
  ncores = NULL,
  return = FALSE,
  fold_construction_fun = NULL,
  fold_construction_args = list()
)
Arguments
- model
A data frame containing features and a target column named 'target' corresponding to the response variable to predict.
- k_folds
Integer. Number of folds for k-fold cross-validation. Default is 5.
- n_rep
Integer. Number of repetitions of the k-fold cross-validation. Default is 100.
- stacking
Logical. Whether to perform model stacking. Default is FALSE.
- metric
Character. Metric used for hyperparameter tuning and model evaluation. Supported values are "Accuracy", "AUROC", and "AUPRC".
- boruta
Logical. Whether to apply Boruta for feature selection before model training. Note that many ML models handle feature importance internally, so prior selection is optional unless multicollinearity is a concern. Default is FALSE.
- boruta_iterations
Integer. Number of iterations to run Boruta. Since Boruta involves randomness, repeated runs improve consistency. Default is 100.
- fix_boruta
Logical. Whether to fix Boruta's internal parameters. See compute_boruta() for details.
- tentative
Logical. Whether to treat tentative features as confirmed and include them in the training dataset.
- boruta_threshold
Numeric. Threshold for confirming features after multiple Boruta iterations. For example, 0.8 means features must be confirmed in at least 80% of iterations. Default is 0.8.
- file_name
Character. File name used for saving output plots in the Results/ directory.
- LODO
Logical. If TRUE, performs Leave-One-Dataset-Out (LODO) cross-validation by stratifying folds based on cohort membership.
- ncores
Integer. Number of cores to use for parallelization. If not given, detectCores() - 1 will be used.
- return
Logical. Whether to return the results and generated plots.
- fold_construction_fun
Function. A custom function used to construct the cross-validation folds. It should return a list of training indices for each fold.
- fold_construction_args
List. Named list of additional arguments to pass to fold_construction_fun.
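As an illustration of the fold_construction_fun contract (a list of training indices, one element per fold), the sketch below builds leave-one-cohort-out folds in base R. The function name leave_one_cohort_out() and its cohort argument are assumptions made for this example. How compute_k_fold_CV() invokes the function internally, for instance whether it also passes the data frame, is not stated on this page and should be checked against the source.

```r
# Hypothetical custom fold constructor: one fold per cohort, where each
# fold's training set is every row outside the held-out cohort.
# The argument name 'cohort' is an assumption for illustration.
leave_one_cohort_out <- function(cohort) {
  cohort <- as.factor(cohort)
  folds <- lapply(levels(cohort), function(held_out) which(cohort != held_out))
  names(folds) <- paste0("holdout_", levels(cohort))
  folds
}

# Toy cohort labels for six samples.
cohort <- c("A", "A", "B", "B", "B", "C")
train_idx <- leave_one_cohort_out(cohort)
```

Under these assumptions, such a function would be supplied as fold_construction_fun = leave_one_cohort_out together with fold_construction_args = list(cohort = cohort).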
Value
A list containing:
- Features used during training
- The selected machine learning model
- All trained machine learning models
If stacking = TRUE, the list will also include:
- Base models
- Meta-learner
- Matrix of weighted feature importance (see calculate_feature_importance_stacking())
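A call might look like the following sketch, which uses only the arguments shown in the usage signature. The toy data frame, its feature names, and the chosen argument values are made up for illustration. Whether the target column must be a factor, a character vector, or 0/1 is not stated on this page, so the factor used here is an assumption. The call is guarded so that the snippet still runs when the function is not loaded.

```r
# Toy data: the 'target' column name follows the argument description;
# feature names and values are invented for this example.
set.seed(42)
toy <- data.frame(
  feature_a = rnorm(60),
  feature_b = rnorm(60),
  target = factor(rep(c("case", "control"), each = 30))
)

# Only run the call when the function is available in the session.
if (exists("compute_k_fold_CV")) {
  res <- compute_k_fold_CV(
    model = toy,       # data frame with features and a 'target' column
    k_folds = 5,
    n_rep = 10,
    metric = "AUROC",
    boruta = FALSE,    # skip Boruta feature selection
    return = TRUE      # return results instead of only saving plots
  )
}
```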