Perform repeated stratified k-fold cross-validation for model training and tuning

This function performs repeated stratified k-fold cross-validation on a dataset to train 13 machine learning methods and tune their hyperparameters. Optionally, it can also perform model stacking and Boruta-based feature selection. Performance is evaluated using a user-specified metric: Accuracy, AUROC, or AUPRC.
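
To make the term concrete, the following is a minimal base R sketch of how repeated stratified fold assignment can work. It is not the package's internal code: the helper stratified_folds() and the toy target vector are hypothetical and exist only for illustration.

```r
# Assign each row to one of k folds, keeping class proportions roughly equal.
# stratified_folds() is a hypothetical helper, not part of the package.
stratified_folds <- function(target, k) {
  fold_id <- integer(length(target))
  for (cls in unique(target)) {
    idx <- which(target == cls)
    # Shuffle the rows of this class, then deal them out across folds in turn.
    idx <- idx[sample.int(length(idx))]
    fold_id[idx] <- rep_len(seq_len(k), length(idx))
  }
  fold_id
}

set.seed(42)
target <- rep(c("case", "control"), times = c(20, 30))
k_folds <- 5
n_rep <- 3

# One fold assignment per repetition; each repetition reshuffles the rows.
repeats <- lapply(seq_len(n_rep), function(r) stratified_folds(target, k_folds))

# Class counts per fold for the first repetition: 4 cases and 6 controls each.
table(fold = repeats[[1]], class = target)
```

Each repetition keeps the same class balance per fold but places different rows in each fold, which is why averaging over repetitions gives a more stable performance estimate than a single k-fold split.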

Usage

compute_k_fold_CV(
  model,
  k_folds,
  n_rep,
  stacking = FALSE,
  metric = "Accuracy",
  boruta,
  boruta_iterations = NULL,
  fix_boruta = NULL,
  tentative = FALSE,
  boruta_threshold = NULL,
  file_name = NULL,
  LODO = FALSE,
  ncores = NULL,
  return = FALSE,
  fold_construction_fun = NULL,
  fold_construction_args = list()
)
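
A call might look like the sketch below. It assumes the package that provides compute_k_fold_CV() is already loaded, and that a character target column with two classes is accepted; neither is confirmed here. The toy data frame and its column names other than 'target' are made up for illustration.

```r
# Toy data: two numeric features plus the required column named 'target'.
set.seed(1)
toy <- data.frame(
  feature_1 = rnorm(40),
  feature_2 = rnorm(40),
  target    = rep(c("yes", "no"), each = 20)
)

# Only attempt the call when the function is available in the session.
if (exists("compute_k_fold_CV", mode = "function")) {
  res <- compute_k_fold_CV(
    model    = toy,
    k_folds  = 5,
    n_rep    = 2,        # kept small so the sketch runs quickly
    stacking = FALSE,
    metric   = "Accuracy",
    boruta   = FALSE,
    return   = TRUE
  )
  str(res, max.level = 1)
}
```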

Arguments

model: A data frame containing the features and a column named 'target' that holds the response variable to predict.
k_folds: Integer. Number of folds for k-fold cross-validation. Default is 5.
n_rep: Integer. Number of repetitions of the k-fold cross-validation. Default is 100.
stacking: Logical. Whether to perform model stacking. Default is FALSE.
metric: Character. Metric used for hyperparameter tuning and model evaluation. Supported values are "Accuracy", "AUROC", and "AUPRC".
boruta: Logical. Whether to apply Boruta for feature selection before model training. Note that many ML models handle feature importance internally, so prior selection is optional unless multicollinearity is a concern. Default is FALSE.
boruta_iterations: Integer. Number of iterations to run Boruta. Since Boruta involves randomness, repeated runs improve consistency. Default is 100.
fix_boruta: Logical. Whether to fix Boruta’s internal parameters. See compute_boruta() for details.
tentative: Logical. Whether to treat tentative features as confirmed and include them in the training dataset.
boruta_threshold: Numeric. Threshold for confirming features after multiple Boruta iterations. For example, 0.8 means features must be confirmed in at least 80% of iterations. Default is 0.8.
file_name: Character. File name used for saving output plots in the Results/ directory.
LODO: Logical. If TRUE, performs Leave-One-Dataset-Out (LODO) cross-validation, constructing folds from cohort membership rather than at random.
ncores: Integer. Number of cores to use for parallelization. If NULL, detectCores() - 1 is used.
return: Logical. Whether to return the results and generated plots.
fold_construction_fun: Function. A custom function used to construct the cross-validation folds. It should return a list with one element per fold, each holding the training indices for that fold.
fold_construction_args: List. Named list of additional arguments to pass to fold_construction_fun.
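
As an illustration of the contract described for fold_construction_fun, the sketch below builds one fold per cohort and returns the training indices for each. The function leave_one_cohort_out(), its argument names data and cohort_col, and the toy data are all hypothetical; how compute_k_fold_CV() actually passes the data to the custom function is not specified here, so treat the signature as an assumption.

```r
# Hypothetical fold constructor: one fold per cohort, training on all others.
# The argument names `data` and `cohort_col` are assumptions for illustration.
leave_one_cohort_out <- function(data, cohort_col = "cohort") {
  cohorts <- unique(data[[cohort_col]])
  folds <- lapply(cohorts, function(held_out) {
    # Training indices are the rows that do NOT belong to the held-out cohort.
    which(data[[cohort_col]] != held_out)
  })
  names(folds) <- paste0("holdout_", cohorts)
  folds
}

toy <- data.frame(
  feature_1 = c(0.1, 0.4, 0.3, 0.9, 0.7, 0.2),
  target    = c("yes", "no", "yes", "no", "yes", "no"),
  cohort    = c("A", "A", "B", "B", "C", "C")
)

folds <- leave_one_cohort_out(toy, cohort_col = "cohort")
str(folds)
```

If the real interface matches this assumption, such a function would be supplied as fold_construction_fun = leave_one_cohort_out together with fold_construction_args = list(cohort_col = "cohort").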

Value

A list containing:

Features used during training
The selected machine learning model
All trained machine learning models

If stacking = TRUE, the list will also include:

Base models
Meta-learner
Matrix of weighted feature importance (see calculate_feature_importance_stacking())