This function trains one or more machine learning models using repeated k-fold cross-validation, with optional model stacking and optional feature selection using Boruta. It supports stratified cross-validation, including folds stratified by cohort when cohort labels are available.

Usage

compute_features.training.ML(
  features_train,
  target_var,
  trait.positive,
  metric = "Accuracy",
  stack,
  k_folds = 10,
  n_rep = 5,
  feature.selection = FALSE,
  LODO = FALSE,
  n_boruta = 100,
  boruta_fix = FALSE,
  tentative = FALSE,
  boruta_threshold = 0.8,
  batch_id = NULL,
  file_name = NULL,
  ncores = NULL,
  return = FALSE,
  fold_construction_fun = NULL,
  fold_construction_args = list()
)

Arguments

features_train

A data frame containing the features used for training, with samples as rows.

target_var

A vector containing the target variable to predict.

trait.positive

Value in target_var to be considered as the positive class.

metric

Character. Metric used for hyperparameter tuning and model selection. Supported values are "Accuracy", "AUROC", and "AUPRC".

stack

Logical. Whether to perform model stacking. No default appears in the usage signature, so pass TRUE or FALSE explicitly.

k_folds

Integer. Number of folds to use in cross-validation.

n_rep

Integer. Number of repetitions of the cross-validation.

feature.selection

Logical. Whether to apply Boruta feature selection before model training. Default is FALSE.

LODO

Logical. If TRUE, constructs folds stratified by cohorts (Leave-One-Dataset-Out CV). Requires batch_id.

n_boruta

Integer. Number of iterations to run Boruta. Since Boruta involves randomness, repeated runs improve consistency. Default is 100.

boruta_fix

Logical. Whether to fix Boruta’s internal parameters. See compute_boruta() for details.

tentative

Logical. Whether to treat tentative features as confirmed and include them in the training dataset (only valid if boruta_fix = FALSE).

boruta_threshold

Numeric. Threshold for confirming features after multiple Boruta iterations. For example, 0.8 means features must be confirmed in at least 80% of iterations. Default is 0.8.

batch_id

A vector indicating the cohort or batch for each sample (required only if LODO = TRUE).

file_name

Character. File name used to save plots in the Results/ directory.

ncores

Integer. Number of cores to use for parallelization. If NULL (the default), detectCores() - 1 is used.

return

Logical. Whether to return and save the plots generated by the function.

fold_construction_fun

Function. A custom function used to construct the cross-validation folds. It should return a list of training indices for each fold.

fold_construction_args

List. Named list of additional arguments to pass to fold_construction_fun.
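
As an illustration of the contract described for fold_construction_fun, here is a minimal sketch of a leave-one-cohort-out fold constructor that returns one vector of training indices per fold. The name make_cohort_folds and its batch_id argument are hypothetical. How the package calls a custom constructor is not documented on this page, so check the expected signature before relying on it.

```r
# Hypothetical helper: the name and signature are assumptions for illustration.
# Returns a named list with one integer vector of training indices per fold;
# each fold holds out every sample belonging to a single cohort.
make_cohort_folds <- function(batch_id) {
  batch_id <- as.character(batch_id)
  cohorts <- unique(batch_id)
  folds <- lapply(cohorts, function(cohort) {
    which(batch_id != cohort)  # train on all samples from the other cohorts
  })
  names(folds) <- paste0("Fold_", cohorts)
  folds
}

# Toy cohort labels for six samples (made-up data)
batch_id <- c("A", "A", "B", "B", "B", "C")
folds <- make_cohort_folds(batch_id)
str(folds)
```

A function like this could be passed as fold_construction_fun = make_cohort_folds with fold_construction_args = list(batch_id = batch_id), assuming the named arguments are forwarded to the function as given.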

Value

A list containing:

  • Trained model (or meta-learner if stack = TRUE)

  • Features used in model training (all features if feature.selection = FALSE)
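
When feature.selection = TRUE, the returned feature set depends on boruta_threshold. The sketch below shows one way such a rule can be applied, assuming each Boruta run yields one decision per feature. The decisions matrix is made-up illustrative data, select_features is a hypothetical helper rather than a package function, and counting tentative features as confirmed is an assumption based on the description of the tentative argument.

```r
# Illustrative decisions from 5 hypothetical Boruta runs (rows) on 3 features.
decisions <- matrix(
  c("Confirmed", "Confirmed", "Rejected",
    "Confirmed", "Tentative", "Rejected",
    "Confirmed", "Confirmed", "Rejected",
    "Confirmed", "Confirmed", "Tentative",
    "Confirmed", "Rejected",  "Rejected"),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("f1", "f2", "f3"))
)

# Hypothetical helper: keep features accepted in at least `threshold`
# of the runs. With tentative = TRUE, "Tentative" counts as accepted.
select_features <- function(decisions, threshold = 0.8, tentative = FALSE) {
  accepted <- if (tentative) c("Confirmed", "Tentative") else "Confirmed"
  # %in% drops the matrix shape, so rebuild it before averaging by column
  hits <- matrix(decisions %in% accepted,
                 nrow = nrow(decisions),
                 dimnames = dimnames(decisions))
  freq <- colMeans(hits)  # fraction of runs in which each feature was accepted
  names(freq)[freq >= threshold]
}

select_features(decisions)                                     # strict rule
select_features(decisions, threshold = 0.7, tentative = TRUE)  # looser rule
```

In this toy data, f1 is confirmed in every run, f2 in three of five runs, and f3 in none, so only f1 passes the default 0.8 threshold.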