Train and evaluate machine learning models with optional stacking and feature selection

This function trains machine learning models using cross-validation on training data and evaluates them on test data. It supports Boruta feature selection, model stacking, and cohort-based Leave-One-Dataset-Out (LODO) validation, and it can select the classification threshold that maximizes a specified performance metric.

Usage

compute_features.ML(
  features_train,
  features_test,
  clinical,
  trait,
  trait.positive,
  metric = "Accuracy",
  stack,
  k_folds = 10,
  n_rep = 5,
  feature.selection = FALSE,
  LODO = FALSE,
  n_boruta = 100,
  boruta_fix = FALSE,
  tentative = FALSE,
  boruta_threshold = 0.8,
  batch_id = NULL,
  file_name = NULL,
  ncores = NULL,
  maximize = "Accuracy",
  return = FALSE,
  fold_construction_fun = NULL,
  fold_construction_args = list()
)

Arguments

features_train: A data frame of features used for training the models, with samples as rows.
features_test: A data frame of features used for testing the models.
clinical: A data frame containing clinical information, including the target variable and optionally a batch ID. Row names must match the sample identifiers in features_train and features_test.
trait: Character. The name of the column in clinical corresponding to the target variable.
trait.positive: Value in trait to be considered as the positive class.
metric: Character. Metric used for hyperparameter tuning and model selection. Supported values are "Accuracy", "AUROC", and "AUPRC".
stack: Logical. Whether to apply model stacking. Default is FALSE.
k_folds: Integer. Number of folds for cross-validation.
n_rep: Integer. Number of cross-validation repetitions.
feature.selection: Logical. Whether to apply Boruta feature selection before training. Default is FALSE.
LODO: Logical. If TRUE, folds are constructed in a Leave-One-Dataset-Out (LODO) manner based on cohorts.
n_boruta: Integer. Number of iterations to run Boruta. Since Boruta involves randomness, repeated runs improve consistency. Default is 100.
boruta_fix: Logical. Whether to fix Boruta’s internal parameters. See compute_boruta() for details.
tentative: Logical. Whether to treat tentative features as confirmed and keep them in the training dataset (only valid if boruta_fix = FALSE).
boruta_threshold: Numeric. Threshold for confirming features after multiple Boruta iterations. For example, 0.8 means features must be confirmed in at least 80% of iterations. Default is 0.8.
batch_id: A vector indicating the cohort/batch for each sample (only required if LODO = TRUE).
file_name: Character. Base name used to save plots in the Results/ directory.
ncores: Integer. Number of cores to use for parallelization. If NULL, detectCores() - 1 is used.
maximize: A character string indicating which metric to maximize when selecting the best threshold for the confusion matrix. Options include "Accuracy", "Precision", "Recall", "Specificity", "Sensitivity", "F1", or "MCC". Default is "Accuracy".
return: Logical. Whether to return and save plots generated by the function.
fold_construction_fun: Function. A custom function used to construct the cross-validation folds. It should return a list of training indices for each fold.
fold_construction_args: List. Named list of additional arguments to pass to fold_construction_fun.
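
As an illustration of the return shape described for fold_construction_fun (a list of training indices, one element per fold), the following minimal sketch builds leave-one-dataset-out folds from cohort labels in base R. The helper name make_lodo_folds and its batch_id argument are hypothetical; how compute_features.ML actually calls a custom fold function is not specified on this page.

```r
# Hypothetical helper, not part of the documented package: builds
# leave-one-dataset-out folds from a vector of cohort labels.
make_lodo_folds <- function(batch_id) {
  cohorts <- unique(batch_id)
  # For each cohort, keep the row indices of all samples from the other cohorts
  folds <- lapply(cohorts, function(cohort) which(batch_id != cohort))
  names(folds) <- paste0("leave_out_", cohorts)
  folds
}

# Toy cohort labels for six samples (illustrative data only)
batch_id <- c("A", "A", "B", "B", "C", "C")
folds <- make_lodo_folds(batch_id)
str(folds)
```

Under these assumptions, such a function might be supplied as fold_construction_fun = make_lodo_folds together with fold_construction_args = list(batch_id = batch_id).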

Value

A list containing:

Trained model (or meta-learner if stacking was used)
Features used in model training
Prediction performance metrics
AUC scores (AUROC and AUPRC)
Predicted class probabilities on test data
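
The idea behind the maximize argument, choosing the probability threshold that gives the best value of a metric, can be sketched in base R. This is a minimal sketch using accuracy only; the helper best_threshold, the threshold grid, and the toy data are assumptions for illustration and do not reproduce the function's internal procedure.

```r
# Hypothetical helper: scans candidate thresholds and returns the one
# with the highest accuracy on labelled data.
best_threshold <- function(prob, truth, positive,
                           thresholds = seq(0.05, 0.95, by = 0.05)) {
  accuracy <- vapply(thresholds, function(t) {
    predicted_positive <- prob >= t
    # Fraction of samples whose predicted class matches the true class
    mean(predicted_positive == (truth == positive))
  }, numeric(1))
  best <- which.max(accuracy)  # first maximum if several thresholds tie
  list(threshold = thresholds[best], accuracy = accuracy[best])
}

# Toy predicted probabilities and true labels (illustrative data only)
prob  <- c(0.12, 0.33, 0.41, 0.62, 0.83, 0.91)
truth <- c("no", "no", "yes", "no", "yes", "yes")
res <- best_threshold(prob, truth, positive = "yes")
print(res)
```

Other metrics listed for maximize, such as F1 or MCC, would replace the accuracy computation inside the loop.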