This function trains machine learning models with repeated cross-validation on training data and evaluates them on held-out test data. It supports Boruta feature selection, model stacking, and cohort-based Leave-One-Dataset-Out (LODO) validation, and it can select the classification threshold that maximizes a specified performance metric.

Usage

compute_features.ML(
  features_train,
  features_test,
  clinical,
  trait,
  trait.positive,
  metric = "Accuracy",
  stack,
  k_folds = 10,
  n_rep = 5,
  feature.selection = FALSE,
  seed,
  LODO = FALSE,
  n_boruta = 100,
  boruta_fix = FALSE,
  batch_id = NULL,
  file_name = NULL,
  ncores = NULL,
  maximize = "Accuracy",
  return = FALSE,
  fold_construction_fun = NULL,
  fold_construction_args = list()
)

Arguments

features_train

A data frame of features used for training the models.

features_test

A data frame of features used for testing the models.

clinical

A data frame containing clinical information, including the target variable and optionally a batch ID. Row names must match the sample identifiers in features_train and features_test.
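The row-name requirement can be checked before calling the function. The following is a minimal sketch on toy data, assuming samples are stored in rows of the feature tables; the column names response and cohort and the sample identifiers are hypothetical and not taken from the package.

```r
set.seed(1)  # toy data only

# Hypothetical sample identifiers for a training and a test set
train_ids <- paste0("S", 1:8)
test_ids  <- paste0("S", 9:12)
feat_names <- c("f1", "f2", "f3")

# Feature tables: samples in rows, features in columns (assumed layout)
features_train <- as.data.frame(
  matrix(rnorm(8 * 3), nrow = 8, dimnames = list(train_ids, feat_names))
)
features_test <- as.data.frame(
  matrix(rnorm(4 * 3), nrow = 4, dimnames = list(test_ids, feat_names))
)

# Clinical table keyed by the same identifiers through its row names
clinical <- data.frame(
  response  = rep(c("R", "NR"), 6),          # hypothetical target column
  cohort    = rep(c("A", "B", "C"), each = 4), # hypothetical batch column
  row.names = c(train_ids, test_ids)
)

# Every sample in the feature tables needs a clinical record
missing_train <- setdiff(rownames(features_train), rownames(clinical))
missing_test  <- setdiff(rownames(features_test), rownames(clinical))
stopifnot(length(missing_train) == 0, length(missing_test) == 0)

# Labels pulled in the same order as the feature rows
y_train <- clinical[rownames(features_train), "response"]
```

With inputs shaped like this, trait would be "response" and trait.positive would be "R" in this toy setup.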

trait

Character. The name of the column in clinical corresponding to the target variable.

trait.positive

The value in the trait column to treat as the positive class.

metric

Character. Metric used for hyperparameter tuning and model selection. Supported values are "Accuracy", "AUROC", and "AUPRC".

stack

Logical. Whether to apply model stacking. The usage signature above declares no default, so pass TRUE or FALSE explicitly.

k_folds

Integer. Number of folds for cross-validation.

n_rep

Integer. Number of cross-validation repetitions.

feature.selection

Logical. Whether to apply Boruta feature selection before training. Default is FALSE.

seed

Integer. Random seed for reproducibility.

LODO

Logical. If TRUE, folds are constructed in a Leave-One-Dataset-Out (LODO) manner based on cohorts.

n_boruta

Integer. Number of iterations to run Boruta. Since Boruta involves randomness, repeated runs improve consistency. Default is 100.

boruta_fix

Logical. Whether to fix Boruta’s internal parameters. See compute_boruta() for details.

batch_id

A vector giving the cohort/batch of each sample. Only required if LODO = TRUE.
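To illustrate what Leave-One-Dataset-Out folds look like, here is a minimal base R sketch that derives one fold per cohort from a batch vector. It is not the package's internal code; the cohort labels and the object names lodo_train_idx and lodo_test_idx are hypothetical.

```r
# Hypothetical cohort label for each of 9 training samples
batch_id <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")

# One fold per cohort: train on every sample outside that cohort
cohorts <- unique(batch_id)
lodo_train_idx <- lapply(cohorts, function(co) which(batch_id != co))
names(lodo_train_idx) <- paste0("leave_out_", cohorts)

# Held-out indices are the complement of each training set
lodo_test_idx <- lapply(
  lodo_train_idx,
  function(idx) setdiff(seq_along(batch_id), idx)
)
```

Each model is then validated on a cohort it never saw during fitting, which gives a more conservative estimate than random folds when cohorts differ systematically.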

file_name

Character. Base name used to save plots in the Results/ directory.

ncores

Integer. Number of cores to use for parallelization. If NULL (the default), detectCores() - 1 is used.
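The documented fallback can be reproduced outside the function, for example to log how many workers a run will use. This is a sketch under assumptions: resolve_ncores is a hypothetical helper, and the guards for NA and single-core machines are additions that the source does not describe.

```r
# Hypothetical helper mirroring the documented default of detectCores() - 1
resolve_ncores <- function(ncores = NULL) {
  if (is.null(ncores)) {
    detected <- parallel::detectCores()
    # detectCores() can return NA on some platforms; fall back to 1 (assumption)
    if (is.na(detected)) detected <- 1L
    # Keep at least one worker on single-core machines (assumption)
    ncores <- max(1L, detected - 1L)
  }
  as.integer(ncores)
}

n_workers <- resolve_ncores()
```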

maximize

A character string indicating which metric to maximize when selecting the best threshold for the confusion matrix. Options include "Accuracy", "Precision", "Recall", "Specificity", "Sensitivity", "F1", or "MCC". Default is "Accuracy".
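Threshold selection of this kind can be sketched as a search over candidate cutoffs. The example below uses toy probabilities and labels and implements only "Accuracy" and "F1"; the helper score_at, the cutoff grid, and the tie-breaking rule (first best cutoff) are assumptions, not the package's actual procedure.

```r
# Toy predicted probabilities for the positive class and true labels (1 = positive)
prob  <- c(0.12, 0.33, 0.41, 0.57, 0.63, 0.81, 0.92)
truth <- c(0, 0, 1, 0, 1, 1, 1)

# Hypothetical helper: score one cutoff; a sample is positive when prob >= cutoff
score_at <- function(cutoff, prob, truth, metric = c("Accuracy", "F1")) {
  metric <- match.arg(metric)
  pred <- as.integer(prob >= cutoff)
  tp <- sum(pred == 1 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  tn <- sum(pred == 0 & truth == 0)
  if (metric == "Accuracy") {
    (tp + tn) / length(truth)
  } else if (tp == 0) {
    0  # F1 is defined as 0 here when there are no true positives
  } else {
    2 * tp / (2 * tp + fp + fn)
  }
}

# Evaluate a grid of candidate cutoffs and keep the first one with the best score
cutoffs <- seq(0.05, 0.95, by = 0.05)
acc <- vapply(cutoffs, score_at, numeric(1),
              prob = prob, truth = truth, metric = "Accuracy")
best_cutoff <- cutoffs[which.max(acc)]
```

Swapping metric = "F1" into the vapply() call optimizes the cutoff for F1 instead, which can move the threshold when classes are imbalanced.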

return

Logical. Whether to return and save the plots generated by the function. Default is FALSE.

fold_construction_fun

Function. A custom function used to construct the cross-validation folds. It should return a list with one element per fold, each holding the training indices for that fold.

fold_construction_args

List. Named list of additional arguments to pass to fold_construction_fun.
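As an illustration of the expected return shape, the sketch below builds stratified folds and returns the training indices of each fold as a list. The function name stratified_train_folds and its arguments y, k, and seed are hypothetical; the source does not state which arguments the package passes to the custom function, so check that before adapting this.

```r
# Hypothetical custom fold constructor: stratified k-fold on the outcome.
# Argument names (y, k, seed) are illustrative, not the package's contract.
stratified_train_folds <- function(y, k = 5, seed = 1) {
  set.seed(seed)
  fold_of <- integer(length(y))
  # Assign fold labels 1..k within each class so class ratios stay balanced
  for (cls in unique(y)) {
    idx <- which(y == cls)
    idx <- idx[sample.int(length(idx))]  # shuffle positions within the class
    fold_of[idx] <- rep_len(seq_len(k), length(idx))
  }
  # Training indices for fold i are all samples not held out in fold i
  folds <- lapply(seq_len(k), function(i) which(fold_of != i))
  names(folds) <- paste0("Fold", seq_len(k))
  folds
}

# Toy outcome: 12 responders and 8 non-responders
y <- rep(c("R", "NR"), times = c(12, 8))
folds <- stratified_train_folds(y, k = 4, seed = 42)
```

Under these assumptions, extra settings such as k and seed would be supplied through fold_construction_args, for example list(k = 4, seed = 42).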

Value

A list containing:

  • Trained model (or meta-learner if stacking was used)

  • Features used in model training

  • Prediction performance metrics

  • AUC scores (AUROC and AUPRC)

  • Predicted class probabilities on test data