This function trains machine learning models with cross-validation on the training data and evaluates them on the test data. It supports feature selection with Boruta, model stacking, and cohort-based Leave-One-Dataset-Out (LODO) validation, and it selects the classification threshold that maximizes a specified performance metric.

Usage

compute_features.ML(
  features_train,
  features_test,
  clinical,
  trait,
  trait.positive,
  metric = "Accuracy",
  stack,
  k_folds = 10,
  n_rep = 5,
  LODO = FALSE,
  batch_id = NULL,
  file_name = NULL,
  ncores = NULL,
  maximize = "Accuracy",
  return = FALSE,
  fold_construction_fun = NULL,
  fold_construction_args_fixed = NULL,
  fold_construction_args_tunable = NULL
)
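
A minimal call sketch, assuming toy data. The objects feats_train, feats_test, and clin, the column name "response", and the class label "R" are invented for illustration. The call is guarded so the snippet also runs in a session where the function is not loaded, and stack is passed explicitly because the signature above lists no default for it.

```r
set.seed(1)

# Hypothetical toy data: 20 training and 10 test samples, 5 features each.
train_ids <- paste0("S", 1:20)
test_ids  <- paste0("S", 21:30)
feats_train <- as.data.frame(matrix(rnorm(20 * 5), nrow = 20,
                                    dimnames = list(train_ids, paste0("f", 1:5))))
feats_test  <- as.data.frame(matrix(rnorm(10 * 5), nrow = 10,
                                    dimnames = list(test_ids, paste0("f", 1:5))))

# Clinical table whose row names match the sample identifiers.
clin <- data.frame(
  response  = rep(c("R", "NR"), times = 15),
  row.names = c(train_ids, test_ids)
)

# Guarded call: only runs when the function is available in the session.
if (exists("compute_features.ML")) {
  res <- compute_features.ML(
    features_train = feats_train,
    features_test  = feats_test,
    clinical       = clin,
    trait          = "response",
    trait.positive = "R",
    metric         = "AUROC",
    stack          = FALSE,
    k_folds        = 5,
    n_rep          = 2,
    maximize       = "F1"
  )
}
```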

Arguments

features_train

A data frame of features used for training the models, with samples as rows.

features_test

A data frame of features used for testing the models.

clinical

A data frame containing clinical information, including the target variable and optionally a batch ID. Row names must match the sample identifiers in features_train and features_test.
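
The row-name requirement can be checked before calling the function. The sketch below is one way to do that in base R; check_sample_ids is a hypothetical helper, not part of the package, and the toy tables are invented.

```r
# Hypothetical helper (not part of the package): verify that every sample
# in a feature table has a matching row in the clinical table.
check_sample_ids <- function(features, clinical) {
  missing_ids <- setdiff(rownames(features), rownames(clinical))
  if (length(missing_ids) > 0) {
    stop("Samples missing from clinical: ", paste(missing_ids, collapse = ", "))
  }
  invisible(TRUE)
}

features <- data.frame(f1 = c(0.2, 1.4, -0.7), row.names = c("S1", "S2", "S3"))
clinical <- data.frame(response = c("R", "NR", "R", "NR"),
                       row.names = c("S3", "S1", "S2", "S4"))

check_sample_ids(features, clinical)

# Reorder clinical rows to follow the feature table's sample order.
clinical_aligned <- clinical[rownames(features), , drop = FALSE]
```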

trait

Character. The name of the column in clinical corresponding to the target variable.

trait.positive

Value in trait to be considered as the positive class.

metric

Character. Metric used for hyperparameter tuning and model selection. Supported values are "Accuracy", "AUROC", and "AUPRC".

stack

Logical. Whether to apply model stacking. The usage signature lists no default for this argument, so pass TRUE or FALSE explicitly.

k_folds

Integer. Number of folds for cross-validation.

n_rep

Integer. Number of cross-validation repetitions.
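
To illustrate how k_folds and n_rep interact, here is a base R sketch of repeated k-fold assignment. make_repeated_folds is a hypothetical helper; the package's own fold construction is not shown on this page and may differ (for example, by stratifying on the trait).

```r
# Hypothetical helper: build n_rep repetitions of k-fold assignments.
make_repeated_folds <- function(n_samples, k_folds = 10, n_rep = 5, seed = 1) {
  set.seed(seed)
  lapply(seq_len(n_rep), function(r) {
    # Shuffle fold labels so each repetition partitions the samples differently.
    fold_id <- sample(rep(seq_len(k_folds), length.out = n_samples))
    split(seq_len(n_samples), fold_id)
  })
}

folds <- make_repeated_folds(n_samples = 23, k_folds = 5, n_rep = 2)
lengths(folds[[1]])  # held-out fold sizes in the first repetition
```

Each repetition is a list of k_folds index vectors that together cover every sample exactly once, so a model is fitted k_folds * n_rep times in total.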

LODO

Logical. If TRUE, folds are constructed in a Leave-One-Dataset-Out (LODO) manner based on cohorts.

batch_id

A vector indicating the cohort/batch for each sample (only required if LODO = TRUE).
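
A rough sketch of what Leave-One-Dataset-Out folds look like, assuming one fold per distinct value of batch_id. make_lodo_folds is a hypothetical helper written for this illustration; it is not the package's internal implementation.

```r
# Hypothetical helper: one fold per cohort, holding that cohort out.
make_lodo_folds <- function(batch_id) {
  cohorts <- unique(batch_id)
  folds <- lapply(cohorts, function(cohort) {
    list(
      train    = which(batch_id != cohort),
      held_out = which(batch_id == cohort)
    )
  })
  names(folds) <- cohorts
  folds
}

batch_id <- c("A", "A", "B", "C", "B", "C", "A")
lodo <- make_lodo_folds(batch_id)
lodo$B$held_out  # indices of the samples from cohort "B"
```

With this scheme the number of folds equals the number of cohorts, and every validation set comes from a cohort the model has not seen during fitting.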

file_name

Character. Base name used to save plots in the Results/ directory.

ncores

Integer. Number of cores to use for parallelization. If NULL (the default), detectCores() - 1 is used.

maximize

A character string indicating which metric to maximize when selecting the best threshold for the confusion matrix. Options include "Accuracy", "Precision", "Recall", "Specificity", "Sensitivity", "F1", or "MCC". Default is "Accuracy".
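
Threshold selection of this kind can be sketched as a scan over candidate cutoffs. The helper best_threshold below is hypothetical and covers only "Accuracy" and "F1" for brevity; the cutoff grid and the toy predictions are assumptions, and the package may search thresholds differently.

```r
# Hypothetical helper: pick the probability cutoff that maximizes a metric.
best_threshold <- function(prob, truth, metric = c("Accuracy", "F1")) {
  metric <- match.arg(metric)
  cutoffs <- seq(0.05, 0.95, by = 0.05)
  scores <- vapply(cutoffs, function(cut) {
    pred <- prob >= cut
    tp <- sum(pred & truth);  fp <- sum(pred & !truth)
    fn <- sum(!pred & truth); tn <- sum(!pred & !truth)
    if (metric == "Accuracy") {
      (tp + tn) / length(truth)
    } else {
      # F1 = 2TP / (2TP + FP + FN); set to 0 when the denominator is 0.
      denom <- 2 * tp + fp + fn
      if (denom == 0) 0 else 2 * tp / denom
    }
  }, numeric(1))
  # which.max returns the first cutoff that reaches the maximum score.
  list(threshold = cutoffs[which.max(scores)], score = max(scores))
}

prob  <- c(0.12, 0.33, 0.41, 0.62, 0.78, 0.91)
truth <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
best <- best_threshold(prob, truth, metric = "F1")
```

The confusion matrix is then computed at best$threshold rather than at a fixed cutoff of 0.5.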

return

Logical. Whether to return and save plots generated by the function.

fold_construction_fun

Function. A custom function used to construct the cross-validation folds. This function must accept a bestune argument, which is used internally to inject optimized parameters after hyperparameter tuning. If bestune = NULL, the function will explore a parameter grid across folds (parallelized with foreach); if bestune is provided, the optimized parameters will be applied to rebuild the features on the full training data.

fold_construction_args_fixed

List. A list of arguments passed to fold_construction_fun that remain fixed during both cross-validation and final training.

fold_construction_args_tunable

List. A list of arguments passed to fold_construction_fun that define the hyperparameters to be tuned during cross-validation. Each element should contain candidate values for tuning.
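
The three fold_construction_* arguments fit together roughly as in the sketch below. Only the bestune argument comes from the description above; the function build_top_variance, its other arguments (data, n_top, standardize), and its return shape are invented for illustration, and the sketch uses lapply where the package is described as using foreach. The inputs and outputs the package actually expects from a custom function are not specified on this page.

```r
# Hypothetical feature builder: keeps the n most variable columns.
# Every argument name except `bestune` is invented for this sketch.
build_top_variance <- function(data, n_top = c(2, 3), standardize = TRUE,
                               bestune = NULL) {
  build_one <- function(n) {
    variances <- sort(vapply(data, var, numeric(1)), decreasing = TRUE)
    out <- data[, names(variances)[seq_len(n)], drop = FALSE]
    if (standardize) out <- as.data.frame(scale(out))
    out
  }
  if (is.null(bestune)) {
    # Tuning mode: one feature set per candidate value in the grid.
    grid <- expand.grid(n_top = n_top)
    setNames(lapply(grid$n_top, build_one), paste0("n_top=", grid$n_top))
  } else {
    # Final mode: rebuild the features once with the selected value.
    build_one(bestune$n_top)
  }
}

toy <- data.frame(a = 1:10, b = (1:10) * 5, c = (1:10) * 0.1, d = (1:10) * 3)

fixed   <- list(data = toy, standardize = TRUE)  # role of fold_construction_args_fixed
tunable <- list(n_top = c(2, 3))                 # role of fold_construction_args_tunable

candidates <- do.call(build_top_variance, c(fixed, tunable))
final      <- do.call(build_top_variance, c(fixed, list(bestune = list(n_top = 2))))
```

Keeping the fixed and tunable arguments in separate lists means the same function can serve both phases: the tunable list drives the grid during cross-validation, and bestune replaces it for the final fit.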

Value

A list containing:

  • Trained model (or meta-learner if stacking was used)

  • Features used in model training

  • Prediction performance metrics

  • AUC scores (AUROC and AUPRC)

  • Predicted class probabilities on test data