
Train and evaluate machine learning models with optional stacking and feature selection
Source: R/machine_learning.R
compute_features.ML.Rd
This function trains machine learning models using cross-validation on training data and evaluates them on test data. It supports feature selection with Boruta, model stacking, cohort-based (LODO) validation, and allows for optimizing predictions by maximizing a specified performance metric.
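Choosing a decision threshold by maximizing a metric can be illustrated with a small base R sketch. It is not the package's internal code: the helper best_threshold(), the threshold grid, and the toy prob and truth vectors are all assumptions made for illustration, and only two of the supported metrics are covered.

```r
# Illustrative sketch only: choose the probability cutoff that maximizes a metric.
# `best_threshold`, `prob`, and `truth` are hypothetical and are not part of
# compute_features.ML() or its output.
best_threshold <- function(prob, truth, metric = c("Accuracy", "F1")) {
  metric <- match.arg(metric)
  thresholds <- seq(0.05, 0.95, by = 0.05)  # assumed candidate grid

  scores <- vapply(thresholds, function(t) {
    pred <- prob >= t
    # Confusion-matrix counts at this cutoff
    tp <- sum(pred & truth)
    fp <- sum(pred & !truth)
    fn <- sum(!pred & truth)
    tn <- sum(!pred & !truth)
    if (metric == "Accuracy") {
      (tp + tn) / length(truth)
    } else {
      # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
      if (2 * tp + fp + fn == 0) 0 else 2 * tp / (2 * tp + fp + fn)
    }
  }, numeric(1))

  # which.max() returns the first cutoff reaching the best score
  list(threshold = thresholds[which.max(scores)], score = max(scores))
}

# Toy predicted probabilities and true labels (TRUE = positive class)
prob  <- c(0.12, 0.33, 0.47, 0.81, 0.93)
truth <- c(FALSE, FALSE, TRUE, TRUE, TRUE)

res <- best_threshold(prob, truth, metric = "Accuracy")
print(res)
```

In this toy data the classes are perfectly separable, so any cutoff between the highest negative score and the lowest positive score reaches an accuracy of 1.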
Usage
compute_features.ML(
  features_train,
  features_test,
  clinical,
  trait,
  trait.positive,
  metric = "Accuracy",
  stack,
  k_folds = 10,
  n_rep = 5,
  feature.selection = FALSE,
  seed,
  LODO = FALSE,
  n_boruta = 100,
  boruta_fix = FALSE,
  batch_id = NULL,
  file_name = NULL,
  ncores = NULL,
  maximize = "Accuracy",
  return = FALSE,
  fold_construction_fun = NULL,
  fold_construction_args = list()
)
Arguments
- features_train
A data frame of features used for training the models.
- features_test
A data frame of features used for testing the models.
- clinical
A data frame containing clinical information, including the target variable and optionally a batch ID. Row names must match the sample identifiers in features_train and features_test.
- trait
Character. The name of the column in clinical corresponding to the target variable.
- trait.positive
Value in trait to be considered as the positive class.
- metric
Character. Metric used for hyperparameter tuning and model selection. Supported values are "Accuracy", "AUROC", and "AUPRC".
- stack
Logical. Whether to apply model stacking. Default is FALSE.
- k_folds
Integer. Number of folds for cross-validation.
- n_rep
Integer. Number of cross-validation repetitions.
- feature.selection
Logical. Whether to apply Boruta feature selection before training. Default is FALSE.
- seed
Integer. Random seed for reproducibility.
- LODO
Logical. If TRUE, folds are constructed in a Leave-One-Dataset-Out (LODO) manner based on cohorts.
- n_boruta
Integer. Number of iterations to run Boruta. Since Boruta involves randomness, repeated runs improve consistency. Default is 100.
- boruta_fix
Logical. Whether to fix Boruta’s internal parameters. See compute_boruta() for details.
- batch_id
A vector indicating the cohort/batch for each sample (only required if LODO = TRUE).
- file_name
Character. Base name used to save plots in the Results/ directory.
- ncores
Integer. Number of cores to use for parallelization. If NULL (the default), detectCores() - 1 is used.
- maximize
A character string indicating which metric to maximize when selecting the best threshold for the confusion matrix. Options include "Accuracy", "Precision", "Recall", "Specificity", "Sensitivity", "F1", or "MCC". Default is "Accuracy".
- return
Logical. Whether to return and save plots generated by the function.
- fold_construction_fun
Function. A custom function used to construct the cross-validation folds. It should return a list of training indices for each fold.
- fold_construction_args
List. Named list of additional arguments to pass to fold_construction_fun.
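As a rough illustration of the fold_construction_fun contract (a function returning a list of training indices, one element per fold), the sketch below builds cohort-wise folds in base R. The helper make_cohort_folds() and its batch_id argument are hypothetical. How compute_features.ML() actually invokes the function, and which arguments it supplies itself, is not specified here, so check the package source before relying on this signature.

```r
# Hypothetical fold constructor: one fold per cohort. Each list element holds
# the TRAINING row indices for that fold, i.e. every sample outside the
# held-out cohort. The name and signature are assumptions for illustration.
make_cohort_folds <- function(batch_id) {
  cohorts <- unique(batch_id)
  folds <- lapply(cohorts, function(cohort) which(batch_id != cohort))
  names(folds) <- paste0("Fold_", cohorts)
  folds
}

# Toy cohort labels for six samples
batch_id <- c("A", "A", "B", "B", "B", "C")
folds <- make_cohort_folds(batch_id)
str(folds)
```

Assuming the entries of fold_construction_args are forwarded as named arguments, such a function might be supplied as fold_construction_fun = make_cohort_folds with fold_construction_args = list(batch_id = batch_id).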