A flexible and modular machine learning framework designed to support leakage-free model training through custom cross-validation fold construction
Installation
You can install the development version of pipeML from GitHub with:
# install.packages("pak")
pak::pkg_install("VeraPancaldiLab/pipeML")

Description
pipeML is a flexible and leakage-aware machine learning framework for R designed for predictive modeling in high-dimensional biological data. The package integrates all key steps of the machine learning workflow — feature selection, model training, validation, prediction, and interpretation — into a single reproducible pipeline.
A key design goal of pipeML is to support fold-aware feature construction, allowing features that depend on the dataset (e.g. enrichment scores, correlation-based features, or network-derived features) to be recomputed within each cross-validation fold. This prevents information leakage and ensures reliable performance estimation.
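To make the idea concrete, here is a minimal base-R sketch (independent of pipeML's actual API; the toy data and the choice of a correlation-based filter are illustrative assumptions) showing a dataset-dependent feature selection step recomputed inside each fold, using only that fold's training rows:

```r
set.seed(42)

# Toy high-dimensional data: 60 samples, 100 features, binary outcome
X <- matrix(rnorm(60 * 100), nrow = 60)
y <- rbinom(60, 1, 0.5)

k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(X)))  # random fold labels

for (i in seq_len(k)) {
  train_idx <- which(folds != i)
  test_idx  <- which(folds == i)

  # Dataset-dependent feature construction: rank features by their
  # correlation with the outcome, computed on the TRAINING rows only.
  # Computing this on the full data set would leak test-set information.
  cors <- abs(cor(X[train_idx, ], y[train_idx]))
  keep <- order(cors, decreasing = TRUE)[1:10]

  X_train <- X[train_idx, keep, drop = FALSE]
  X_test  <- X[test_idx,  keep, drop = FALSE]  # same columns, chosen leak-free
  # ... fit and evaluate a model on X_train / X_test here ...
}
```

The same pattern applies to enrichment scores or network-derived features: any quantity that depends on the data must be derived from the training portion of each fold.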
The framework is designed to integrate naturally with R/Bioconductor workflows, making it particularly suitable for omics and biomedical machine learning applications.
Figure 1. General structure of the pipeML machine learning pipeline.
Key Features
End-to-end ML workflow
- Integrated pipeline for feature selection, model training, validation, prediction, and interpretation
Leakage-aware validation
- Custom cross-validation fold construction
- Support for fold-aware feature recomputation
- Prevents information leakage when using dataset-dependent features
Flexible model evaluation
- Repeated and stratified k-fold cross-validation
- Leave-one-dataset-out (LODO) evaluation for cross-cohort generalization
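The LODO scheme can be sketched in a few lines of base R (a hypothetical toy example, not pipeML's interface; the `cohort` column and logistic model are assumptions for illustration). Each cohort is held out in turn, and the model is trained on the remaining cohorts:

```r
set.seed(1)

# Toy multi-cohort data: a cohort label plus two predictors (hypothetical)
df <- data.frame(
  cohort = rep(c("A", "B", "C"), each = 30),
  x1 = rnorm(90), x2 = rnorm(90)
)
df$y <- rbinom(90, 1, plogis(df$x1))

# Leave-one-dataset-out: train on all cohorts but one, test on the held-out cohort
lodo_acc <- sapply(unique(df$cohort), function(held_out) {
  train <- df[df$cohort != held_out, ]
  test  <- df[df$cohort == held_out, ]
  fit   <- glm(y ~ x1 + x2, data = train, family = binomial)
  pred  <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
  mean(pred == test$y)  # accuracy on the unseen cohort
})
lodo_acc  # one generalization estimate per held-out cohort
```

Because the test cohort never contributes to training, the resulting per-cohort scores estimate cross-cohort generalization rather than within-cohort fit.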
Supported Machine Learning Methods
Classification algorithms:
For classification tasks, pipeML implements a diverse set of classification algorithms that are benchmarked on the fly, making extensive use of the R package caret:
- Bagged classification trees
- Random forests
- C5.0 decision trees
- Regularized logistic regression (elastic net)
- k-nearest neighbors (KNN)
- Classification and regression trees (CART)
- Lasso regression
- Ridge regression
- Support vector machines with linear and radial kernels
- Extreme Gradient Boosting (XGBoost)
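As a rough illustration of on-the-fly benchmarking with caret (a sketch, not pipeML's internal code; it assumes the caret, randomForest, and glmnet packages are installed and uses caret's built-in simulator for the toy data):

```r
library(caret)

set.seed(7)
df <- twoClassSim(200)  # simulated binary classification data from caret

# Repeated, stratified cross-validation with AUROC as the selection metric
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# Benchmark two of the supported methods under the same resampling scheme
fits <- lapply(c(rf = "rf", glmnet = "glmnet"), function(m)
  train(Class ~ ., data = df, method = m, metric = "ROC", trControl = ctrl))

sapply(fits, function(f) max(f$results$ROC))  # best cross-validated AUROC per method
```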
Survival algorithms:
For time-to-event outcomes, pipeML implements a unified survival modeling framework based on the parsnip and workflows ecosystems, enabling consistent training, hyperparameter tuning, and evaluation across multiple survival model families.
- Cox proportional hazards model
- Elastic net–regularized Cox regression
- Parametric accelerated failure time (AFT) models
- Conditional inference survival trees
- Bagged CART survival models
- Random survival forests
- Gradient boosting for censored outcomes
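pipeML wires these families through parsnip and workflows; as a minimal standalone illustration of the first family, a Cox proportional hazards model can be fit with the base survival package, which ships with R (using its bundled lung dataset and two arbitrary covariates):

```r
library(survival)

# Cox proportional hazards on the bundled 'lung' dataset:
# time-to-event outcome Surv(time, status), two clinical covariates
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(fit)$coefficients  # log hazard ratios; exp(coef) gives hazard ratios

# Concordance index, a standard discrimination metric for survival models
concordance(fit)$concordance
```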
General usage
Below are basic examples showing how to use pipeML.
For a detailed tutorial, see Get started.
Training models
res <- compute_features.training.ML(features_train = X_train,
target_var = y_train,
task_type = "classification",
trait.positive = "1",
metric = "AUROC",
k_folds = 5,
n_rep = 10,
return = FALSE)

Predicting on new data
pred <- compute_prediction(model = res$Model,
test_data = X_test,
target_var = y_test,
task_type = "classification",
trait.positive = "1",
return = FALSE)

Training and Testing Workflow
res <- compute_features.ML(features_train = X_train,
features_test = X_test,
coldata = data,
task_type = "classification",
trait = "target",
trait.positive = "1",
metric = "AUROC",
k_folds = 5,
n_rep = 10,
ncores = 2)

Issues
If you encounter any problems or have questions about the package, we encourage you to open an issue here. We’ll do our best to assist you!
Authors
pipeML was developed by Marcelo Hurtado under the supervision of Vera Pancaldi, as part of the Pancaldi team. Marcelo is currently the primary maintainer of this package.
Citing pipeML
If you use pipeML in a scientific publication, we would appreciate a citation of the following preprint:
Hurtado, M., & Pancaldi, V. (2026). A new pipeline for cross-validation fold-aware machine learning prediction of clinical outcomes addresses hidden data-leakage in omics based ‘predictors’. bioRxiv. https://doi.org/10.64898/2026.03.12.711429
