
Machine Learning
library(CellTFusion)
Cell group scores computed by CellTFusion can be used directly as
input features for machine learning models to predict clinical traits.
This article demonstrates how to train and evaluate models using the pipeML
R package.
Installation:
pipeML is not on CRAN. Install it from GitHub before running the examples in this article:
remotes::install_github("VeraPancaldiLab/pipeML")
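The install step only needs to run once, so it can be guarded to keep the article safe to re-run. The following is a minimal sketch, assuming the remotes package is installed; `install_if_missing()` is a hypothetical helper written for this example and is not part of pipeML or remotes.

```r
# Hypothetical helper (not part of pipeML): install a GitHub package only
# when it is not already available in the current library.
install_if_missing <- function(pkg, repo, installer = NULL) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already installed, nothing to do
  }
  if (is.null(installer)) {
    installer <- remotes::install_github  # assumes 'remotes' is installed
  }
  installer(repo)
  invisible(TRUE)
}

# Usage (commented out so that sourcing this block never triggers a download):
# install_if_missing("pipeML", "VeraPancaldiLab/pipeML")
```

The `installer` argument exists only so the helper can be exercised without network access; in normal use it can be left at its default.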
Train/test split
Start by splitting the dataset into training and testing cohorts:
raw.counts <- CellTFusion::raw.counts.tuto
traitdata <- CellTFusion::traitdata.tuto
index <- caret::createDataPartition(
traitdata[, "Best.Confirmed.Overall.Response"],
times = 1, p = 0.8, list = FALSE
)
traitData_train <- traitdata[index, ]
raw.counts_train <- raw.counts[, index]
traitData_test <- traitdata[-index, ]
raw.counts_test <- raw.counts[, -index]
Run CellTFusion on the training set
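CellTFusion receives the count matrix and the trait table as separate objects, and the split above indexes count columns with row positions taken from the trait table. Both therefore need to list samples in the same order. The sketch below shows one way to check and enforce that on toy data. It assumes sample IDs are stored as column names of the counts and row names of the traits, which may not match how the tutorial objects are organised; `align_traits()` is a hypothetical helper.

```r
# Toy stand-ins for the tutorial objects (hypothetical data).
toy_counts <- matrix(1:12, nrow = 3,
                     dimnames = list(paste0("gene", 1:3), paste0("S", 1:4)))
toy_traits <- data.frame(response = c("PD", "CR", "PD", "CR"),
                         row.names = paste0("S", c(2, 1, 4, 3)))

# Hypothetical helper: reorder trait rows to match the count columns,
# stopping if the two objects do not contain the same set of samples.
align_traits <- function(counts, traits) {
  stopifnot(setequal(colnames(counts), rownames(traits)))
  traits[colnames(counts), , drop = FALSE]
}

toy_traits_aligned <- align_traits(toy_counts, toy_traits)

# TRUE once the rows follow the column order of the counts.
identical(rownames(toy_traits_aligned), colnames(toy_counts))
```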
res_training <- CellTFusion(
raw.counts = raw.counts_train,
normalized = TRUE,
coldata = traitData_train,
trait = "Best.Confirmed.Overall.Response",
trait.positive = "PD",
deconv_methods = c("Quantiseq", "Epidish"),
file_name = "TestRun",
corr = 0.7,
pval = 0.05,
high_corr_groups = 0.85,
return = FALSE
)
Train a machine learning model
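The training call in this section sets k_folds = 2 and n_rep = 2. As a reminder of what repeated, stratified k-fold resampling means, here is a generic base-R sketch on toy labels. It is not pipeML's implementation, and `stratified_folds()` is a hypothetical helper.

```r
# Hypothetical helper: assign each sample to one of k folds, balancing classes.
stratified_folds <- function(y, k) {
  fold <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    # Shuffle within the class, then deal fold labels 1..k in turn.
    idx <- idx[sample.int(length(idx))]
    fold[idx] <- rep_len(seq_len(k), length(idx))
  }
  fold
}

set.seed(1)
y <- rep(c("PD", "nonPD"), times = c(6, 10))  # toy labels
k_folds <- 2
n_rep <- 2

# One fold assignment per repetition; each yields k_folds train/validation splits.
assignments <- replicate(n_rep, stratified_folds(y, k_folds), simplify = FALSE)
n_resamples <- n_rep * k_folds  # model fits per configuration

# Each class is split evenly across the folds of the first repetition.
table(y, fold = assignments[[1]])
```

With these settings every configuration is fitted four times, which keeps the example fast but gives a noisy performance estimate; larger values are common outside of tutorials.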
Use pipeML::compute_features.training.ML() to train
classifiers on the cell group scores:
library(pipeML)
res <- pipeML::compute_features.training.ML(
features_train = res_training$Cell_groups[[1]],
target_var = traitData_train$Best.Confirmed.Overall.Response,
trait.positive = "PD",
metric = "AUROC",
task_type = "classification",
stack = FALSE,
k_folds = 2,
n_rep = 2,
ncores = 2,
return = FALSE
)
Project cell groups to the test set
To apply the trained cell groups to an independent dataset, replicate the deconvolution and cell group structure:
deconv_test <- multideconv::compute.deconvolution(
raw.counts_test,
methods = c("Quantiseq", "Epidish"),
normalized = TRUE,
return = FALSE
)
testing_set <- compute.test.set(
res_training$Processed_deconvolution,
res_training$Cell_groups,
names(res_training$Cell_groups[[2]]),
deconv_test
)
Predict on the test set
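The prediction step reports AUROC, the same metric requested during training. As a reminder of what that number measures, the sketch below computes it from ranks (the Mann-Whitney formulation) on toy values. It is a generic illustration, not the code pipeML uses, and `auroc()` is a hypothetical helper.

```r
# Hypothetical helper: AUROC from scores and a logical vector marking positives.
auroc <- function(scores, is_positive) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(scores)  # average ranks handle ties
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: predicted probability of "PD" for six samples.
truth <- c("PD", "PD", "PD", "other", "other", "other")
prob_pd <- c(0.9, 0.8, 0.4, 0.6, 0.3, 0.2)

# 8 of the 9 positive/negative pairs are ordered correctly, so this is 8/9.
auroc(prob_pd, truth == "PD")
```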
pred <- pipeML::compute_prediction(
res, testing_set,
traitData_test$Best.Confirmed.Overall.Response,
"PD", stack = FALSE
)
head(pred$Metrics[, 1:5])
pred$AUC$AUROC
Custom k-fold cross-validation
Some use cases require carefully designed train/test splits to prevent data leakage. This matters most for feature construction methods that depend on sample-level correlations or deconvolution results: if such features are computed on the full dataset before splitting, information from the test set can inadvertently influence training.
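The same principle applies to any data-dependent transformation. The sketch below uses feature standardization as a deliberately simple stand-in for CellTFusion's feature construction: the parameters are learned from the training samples only and then applied unchanged to the test samples. The data and index choices are hypothetical.

```r
set.seed(42)
# Toy feature matrix: 10 samples x 3 features (hypothetical data).
x <- matrix(rnorm(30, mean = 5, sd = 2), nrow = 10,
            dimnames = list(paste0("S", 1:10), paste0("f", 1:3)))
train_idx <- 1:8
test_idx <- 9:10

# Learn centering and scaling parameters from the training samples only.
mu <- colMeans(x[train_idx, , drop = FALSE])
sigma <- apply(x[train_idx, , drop = FALSE], 2, sd)

# Apply the training parameters to both sets; the test samples never
# contribute to mu or sigma.
x_train <- scale(x[train_idx, , drop = FALSE], center = mu, scale = sigma)
x_test <- scale(x[test_idx, , drop = FALSE], center = mu, scale = sigma)

# Training columns are centered at zero; test columns generally are not.
round(colMeans(x_train), 10)
```

The leaky alternative would be calling `scale(x)` on all ten samples before splitting, which lets the two test samples shift the mean and standard deviation used for training.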
CellTFusion provides
prepare_CellTFusion_folds(), which is fully compatible with
pipeML and recomputes CellTFusion features inside each
fold. This ensures that:
- Test samples are never included in the feature learning process.
- Hyperparameter tuning can be performed safely without bias.
- Parallelization (via foreach/doParallel) reduces runtime.
universe <- decoupleR::get_collectri(organism = "human", split_complexes = FALSE)
paths <- decoupleR::get_progeny(organism = "human", top = 500)
res_groups <- pipeML::compute_features.training.ML(
t(raw.counts_train),
traitData_train$Best.Confirmed.Overall.Response,
trait.positive = "PD",
metric = "AUROC",
stack = FALSE,
k_folds = 3,
n_rep = 5,
LODO = FALSE,
file_name = "Test",
ncores = 2,
return = TRUE,
fold_construction_fun = prepare_CellTFusion_folds,
fold_construction_args_fixed = list(
deconv = deconv,
universe = universe,
paths = paths,
ncores = 2,
normalized = TRUE,
coldata = traitData_train,
trait = "Best.Confirmed.Overall.Response",
trait.positive = "PD"
),
fold_construction_args_tunable = list(
min_targets_size = c(5, 10, 15, 20),
minMod = c(5, 10, 15, 20),
corr_mod = c(0.7, 0.8, 0.9),
corr = c(0.7, 0.8, 0.9),
high_corr_groups = 0.9
)
)
fold_construction_args_fixed holds parameters that stay
constant across folds; fold_construction_args_tunable
defines hyperparameters to tune during cross-validation.
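To get a feel for the size of the search space defined by the tunable list, the values can be expanded into a full grid. This is only a rough sketch: whether pipeML evaluates every combination or searches the space in some other way is an assumption made here for illustration.

```r
# Tunable values copied from the call above.
tunable <- list(
  min_targets_size = c(5, 10, 15, 20),
  minMod = c(5, 10, 15, 20),
  corr_mod = c(0.7, 0.8, 0.9),
  corr = c(0.7, 0.8, 0.9),
  high_corr_groups = 0.9
)

# Full Cartesian grid: one row per candidate configuration.
grid <- do.call(expand.grid, tunable)
nrow(grid)  # 4 * 4 * 3 * 3 * 1 = 144

# With k_folds = 3 and n_rep = 5, an exhaustive search would rebuild the
# features once per configuration and resample.
nrow(grid) * 3 * 5  # 2160
```

Because features are recomputed inside each fold, trimming the tunable values is usually the most direct way to reduce runtime.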
👉 For full details on pipeML arguments, see the pipeML
tutorial.