Machine learning workflows • multideconv

Machine learning models using deconvolution subgroups

The deconvolution subgroups generated by multideconv can be used as input features for training machine learning (ML) models. However, because these subgroups are derived from correlations computed across samples, special care is needed to avoid data leakage. If you compute the full subgroup matrix on the entire dataset before splitting it into training and test sets (or before performing k-fold cross-validation), the subgroup definitions already reflect the test samples, so the model indirectly accesses test-set information during training. This is a form of hidden data leakage described in Hurtado and Pancaldi (2026).
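The leakage can be illustrated with a toy example in base R. This is only a sketch under stated assumptions: the matrix `deconv`, the index vectors and the simulated values are hypothetical, and plain feature-feature correlation stands in for multideconv's actual subgroup computation. It shows that correlations computed on all samples change when only the test samples change, whereas correlations computed on the training samples do not.

```r
# Toy illustration (base R only; not multideconv's actual algorithm).
set.seed(1)
n_samples <- 40
n_features <- 6

# Hypothetical deconvolution matrix: samples in rows, features in columns
deconv <- matrix(runif(n_samples * n_features), nrow = n_samples,
                 dimnames = list(paste0("S", seq_len(n_samples)),
                                 paste0("feature_", seq_len(n_features))))

test_idx <- 31:40                                   # held-out samples
train_idx <- setdiff(seq_len(n_samples), test_idx)  # training samples

# Leaky: correlations use every sample, including the test set
cor_all <- cor(deconv)
# Fold-aware: correlations use the training samples only
cor_train <- cor(deconv[train_idx, ])

# Perturb only the test samples
deconv_perturbed <- deconv
deconv_perturbed[test_idx, ] <- deconv_perturbed[test_idx, ] +
  runif(length(test_idx) * n_features)

# The leaky correlations react to the test samples ...
leaky_changed <- !isTRUE(all.equal(cor_all, cor(deconv_perturbed)))
# ... while the training-only correlations stay the same
train_unchanged <- isTRUE(all.equal(cor_train,
                                    cor(deconv_perturbed[train_idx, ])))
```

Any feature grouping built from `cor_all` therefore depends on samples that the model is later evaluated on, which is the situation to avoid.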

To address this issue, we provide the function prepare_multideconv_folds(), which keeps training and test data separate while the deconvolution subgroups are computed. It constructs the ML folds so that subgroup features are learned only from the training data of each fold and then projected onto the corresponding held-out test fold.
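The general idea of learning on the training part of a fold and projecting onto the held-out part can be sketched in base R. The helpers `learn_groups()` and `apply_groups()` below are hypothetical and written only for illustration: they group features by hierarchical clustering on correlation and average them, which is a simplification and not the procedure implemented in prepare_multideconv_folds(). The simulated matrix `x` and the fold assignment are likewise assumptions.

```r
# Hypothetical helpers for illustration; multideconv's real grouping differs.
learn_groups <- function(x_train, n_groups = 3) {
  # Cluster features by their correlation across the training samples
  d <- as.dist(1 - cor(x_train))
  cutree(hclust(d, method = "average"), k = n_groups)
}

apply_groups <- function(x, groups) {
  # Average the features of each group; the grouping itself stays fixed
  sapply(split(names(groups), groups),
         function(cols) rowMeans(x[, cols, drop = FALSE]))
}

set.seed(42)
# Simulated feature matrix: 50 samples in rows, 8 features in columns
x <- matrix(runif(50 * 8), nrow = 50,
            dimnames = list(paste0("S", 1:50), paste0("feature_", 1:8)))

k <- 5
fold_id <- sample(rep(seq_len(k), length.out = nrow(x)))

folds <- lapply(seq_len(k), function(i) {
  train <- x[fold_id != i, , drop = FALSE]
  test <- x[fold_id == i, , drop = FALSE]
  groups <- learn_groups(train)            # learned from training samples only
  list(train = apply_groups(train, groups),
       test = apply_groups(test, groups))  # same grouping applied to held-out samples
})
```

Each element of `folds` holds a training matrix and a test matrix with identical columns, and the test samples never influence how the columns were defined.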

Below is an example of using multideconv subgroups with pipeML, an R package that allows custom k-fold construction. We pass our function prepare_multideconv_folds to the fold_construction_fun argument so that the folds of the model are computed with multideconv. For more information about the function arguments, see the pipeML documentation:

library(multideconv) # provides prepare_multideconv_folds() and replicate_deconvolution_subgroups()
library(pipeML)

# traitData_train: Subset of clinical data used for training
# traitData_test: Subset of clinical data used for testing
# deconv_train: Deconvolution of samples used for training
# deconv_test: Deconvolution of samples used for testing

# Train on deconvolution subgroups; folds are built by prepare_multideconv_folds
res = pipeML::compute_features.training.ML(deconv_train,
                                           traitData_train$Response,
                                           trait.positive = "R",
                                           metric = "AUROC",
                                           stack = FALSE,
                                           k_folds = 5,
                                           n_rep = 10,
                                           feature.selection = FALSE,
                                           LODO = FALSE,
                                           ncores = 3,
                                           return = FALSE,
                                           fold_construction_fun = prepare_multideconv_folds)

# Replicate the subgroups learned during training on the test samples
dt_test = replicate_deconvolution_subgroups(res$Custom_output$Processed_deconvolution,
                                            deconv_test)

# Predict on the replicated deconvolution subgroups
pred = pipeML::compute_prediction(res,
                                  dt_test,
                                  traitData_test$Response,
                                  trait.positive = "R",
                                  stack = FALSE,
                                  maximize = "Accuracy",
                                  return = TRUE)

NOTE: multideconv is built on top of existing frameworks and makes extensive use of the R packages immunedeconv (Sturm et al. 2019) and omnideconv (Dietrich et al. 2024). If you use multideconv in your work, please cite our package along with these foundational packages. We also encourage you to cite the individual deconvolution algorithms you use in your analysis.

method	license	citation
quanTIseq	free (BSD)	Finotello, F., Mayer, C., Plattner, C., Laschober, G., Rieder, D., Hackl, H., …, Sopper, S. (2019). Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data. Genome medicine, 11(1), 34. https://doi.org/10.1186/s13073-019-0638-6
EpiDISH	free (GPL 2.0)	Zheng SC, Breeze CE, Beck S, Teschendorff AE (2018). “Identification of differentially methylated cell-types in Epigenome-Wide Association Studies.” Nature Methods, 15(12), 1059. https://doi.org/10.1038/s41592-018-0213-x
DeconRNASeq	free (GPL-2)	Gong, T., & Szustakowski, J. D. (2025). DeconRNASeq: Deconvolution of Heterogeneous Tissue Samples for mRNA-Seq data. doi:10.18129/B9.bioc.DeconRNASeq, R package version 1.50.0, https://bioconductor.org/packages/DeconRNASeq
AutoGeneS	free (MIT)	Aliee, H., & Theis, F. (2021). AutoGeneS: Automatic gene selection using multi-objective optimization for RNA-seq deconvolution. https://doi.org/10.1101/2020.02.21.940650
BayesPrism	free (GPL 3.0)	Chu, T., Wang, Z., Pe’er, D. et al. Cell type and gene expression deconvolution with BayesPrism enables Bayesian integrative analysis across bulk and single-cell RNA sequencing in oncology. Nat Cancer 3, 505–517 (2022). https://doi.org/10.1038/s43018-022-00356-3
Bisque	free (GPL 3.0)	Jew, B., Alvarez, M., Rahmani, E., Miao, Z., Ko, A., Garske, K. M., Sul, J. H., Pietiläinen, K. H., Pajukanta, P., & Halperin, E. (2020). Publisher Correction: Accurate estimation of cell composition in bulk expression through robust integration of single-cell information. Nature Communications, 11(1), 2891. https://doi.org/10.1038/s41467-020-16607-9
BSeq-sc	free (GPL 2.0)	Baron, M., Veres, A., Wolock, S. L., Faust, A. L., Gaujoux, R., Vetere, A., Ryu, J. H., Wagner, B. K., Shen-Orr, S. S., Klein, A. M., Melton, D. A., & Yanai, I. (2016). A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. In Cell Systems (Vol. 3, Issue 4, pp. 346–360.e4). https://doi.org/10.1016/j.cels.2016.08.011
CIBERSORTx	free for non-commercial use only	Newman, A. M., Steen, C. B., Liu, C. L., Gentles, A. J., Chaudhuri, A. A., Scherer, F., Khodadoust, M. S., Esfahani, M. S., Luca, B. A., Steiner, D., Diehn, M., & Alizadeh, A. A. (2019). Determining cell type abundance and expression from bulk tissues with digital cytometry. Nature Biotechnology, 37(7), 773–782. https://doi.org/10.1038/s41587-019-0114-2
CPM	free (GPL 2.0)	Frishberg, A., Peshes-Yaloz, N., Cohn, O., Rosentul, D., Steuerman, Y., Valadarsky, L., Yankovitz, G., Mandelboim, M., Iraqi, F. A., Amit, I., Mayo, L., Bacharach, E., & Gat-Viks, I. (2019). Cell composition analysis of bulk genomics using single-cell data. Nature Methods, 16(4), 327–332. https://doi.org/10.1038/s41592-019-0355-5
DWLS	free (GPL)	Tsoucas, D., Dong, R., Chen, H., Zhu, Q., Guo, G., & Yuan, G.-C. (2019). Accurate estimation of cell-type composition from gene expression data. Nature Communications, 10(1), 2975. https://doi.org/10.1038/s41467-019-10802-z
MOMF	free (GPL 3.0)	Xifang Sun, Shiquan Sun, and Sheng Yang. An efficient and flexible method for deconvoluting bulk RNAseq data with single-cell RNAseq data, 2019, DOI: 10.5281/zenodo.3373980
MuSiC	free (GPL 3.0)	Wang, X., Park, J., Susztak, K., Zhang, N. R., & Li, M. (2019). Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nature Communications, 10(1), 380. https://doi.org/10.1038/s41467-018-08023-x
SCDC	free (MIT)	Dong, M., Thennavan, A., Urrutia, E., Li, Y., Perou, C. M., Zou, F., & Jiang, Y. (2020). SCDC: bulk gene expression deconvolution by multiple single-cell RNA sequencing references. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bbz166

References

Dietrich, Alexander, Lorenzo Merotto, Konstantin Pelz, et al. 2024. “Benchmarking Second-Generation Methods for Cell-Type Deconvolution of Transcriptomic Data.” bioRxiv, ahead of print. https://doi.org/10.1101/2024.06.10.598226.

Hurtado, Marcelo, and Vera Pancaldi. 2026. “A New Pipeline for Cross-Validation Fold-Aware Machine Learning Prediction of Clinical Outcomes Addresses Hidden Data-Leakage in Omics Based ‘Predictors’.” bioRxiv, ahead of print. https://doi.org/10.64898/2026.03.12.711429.

Sturm, Gregor, Francesca Finotello, Florent Petitprez, et al. 2019. “Comprehensive Evaluation of Transcriptome-Based Cell-Type Quantification Methods for Immuno-Oncology.” Bioinformatics 35 (14): i436–45. https://doi.org/10.1093/bioinformatics/btz363.