
Machine learning models using deconvolution subgroups

The deconvolution subgroups generated by multideconv can be used as input features for training machine learning (ML) models. However, because these subgroups are derived from sample-level correlations, special care is needed to avoid data leakage. If you compute the subgroup matrix on the entire dataset before splitting it into training and test sets (or before performing k-fold cross-validation), the model indirectly accesses information from the test set during training, which can inflate performance estimates.
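The dependence on held-out samples can be seen in a minimal base R sketch. It uses simulated data and made-up object names (`deconv_toy`, `test_idx`), not real multideconv output, and only shows that feature correlations change once the test samples are excluded:

```r
set.seed(1)

# Toy deconvolution matrix: 40 samples (rows) x 6 cell-type features (columns)
n <- 40
deconv_toy <- matrix(rnorm(n * 6), nrow = n,
                     dimnames = list(paste0("s", seq_len(n)),
                                     paste0("cell", 1:6)))

# Hold out 10 samples as a test set
test_idx <- sample(n, 10)
train <- deconv_toy[-test_idx, , drop = FALSE]

# Feature-feature correlations with and without the test samples
cor_full <- cor(deconv_toy)
cor_train <- cor(train)

# Any nonzero difference means the full-data correlations, and any
# subgroups derived from them, depend on the held-out samples
max(abs(cor_full - cor_train))
```

Because the two correlation matrices differ, any grouping built from `cor_full` already encodes information about the samples later used for testing.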

To address this issue, we provide the function prepare_multideconv_folds(), which keeps training and test data separate while the deconvolution subgroups are computed. It constructs the ML folds so that subgroup features are learned only from the training data of each fold and then projected onto the corresponding held-out test fold.
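The general pattern of learning features per fold can be sketched in base R. This is an illustration under simplifying assumptions, not the implementation of prepare_multideconv_folds(): the helpers `learn_groups()` and `apply_groups()` are hypothetical, and the grouping rule (hierarchical clustering on training-set correlations, then averaging each group) stands in for the actual multideconv subgrouping.

```r
# Hypothetical helper: group features using training-set correlations only
learn_groups <- function(train, cor_cutoff = 0.7) {
  d <- as.dist(1 - cor(train))
  cutree(hclust(d, method = "average"), h = 1 - cor_cutoff)
}

# Hypothetical helper: average the features of each group; the groups
# were learned on training data and are reused unchanged on any other data
apply_groups <- function(x, groups) {
  ids <- sort(unique(groups))
  out <- sapply(ids, function(g) rowMeans(x[, groups == g, drop = FALSE]))
  rownames(out) <- rownames(x)
  colnames(out) <- paste0("subgroup_", ids)
  out
}

set.seed(42)
n <- 60
x <- matrix(rnorm(n * 6), nrow = n,
            dimnames = list(paste0("s", seq_len(n)), paste0("cell", 1:6)))
# Make two features strongly correlated so that they form one subgroup
x[, "cell2"] <- x[, "cell1"] + rnorm(n, sd = 0.1)

# Assign each sample to one of 5 folds
folds <- sample(rep(1:5, length.out = n))

fold_features <- lapply(1:5, function(k) {
  train <- x[folds != k, , drop = FALSE]
  test <- x[folds == k, , drop = FALSE]
  groups <- learn_groups(train)            # learned on training samples only
  list(train = apply_groups(train, groups),
       test = apply_groups(test, groups))  # projected onto the held-out fold
})
```

Each element of `fold_features` holds a training matrix and a test matrix with the same subgroup columns, and the test samples never influence how the subgroups are defined.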

Below is an example that uses multideconv subgroups with pipeML, an R package that supports custom k-fold construction. We pass our custom function to the fold_construction_fun argument so that the folds of the model are computed with multideconv. For more information about the function arguments, see the pipeML documentation:

library(multideconv) # provides prepare_multideconv_folds() and replicate_deconvolution_subgroups()
library(pipeML)

# traitData_train: Subset of clinical data used for training
# traitData_test: Subset of clinical data used for testing
# deconv_train: Deconvolution of samples used for training
# deconv_test: Deconvolution of samples used for testing

# Train on deconvolution subgroups; folds are built by prepare_multideconv_folds
res <- pipeML::compute_features.training.ML(deconv_train,
                                            traitData_train$Response,
                                            trait.positive = "R",
                                            metric = "AUROC",
                                            stack = FALSE,
                                            k_folds = 5,
                                            n_rep = 10,
                                            feature.selection = FALSE,
                                            LODO = FALSE,
                                            ncores = 3,
                                            return = FALSE,
                                            fold_construction_fun = prepare_multideconv_folds)

# Replicate the training subgroups in the test deconvolution
dt_test <- replicate_deconvolution_subgroups(res$Custom_output$Processed_deconvolution,
                                             deconv_test)

# Predict on the test deconvolution subgroups
pred <- pipeML::compute_prediction(res,
                                   dt_test,
                                   traitData_test$Response,
                                   trait.positive = "R",
                                   stack = FALSE,
                                   maximize = "Accuracy",
                                   return = TRUE)

NOTE: multideconv is built on top of existing frameworks and makes extensive use of the R packages immunedeconv (Sturm et al. 2019) and omnideconv (Dietrich et al. 2024). If you use multideconv in your work, please cite our package along with these foundational packages. We also encourage you to cite the individual deconvolution algorithms you employ in your analysis.

| Method | License | Citation |
| --- | --- | --- |
| quanTIseq | free (BSD) | Finotello, F., Mayer, C., Plattner, C., Laschober, G., Rieder, D., Hackl, H., …, Sopper, S. (2019). Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data. Genome Medicine, 11(1), 34. https://doi.org/10.1186/s13073-019-0638-6 |
| EpiDISH | free (GPL 2.0) | Zheng SC, Breeze CE, Beck S, Teschendorff AE (2018). “Identification of differentially methylated cell-types in Epigenome-Wide Association Studies.” Nature Methods, 15(12), 1059. https://doi.org/10.1038/s41592-018-0213-x |
| DeconRNASeq | free (GPL-2) | Gong, T., & Szustakowski, J. D. (2025). DeconRNASeq: Deconvolution of Heterogeneous Tissue Samples for mRNA-Seq data. doi:10.18129/B9.bioc.DeconRNASeq, R package version 1.50.0, https://bioconductor.org/packages/DeconRNASeq |
| AutoGeneS | free (MIT) | Aliee, H., & Theis, F. (2021). AutoGeneS: Automatic gene selection using multi-objective optimization for RNA-seq deconvolution. https://doi.org/10.1101/2020.02.21.940650 |
| BayesPrism | free (GPL 3.0) | Chu, T., Wang, Z., Pe’er, D. et al. Cell type and gene expression deconvolution with BayesPrism enables Bayesian integrative analysis across bulk and single-cell RNA sequencing in oncology. Nat Cancer 3, 505–517 (2022). https://doi.org/10.1038/s43018-022-00356-3 |
| Bisque | free (GPL 3.0) | Jew, B., Alvarez, M., Rahmani, E., Miao, Z., Ko, A., Garske, K. M., Sul, J. H., Pietiläinen, K. H., Pajukanta, P., & Halperin, E. (2020). Publisher Correction: Accurate estimation of cell composition in bulk expression through robust integration of single-cell information. Nature Communications, 11(1), 2891. https://doi.org/10.1038/s41467-020-16607-9 |
| BSeq-sc | free (GPL 2.0) | Baron, M., Veres, A., Wolock, S. L., Faust, A. L., Gaujoux, R., Vetere, A., Ryu, J. H., Wagner, B. K., Shen-Orr, S. S., Klein, A. M., Melton, D. A., & Yanai, I. (2016). A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. In Cell Systems (Vol. 3, Issue 4, pp. 346–360.e4). https://doi.org/10.1016/j.cels.2016.08.011 |
| CIBERSORTx | free for non-commercial use only | Newman, A. M., Steen, C. B., Liu, C. L., Gentles, A. J., Chaudhuri, A. A., Scherer, F., Khodadoust, M. S., Esfahani, M. S., Luca, B. A., Steiner, D., Diehn, M., & Alizadeh, A. A. (2019). Determining cell type abundance and expression from bulk tissues with digital cytometry. Nature Biotechnology, 37(7), 773–782. https://doi.org/10.1038/s41587-019-0114-2 |
| CPM | free (GPL 2.0) | Frishberg, A., Peshes-Yaloz, N., Cohn, O., Rosentul, D., Steuerman, Y., Valadarsky, L., Yankovitz, G., Mandelboim, M., Iraqi, F. A., Amit, I., Mayo, L., Bacharach, E., & Gat-Viks, I. (2019). Cell composition analysis of bulk genomics using single-cell data. Nature Methods, 16(4), 327–332. https://doi.org/10.1038/s41592-019-0355-5 |
| DWLS | free (GPL) | Tsoucas, D., Dong, R., Chen, H., Zhu, Q., Guo, G., & Yuan, G.-C. (2019). Accurate estimation of cell-type composition from gene expression data. Nature Communications, 10(1), 2975. https://doi.org/10.1038/s41467-019-10802-z |
| MOMF | free (GPL 3.0) | Xifang Sun, Shiquan Sun, and Sheng Yang. An efficient and flexible method for deconvoluting bulk RNAseq data with single-cell RNAseq data, 2019, DOI: 10.5281/zenodo.3373980 |
| MuSiC | free (GPL 3.0) | Wang, X., Park, J., Susztak, K., Zhang, N. R., & Li, M. (2019). Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nature Communications, 10(1), 380. https://doi.org/10.1038/s41467-018-08023-x |
| SCDC | free (MIT) | Dong, M., Thennavan, A., Urrutia, E., Li, Y., Perou, C. M., Zou, F., & Jiang, Y. (2020). SCDC: bulk gene expression deconvolution by multiple single-cell RNA sequencing references. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bbz166 |

References

Dietrich, Alexander, Lorenzo Merotto, Konstantin Pelz, Bernhard Eder, Constantin Zackl, Katharina Reinisch, Frank Edenhofer, et al. 2024. “Benchmarking Second-Generation Methods for Cell-Type Deconvolution of Transcriptomic Data.” bioRxiv. https://doi.org/10.1101/2024.06.10.598226.
Sturm, Gregor, Francesca Finotello, Florent Petitprez, Jitao David Zhang, Jan Baumbach, Wolf H Fridman, Markus List, and Tatsiana Aneichyk. 2019. “Comprehensive Evaluation of Transcriptome-Based Cell-Type Quantification Methods for Immuno-Oncology.” Bioinformatics 35 (14): i436–45. https://doi.org/10.1093/bioinformatics/btz363.