Machine learning models using deconvolution subgroups
The deconvolution subgroups generated by multideconv can
be used as input features for training machine learning (ML) models.
However, because these subgroups are derived from sample-level
correlations, special care is needed to avoid data leakage. If you
compute the full subgroup matrix on the entire dataset before splitting
into training and test sets (or before performing k-fold
cross-validation), the model may indirectly access information from the
test set during training.
To address this issue, we provide the function
prepare_multideconv_folds(), which ensures proper
separation between training and test data during the computation of
deconvolution subgroups. This function constructs the ML folds so that
subgroup features are learned only from the training data of each fold
and then applied to the corresponding held-out test fold.
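The fold-wise principle can be illustrated with a minimal base R sketch. This is not the multideconv implementation: the toy matrix `X` and the helpers `learn_groups()` and `apply_groups()` are hypothetical names introduced only for illustration, and grouping features by hierarchical clustering on correlation is an assumed stand-in for the actual subgroup computation. The point is only that the grouping is learned from the training rows of each fold and then reused unchanged on the held-out rows.

```r
# Illustrative sketch only: NOT the multideconv implementation.
set.seed(42)

# Toy feature matrix: 40 samples x 6 features (stand-in for deconvolution features)
n <- 40
base1 <- rnorm(n)
base2 <- rnorm(n)
X <- cbind(
  f1 = base1 + rnorm(n, sd = 0.1),
  f2 = base1 + rnorm(n, sd = 0.1),
  f3 = base2 + rnorm(n, sd = 0.1),
  f4 = base2 + rnorm(n, sd = 0.1),
  f5 = rnorm(n),
  f6 = rnorm(n)
)

# Hypothetical helper: group features whose absolute correlation is high,
# using ONLY the rows it is given (the training rows of a fold)
learn_groups <- function(X_train, cor_threshold = 0.7) {
  d <- as.dist(1 - abs(cor(X_train)))
  cutree(hclust(d, method = "average"), h = 1 - cor_threshold)
}

# Hypothetical helper: average the member features of each learned group
apply_groups <- function(X_new, groups) {
  sapply(split(names(groups), groups), function(members) {
    rowMeans(X_new[, members, drop = FALSE])
  })
}

# Assign each sample to one of k folds
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

folds <- lapply(seq_len(k), function(i) {
  test_idx <- which(fold_id == i)
  X_train <- X[-test_idx, , drop = FALSE]
  X_test <- X[test_idx, , drop = FALSE]
  # Grouping is learned from the training rows only ...
  groups <- learn_groups(X_train)
  # ... and the same grouping is then applied to the held-out rows
  list(train = apply_groups(X_train, groups),
       test = apply_groups(X_test, groups),
       groups = groups)
})
```

Because the grouping may differ from fold to fold, each element of `folds` keeps its own `groups` vector, and the training and test matrices within a fold always share the same columns.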
Below is an example that uses the multideconv subgroups
with pipeML, an R package that supports custom
k-fold construction. We pass our custom function to the argument
fold_construction_fun so that the folds of the model are
computed with multideconv. For more information about the
function arguments, see the pipeML documentation:

library(pipeML)
library(multideconv) # provides prepare_multideconv_folds() and replicate_deconvolution_subgroups()

# traitData_train: Subset of clinical data used for training
# traitData_test: Subset of clinical data used for testing
# deconv_train: Deconvolution of samples used for training
# deconv_test: Deconvolution of samples used for testing

# Train on deconvolution subgroups; the folds are built by prepare_multideconv_folds
res = pipeML::compute_features.training.ML(deconv_train,
                                           traitData_train$Response,
                                           trait.positive = "R",
                                           metric = "AUROC",
                                           stack = FALSE,
                                           k_folds = 5,
                                           n_rep = 10,
                                           feature.selection = FALSE,
                                           LODO = FALSE,
                                           ncores = 3,
                                           return = FALSE,
                                           fold_construction_fun = prepare_multideconv_folds)

# Replicate the deconvolution subgroups learned during training on the test samples
dt_test = replicate_deconvolution_subgroups(res$Custom_output$Processed_deconvolution,
                                            deconv_test)

# Predict on the test set using the replicated subgroups
pred = pipeML::compute_prediction(res,
                                  dt_test,
                                  traitData_test$Response,
                                  trait.positive = "R",
                                  stack = FALSE,
                                  maximize = "Accuracy",
                                  return = TRUE)

NOTE: multideconv is built on top of
existing frameworks and makes extensive use of the R packages
immunedeconv (Sturm et al. (2019)) and
omnideconv (Dietrich et al. (2024)). If you use
multideconv in your work, please cite our package along
with these foundational packages. We also encourage you to cite the
individual deconvolution algorithms you employ in your analysis.
| method | license | citation |
|---|---|---|
| quanTIseq | free (BSD) | Finotello, F., Mayer, C., Plattner, C., Laschober, G., Rieder, D., Hackl, H., …, Sopper, S. (2019). Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data. Genome medicine, 11(1), 34. https://doi.org/10.1186/s13073-019-0638-6 |
| EpiDISH | free (GPL 2.0) | Zheng SC, Breeze CE, Beck S, Teschendorff AE (2018). “Identification of differentially methylated cell-types in Epigenome-Wide Association Studies.” Nature Methods, 15(12), 1059. https://doi.org/10.1038/s41592-018-0213-x |
| DeconRNASeq | free (GPL-2) | joseph.szustakowski@novartis.com TGJDS (2025). DeconRNASeq: Deconvolution of Heterogeneous Tissue Samples for mRNA-Seq data. doi:10.18129/B9.bioc.DeconRNASeq, R package version 1.50.0, https://bioconductor.org/packages/DeconRNASeq |
| AutoGeneS | free (MIT) | Aliee, H., & Theis, F. (2021). AutoGeneS: Automatic gene selection using multi-objective optimization for RNA-seq deconvolution. https://doi.org/10.1101/2020.02.21.940650 |
| BayesPrism | free (GPL 3.0) | Chu, T., Wang, Z., Pe’er, D. et al. Cell type and gene expression deconvolution with BayesPrism enables Bayesian integrative analysis across bulk and single-cell RNA sequencing in oncology. Nat Cancer 3, 505–517 (2022). https://doi.org/10.1038/s43018-022-00356-3 |
| Bisque | free (GPL 3.0) | Jew, B., Alvarez, M., Rahmani, E., Miao, Z., Ko, A., Garske, K. M., Sul, J. H., Pietiläinen, K. H., Pajukanta, P., & Halperin, E. (2020). Publisher Correction: Accurate estimation of cell composition in bulk expression through robust integration of single-cell information. Nature Communications, 11(1), 2891. https://doi.org/10.1038/s41467-020-16607-9 |
| BSeq-sc | free (GPL 2.0) | Baron, M., Veres, A., Wolock, S. L., Faust, A. L., Gaujoux, R., Vetere, A., Ryu, J. H., Wagner, B. K., Shen-Orr, S. S., Klein, A. M., Melton, D. A., & Yanai, I. (2016). A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. In Cell Systems (Vol. 3, Issue 4, pp. 346–360.e4). https://doi.org/10.1016/j.cels.2016.08.011 |
| CIBERSORTx | free for non-commercial use only | Newman, A. M., Liu, C. L., Green, M. R., Gentles, A. J., Feng, W., Xu, Y., Hoang, C. D., Diehn, M., & Alizadeh, A. A. (2015). Robust enumeration of cell subsets from tissue expression profiles. Nature Methods, 12(5), 453–457. https://doi.org/10.1038/s41587-019-0114-2 |
| CPM | free (GPL 2.0) | Frishberg, A., Peshes-Yaloz, N., Cohn, O., Rosentul, D., Steuerman, Y., Valadarsky, L., Yankovitz, G., Mandelboim, M., Iraqi, F. A., Amit, I., Mayo, L., Bacharach, E., & Gat-Viks, I. (2019). Cell composition analysis of bulk genomics using single-cell data. Nature Methods, 16(4), 327–332. https://doi.org/10.1038/s41592-019-0355-5 |
| DWLS | free (GPL) | Tsoucas, D., Dong, R., Chen, H., Zhu, Q., Guo, G., & Yuan, G.-C. (2019). Accurate estimation of cell-type composition from gene expression data. Nature Communications, 10(1), 2975. https://doi.org/10.1038/s41467-019-10802-z |
| MOMF | free (GPL 3.0) | Xifang Sun, Shiquan Sun, and Sheng Yang. An efficient and flexible method for deconvoluting bulk RNAseq data with single-cell RNAseq data, 2019, DOI: 10.5281/zenodo.3373980 |
| MuSiC | free (GPL 3.0) | Wang, X., Park, J., Susztak, K., Zhang, N. R., & Li, M. (2019). Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nature Communications, 10(1), 380. https://doi.org/10.1038/s41467-018-08023-x |
| SCDC | (MIT) | Dong, M., Thennavan, A., Urrutia, E., Li, Y., Perou, C. M., Zou, F., & Jiang, Y. (2020). SCDC: bulk gene expression deconvolution by multiple single-cell RNA sequencing references. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bbz166 |
