
Feature Computation
01-feature-computation.Rmd
library(CellTFusion)
#>
#> CellTFusion requires two primary inputs: cell-type deconvolution proportions and transcription factor (TF) activity scores. This article walks through how to compute both, followed by TF module construction and pathway activity scoring.
Load the pre-packaged example data:
raw.counts <- CellTFusion::raw.counts.tuto
traitdata <- CellTFusion::traitdata.tutoCell-type deconvolution
To compute cell-type proportions, we use the multideconv
R package, which integrates multiple deconvolution algorithms and
signature matrices to estimate cell-type abundances from bulk RNA-seq
data. For more details, visit the multideconv GitHub
repository.
deconv <- multideconv::compute.deconvolution(
raw.counts,
methods = c("Quantiseq", "Epidish"),
normalized = TRUE,
return = FALSE
)
#> Performing TPM normalization ................................................................................
#> Converting input to matrix.
#> Running deconvolution using the following methods...............................................................
#>
#> * Quantiseq
#> * Epidish
#>
#> Running Quantiseq...............................................................
#>
#> >>> Running quantiseq
#>
#> Running quanTIseq deconvolution module
#> Gene expression normalization and re-annotation (arrays: FALSE)
#> Removing 17 noisy genes
#> Removing 15 genes with high expression in tumors
#> Signature genes found in data set: 128/138 (92.75%)
#> Mixture deconvolution (method: lsei)
#> Deconvolution successful!
#>
#> The following method-signature combinations are going to be calculated...............................................................
#>
#> Methods
#> * Epidish
#>
#> Signatures
#> * CBSX-HNSCC-scRNAseq
#> * CBSX-Melanoma-scRNAseq
#> * CBSX-NSCLC-PBMCs-scRNAseq
#> * CBSX-Vanderbilt-scRNAseq
#> * CBSX-Zilionis-scRNAseq
#> * LM22
#> * TIL10
#>
#> Running Epidish...............................................................
#>
#>
#> Running Epidish...............................................................
#>
#>
#> Running Epidish...............................................................
#>
#>
#> Running Epidish...............................................................
#>
#>
#> Running Epidish...............................................................
#>
#>
#> Running Epidish...............................................................
#>
#>
#> Running Epidish...............................................................
#>
#> Deconvolution cache files removed.
#> Preprocessing deconvolution features...............................................................
#>
#> Checking consistency in deconvolution cell fractions across patients...............................................................
#>
#> No extra cell types provided. Only the following cell types will be considered:
#> B.cells
#> B.naive.cells
#> B.memory.cells
#> Macrophages.cells
#> Macrophages.M0
#> Macrophages.M1
#> Macrophages.M2
#> Monocytes
#> Neutrophils
#> NK.cells
#> NK.activated
#> NK.resting
#> NKT.cells
#> CD4.cells
#> CD4.memory.activated
#> CD4.memory.resting
#> CD4.naive
#> CD8.cells
#> CD4.regulatory
#> CD4.non.regulatory
#> T.cells.helper
#> T.cells.gamma.delta
#> Dendritic.cells
#> Dendritic.activated.cells
#> Dendritic.resting.cells
#> Cancer
#> Endothelial
#> Eosinophils
#> Plasma
#> Myocytes
#> Fibroblast
#> Mast.cells
#> Mast.activated.cells
#> Mast.resting.cells
#> CAF
#> uncharacterized_cell
#>
#> If you want to consider other cell types (e.g. from a custom signature) which are not included in the package by default (see README), please provide them in the cells_extra argument.
#>
#> Total sum across samples of combination Quantiseq_TIL10 is 1
#> Total sum across samples of combination Epidish_CBSX.HNSCC.scRNAseq is 1
#> Total sum across samples of combination Epidish_CBSX.Melanoma.scRNAseq is 1
#> Total sum across samples of combination Epidish_CBSX.NSCLC.PBMCs.scRNAseq is 1
#> Total sum across samples of combination Epidish_CBSX.Vanderbilt.scRNAseq is 1
#> Total sum across samples of combination Epidish_TIL10 is 1
#> Total sum across samples of combination Epidish_CBSX.Zilionis.scRNAseq is 1
#> Total sum across samples of combination Epidish_LM22 is 1
head(deconv[, 1:5])
#> Quantiseq_TIL10_B.cells Epidish_CBSX.HNSCC.scRNAseq_B.cells
#> SAM7f0d9cc7f001 0.050210988 0.025626151
#> SAM4305ab968b90 0.008603929 0.000000000
#> SAMcf018fee2acd 0.042857608 0.080914523
#> SAMcc4675f394a1 0.025216439 0.001434758
#> SAM49f9b2e57aa5 0.022668958 0.137032856
#> SAM2e7aa8fa0ab3 0.012022510 0.282602048
#> Epidish_CBSX.Melanoma.scRNAseq_B.cells
#> SAM7f0d9cc7f001 0.066233918
#> SAM4305ab968b90 0.008756584
#> SAMcf018fee2acd 0.009133098
#> SAMcc4675f394a1 0.000000000
#> SAM49f9b2e57aa5 0.036205984
#> SAM2e7aa8fa0ab3 0.005377105
#> Epidish_CBSX.NSCLC.PBMCs.scRNAseq_B.cells
#> SAM7f0d9cc7f001 0.9270601
#> SAM4305ab968b90 0.7335563
#> SAMcf018fee2acd 0.8704203
#> SAMcc4675f394a1 0.7806137
#> SAM49f9b2e57aa5 0.7666496
#> SAM2e7aa8fa0ab3 0.9091358
#> Epidish_CBSX.Vanderbilt.scRNAseq_B.cells
#> SAM7f0d9cc7f001 0
#> SAM4305ab968b90 0
#> SAMcf018fee2acd 0
#> SAMcc4675f394a1 0
#> SAM49f9b2e57aa5 0
#> SAM2e7aa8fa0ab3 0TF activity inference
To infer transcription factor activity, we use the viper
package (Alvarez et
al. 2016) combined with the CollecTRI regulon
database (Müller-Dott et al. 2023).
This approach estimates TF activity from the expression levels of
downstream target genes.
# Normalize by log2(TPM + 1)
counts.norm <- data.frame(ADImpute::NormalizeTPM(raw.counts, log = TRUE))
universe <- decoupleR::get_collectri(organism = "human", split_complexes = FALSE)
tfs <- compute.TFs.activity(counts.norm, universe = universe)
head(tfs[, 1:5])TF module construction (WTCNA)
Before constructing cell groups, we reduce the TF activity matrix into modules of TFs that exhibit similar activity patterns across samples. This is done using a Weighted TF Correlation Network Analysis (WTCNA), adapted from WGCNA (Langfelder and Horvath 2008).
network <- compute.WTCNA(tfs, corr_mod = 0.8, clustering.method = "ward.D2", return = TRUE)To explore how clinical variables associate with TF modules, use
compute.metadata.association(). Plots are saved in the
Results/ directory:
compute.metadata.association(network[[1]], traitdata, pval = 0.05, file.name = "Tutorial", width = 10)Pathway activity scoring
To functionally characterize TF modules, we estimate pathway
activities using a multivariate linear model (MLM) via
decoupleR (Badia-i-Mompel et al.
2022) with the PROGENy database (Schubert et al.
2018) by default. A custom gene set can also be supplied, in
which case GSVA is applied instead.
pathways <- compute.pathway.activity(counts.norm)To visualize the relationship between TF modules and pathways, use
compute.modules.relationship(). This saves a labeled
correlation heatmap to Results/:
compute.modules.relationship(network[[1]], pathways, "Pathways_Progeny-TFs_Modules", width = 15)Hub TF enrichment analysis
If pathway scoring alone is insufficient, you can perform Over-Representation Analysis (ORA) on the Reactome database using hub TFs. The function identifies hub TFs per module (based on module membership and node degree) and runs enrichment on their target genes:
hub_tfs <- identify_hub_TFs(t(tfs), network, MM_thresh = 0.8, degree_thresh = 0.9)
compute.modules.enrichment(counts.norm, hub_tfs)Deconvolution dimensionality reduction
To reduce dimensionality in the deconvolution output and improve
statistical power downstream, we group cell types with similar abundance
patterns using multideconv:
dt <- multideconv::compute.deconvolution.analysis(deconv, corr = 0.7, seed = 123)
dt <- multideconv::deconvolution_dictionary(dt, pathways)
# Correlate deconvolution groups with TF modules
compute.modules.relationship(
network[[1]], dt[[1]], "Deconvolution-TFs_Modules",
vertical = TRUE, height = 30, width = 10, pval = 0.05
)