
Batch/Multi-cohort Analysis
library(CellTFusion)
When samples come from multiple cohorts, batches, or sequencing runs,
technical and cohort-level variation can dominate TF activity and
cell-type correlation structure, masking real biological signal.
CellTFusion handles this in two distinct — and complementary — ways
depending on the pipeline stage, both controlled by the
batch argument. This article explains what
batch actually does at each stage, why it matters, and how
to use it.
Two different meanings of batch
batch is not a single mechanism reused everywhere — its
type and effect differ by function:
| Function | batch type | What it does |
|---|---|---|
| CellTFusion(batch = TRUE, batch_id = "Cohort") | Logical + column name | Splits samples by cohort and computes TF activity separately per cohort; runs consensus WGCNA instead of a single network. |
| compute.WTCNA(batch = TRUE) | Logical | Same as above at the network-construction level: expects TFs.matrix to be a list of per-cohort matrices and calls WGCNA::blockwiseConsensusModules(). |
| compute.deconvolution.analysis(batch = batch_vec) | Vector | Cell-type correlations used for subgrouping are computed as partial correlations controlling for cohort. |
| compute.modules.relationship(batch = batch_vec) | Vector | TF module vs. deconvolution (or pathway) correlations are computed as partial correlations via ppcor::pcor.test(x, y, batch), removing the linear effect of cohort before testing association. |
| construct_cell_groups(batch = batch_vec) | Vector | Pass-through: forwards the batch vector to compute.modules.relationship() and cell.groups.computation() (which forwards it to compute_composite_score()). |
| compute_composite_score() (internal, called by cell.groups.computation()) | Vector | Cell-group and TF-module scores are residualized against batch via residuals(lm(x ~ batch)) before being used, removing any linear cohort effect from the composite scores themselves. |
In short: at the TF network stage
(CellTFusion/compute.WTCNA),
batch is a logical switch that changes the
algorithm (per-cohort computation + consensus WGCNA). At every
downstream stage (deconvolution analysis, module-relationship
correlations, cell group construction, composite scores),
batch is the actual cohort vector, used as
a covariate to regress out or partial out cohort effects.
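To make the "partial out" idea concrete, here is a minimal base R sketch on simulated data. It is an illustration only, not the package's internal code: the variables x, y, and batch are invented for the example, and the partial correlation estimate is obtained by correlating residuals rather than by calling ppcor.

```r
# Illustrative sketch on simulated data (not CellTFusion internals).
set.seed(1)
n <- 100
batch <- factor(rep(c("A", "B"), each = n))

# A cohort-level offset shared by both variables; x and y are otherwise independent.
shift <- ifelse(batch == "B", 3, 0)
x <- rnorm(2 * n) + shift
y <- rnorm(2 * n) + shift

# Naive correlation is inflated by the shared cohort offset.
raw_cor <- cor(x, y)

# Partial correlation given batch: correlate what is left of x and y
# after regressing each one on batch.
rx <- residuals(lm(x ~ batch))
ry <- residuals(lm(y ~ batch))
partial_cor <- cor(rx, ry)

round(c(raw = raw_cor, partial = partial_cor), 2)
```

The raw correlation is driven almost entirely by which cohort a sample belongs to, while the cohort-adjusted estimate stays close to zero, which is the kind of spurious association the batch vector is meant to remove.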
When to use it
Use batch = TRUE whenever your samples are not a single
homogeneous cohort (for example, RNA-seq data from different studies,
sequencing batches, or clinical sites) and cohort identity could
plausibly create spurious TF co-activity or cell-type correlation
patterns unrelated to biology.
If you skip batch on multi-cohort data, TF modules and
cell groups risk being driven primarily by which cohort a sample came
from, rather than by shared TME biology.
Implications
- Consensus WGCNA is more conservative. It only keeps TF-TF correlation structure that is consistent across all cohorts, so batch = TRUE can produce fewer, larger, or differently colored modules than pooling all samples into one matrix.
- Common TF set. When cohorts don’t share every TF (e.g. different regulon coverage), compute.WTCNA(batch = TRUE) restricts analysis to the TFs present in all cohorts before building the consensus network.
- Partial correlation vs. residualization. Partial correlations (compute.deconvolution.analysis(), compute.modules.relationship()) remove the linear association with batch only for the purpose of that specific correlation test; residualization (compute_composite_score()) actually replaces the composite score values with their batch-adjusted residuals, which propagates downstream (e.g. into cell group scores and, from there, into latent factors).
- batch_id must be provided consistently. CellTFusion() requires both coldata and batch_id when batch = TRUE; the same cohort vector (coldata[, batch_id]) is threaded through every downstream function automatically, so you do not need to pass it manually if you use the CellTFusion() wrapper.
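The residualization point can be illustrated with a small base R sketch. It applies the residuals(lm(x ~ batch)) pattern to a simulated score; the cohort names, sizes, and offsets are made up for the example, and this is not a reproduction of compute_composite_score().

```r
# Illustrative sketch on simulated data (not CellTFusion internals).
set.seed(2)
batch <- factor(rep(c("cohort1", "cohort2", "cohort3"), times = c(20, 30, 25)))

# Hypothetical cohort-level offsets added to an otherwise random score.
cohort_shift <- unname(c(cohort1 = 0, cohort2 = 2, cohort3 = -1)[as.character(batch)])
score <- rnorm(length(batch)) + cohort_shift

# Residualize: the adjusted values replace the original scores.
adjusted <- residuals(lm(score ~ batch))

# Per-cohort means before and after adjustment.
tapply(score, batch, mean)
tapply(adjusted, batch, mean)   # all numerically zero
```

With a single categorical covariate, this amounts to centering each cohort on its own mean: differences between samples within a cohort are preserved, while differences between cohort means are removed from everything computed afterwards.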
Example: multi-cohort pipeline
raw.counts <- CellTFusion::raw.counts.tuto
traitdata <- CellTFusion::traitdata.tuto
# traitdata must contain a column identifying the cohort of each sample, e.g. "Cohort"
res <- CellTFusion(
  raw.counts = raw.counts,
  normalized = FALSE,
  coldata = traitdata,
  task = "unsupervised",
  batch = TRUE,
  batch_id = "Cohort",
  deconv_methods = c("Quantiseq", "Epidish"),
  TF.collection = "CollecTRI",
  cancer_type = "skcm",
  corr = 0.7,
  corr_mod = 0.9,
  pval = 0.05,
  file_name = "Tutorial_batch",
  return = TRUE
)
Internally this:
- Splits samples by traitdata$Cohort and calls compute.TFs.activity() separately for each cohort, yielding a list of TF activity matrices.
- Runs compute.WTCNA(batch = TRUE) on that list, producing a consensus TF module network shared across cohorts.
- Passes batch_vec <- traitdata[, "Cohort"] into compute.deconvolution.analysis(), compute.modules.relationship(), and construct_cell_groups(), so subgroup correlations, module-deconvolution associations, and composite cell-group scores are all cohort-adjusted.
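Because the wrapper derives the cohort vector from the batch_id column of coldata, it can be worth inspecting that column before launching a long run. The helper below is a hypothetical sketch, not a CellTFusion function: check_batch_column() and its min_samples threshold (an arbitrary value of 3) are assumptions introduced for illustration, and coldata is assumed to be a data frame.

```r
# Hypothetical pre-flight check; not part of CellTFusion.
check_batch_column <- function(coldata, batch_id, min_samples = 3) {
  if (!batch_id %in% colnames(coldata)) {
    stop("Column '", batch_id, "' not found in coldata.")
  }
  batch_vec <- coldata[[batch_id]]
  if (anyNA(batch_vec)) {
    stop("Column '", batch_id, "' contains missing values.")
  }
  sizes <- table(batch_vec)
  if (length(sizes) < 2) {
    warning("Only one cohort found; there is no cohort effect to adjust for.")
  }
  # min_samples is an arbitrary illustrative threshold, not a package requirement.
  small <- names(sizes)[sizes < min_samples]
  if (length(small) > 0) {
    warning("Cohorts with fewer than ", min_samples, " samples: ",
            paste(small, collapse = ", "))
  }
  invisible(sizes)
}

# Toy metadata standing in for traitdata.
toy_coldata <- data.frame(
  Cohort = rep(c("study1", "study2"), times = c(6, 4)),
  row.names = paste0("sample", 1:10)
)
sizes <- check_batch_column(toy_coldata, "Cohort")
sizes
```

Very small cohorts are flagged because per-cohort TF activity and consensus networks are estimated within each cohort, so a cohort with only a handful of samples contributes a noisy correlation structure.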
Example: using individual functions directly
If you are running the feature computation steps manually (see Feature Computation) rather than
through CellTFusion(), you must build the per-cohort TF
list and batch vector yourself:
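# counts.norm and deconv are assumed to come from the earlier feature computation steps
batch_vec <- traitdata[, "Cohort"]
# Column indices per cohort; assumes traitdata rows are in the same order as counts.norm columns
cohorts <- split(seq_len(ncol(counts.norm)), batch_vec)
# One TF activity matrix per cohort
tfs_list <- lapply(cohorts, function(idx) {
  compute.TFs.activity(counts.norm[, idx, drop = FALSE], TF.collection = "CollecTRI")
})
# Consensus TF module network across cohorts
network <- compute.WTCNA(TFs.matrix = tfs_list, batch = TRUE, minMod = 15, corr_mod = 0.9)
# Cohort-adjusted deconvolution subgroups and cell groups
dt <- multideconv::compute.deconvolution.analysis(deconv, corr = 0.7, batch = batch_vec)
cell_groups <- construct_cell_groups(network, dt, batch = batch_vec, pval = 0.05)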
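Since the consensus network is restricted to TFs present in every cohort, it can help to preview that shared set on your own per-cohort list before building the network. The sketch below uses toy matrices and assumes TFs are stored in columns (samples in rows); if your TF activity matrices are oriented the other way, use rownames() and row subsetting instead. The make_toy() helper and the TF names are purely illustrative.

```r
# Illustrative sketch with toy matrices (not CellTFusion internals).
# Assumed orientation: samples in rows, TFs in columns.
set.seed(3)
make_toy <- function(samples, tfs) {
  matrix(rnorm(length(samples) * length(tfs)),
         nrow = length(samples),
         dimnames = list(samples, tfs))
}
toy_list <- list(
  cohortA = make_toy(paste0("A", 1:4), c("STAT1", "IRF1", "MYC", "TP53")),
  cohortB = make_toy(paste0("B", 1:5), c("IRF1", "STAT1", "TP53", "NFKB1"))
)

# TFs measured in every cohort.
common_tfs <- Reduce(intersect, lapply(toy_list, colnames))
common_tfs   # "STAT1" "IRF1" "TP53"

# Subset each cohort to the shared TFs, in the same column order.
toy_common <- lapply(toy_list, function(m) m[, common_tfs, drop = FALSE])
sapply(toy_common, dim)
```

If the shared set turns out much smaller than any single cohort's TF list, that is a sign the cohorts differ substantially in regulon coverage, and the resulting consensus modules will only reflect the overlapping TFs.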