Machine Learning • CellTFusion

library(CellTFusion)

CellTFusion latent factors (res$Latent_spaces$Z) serve as features for machine learning models. This article shows how to train a classifier on the latent factors of a training cohort, project an independent test cohort into the same latent space with project_test_factors(), and evaluate the classifier on the projected samples.

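Installation: this article uses the pipeML package for model training and also calls functions from caret and multideconv, so those packages must be installed as well. Install pipeML from GitHub before running these examples: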
remotes::install_github("VeraPancaldiLab/pipeML")

Train/test split

raw.counts <- CellTFusion::raw.counts.tuto
traitdata  <- CellTFusion::traitdata.tuto

# Fix the seed (any value works) so the 80/20 split is reproducible
set.seed(123)
index <- caret::createDataPartition(
  traitdata[, "Best.Confirmed.Overall.Response"],
  times = 1, p = 0.8, list = FALSE
)

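# Positional indexing assumes traitdata rows and raw.counts columns
# list the samples in the same order
traitData_train  <- traitdata[index, ]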
raw.counts_train <- raw.counts[, index]

traitData_test   <- traitdata[-index, ]
raw.counts_test  <- raw.counts[, -index]
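
The split above indexes the trait table by row and the count matrix by column with the same positions, so it is only correct when both objects list the samples in the same order. A minimal sketch of the checks one might run first, using toy stand-in objects (toy_counts, toy_traits, and train_idx are hypothetical, not the tutorial data):

```r
# Toy stand-ins (hypothetical data, not the tutorial objects)
set.seed(1)
toy_counts <- matrix(
  rpois(40, lambda = 10), nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("S", 1:10))
)
toy_traits <- data.frame(
  response  = rep(c("PD", "CR"), 5),
  row.names = paste0("S", 1:10)
)

# Positional indexing is only safe when the sample order matches
stopifnot(identical(colnames(toy_counts), rownames(toy_traits)))

# Stand-in for the index returned by the partitioning step
train_idx <- c(1, 2, 3, 4, 6, 7, 8, 9)
train_ids <- colnames(toy_counts)[train_idx]
test_ids  <- colnames(toy_counts)[-train_idx]

# No sample may appear in both cohorts, and none may be lost
no_overlap <- length(intersect(train_ids, test_ids)) == 0
all_used   <- setequal(c(train_ids, test_ids), colnames(toy_counts))
```

If the identical() check fails on real data, reordering one object by the sample names of the other before splitting avoids silently pairing counts with the wrong labels.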

Run CellTFusion on the training set

res_train <- CellTFusion(
  raw.counts    = raw.counts_train,
  normalized    = TRUE,
  coldata       = traitData_train,
  task          = "unsupervised",
  deconv_methods = c("Quantiseq", "Epidish"),
  cancer_type   = "skcm",
  corr          = 0.7,
  pval          = 0.05,
  file_name     = "Train",
  return        = TRUE
)

# Latent factor scores are in res_train$Latent_spaces$Z
head(res_train$Latent_spaces$Z)

Train a machine learning model

The latent factor scores ($Z) are used as features. pipeML::compute_features.training.ML() trains and cross-validates classifiers (random forest, elastic net, etc.) and returns AUROC results.

library(pipeML)

ml_res <- pipeML::compute_features.training.ML(
  features_train = res_train$Latent_spaces$Z,
  target_var     = traitData_train$Best.Confirmed.Overall.Response,
  trait.positive = "PD",
  metric         = "AUROC",
  task_type      = "classification",
  stack          = FALSE,
  k_folds        = 5,
  n_rep          = 10,
  ncores         = 2,
  return         = TRUE
)

Project to an independent test set

project_test_factors() takes the full training result object (res_train) and the deconvolution of the test cohort, and projects the test samples into the training latent space. Because the latent space is learned from the training cohort alone, no information leaks from the test set into feature construction.

# Compute deconvolution on the test cohort
deconv_test <- multideconv::compute.deconvolution(
  raw.counts_test,
  methods    = c("Quantiseq", "Epidish"),
  normalized = TRUE,
  return     = FALSE
)

# Project test samples into training latent space
features_test <- data.frame(project_test_factors(res_train, deconv_test))
head(features_test)
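
Before predicting, it can be worth confirming that the projected test features carry the same columns as the training features. Whether project_test_factors() already guarantees this is not stated here, and data.frame() sanitizes column names by default (check.names = TRUE), so names can differ from those of a matrix. A minimal sketch with hypothetical toy tables (toy_train and toy_test are illustrative, not CellTFusion output):

```r
# Hypothetical factor tables standing in for training and projected test features
toy_train <- data.frame(
  Factor1 = c(0.2, -1.1), Factor2 = c(0.5, 0.3), Factor3 = c(1.0, -0.4)
)
toy_test <- data.frame(
  Factor3 = c(0.1, 0.7), Factor1 = c(-0.3, 0.9), Factor2 = c(0.0, 0.4)
)

# Columns the model expects but the test table lacks, and the reverse
missing_cols <- setdiff(colnames(toy_train), colnames(toy_test))
extra_cols   <- setdiff(colnames(toy_test), colnames(toy_train))
stopifnot(length(missing_cols) == 0, length(extra_cols) == 0)

# Reorder the test columns to the training order
toy_test_aligned <- toy_test[, colnames(toy_train), drop = FALSE]
```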

Predict on the test set

pred <- pipeML::compute_prediction(
  ml_res,
  features_test,
  traitData_test$Best.Confirmed.Overall.Response,
  trait.positive = "PD",
  stack          = FALSE
)

head(pred$Metrics[, 1:5])
pred$AUC$AUROC
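
As a reminder of what the reported metric measures, AUROC is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it from ranks in base R on made-up scores. auroc_rank() is a hypothetical helper written for illustration and is not how pipeML computes the value.

```r
# Rank-based AUROC: share of positive/negative pairs ranked correctly
auroc_rank <- function(scores, labels, positive) {
  is_pos <- labels == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r <- rank(scores)  # average ranks, so tied scores count as half a win
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up predicted probabilities of the positive class and true labels
toy_scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
toy_labels <- c("PD", "PD", "PD", "CR", "CR", "CR")

toy_auc <- auroc_rank(toy_scores, toy_labels, positive = "PD")
toy_auc  # 8 of the 9 positive/negative pairs are ordered correctly: 8/9
```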

Custom k-fold cross-validation

For rigorous evaluation, CellTFusion features should be recomputed inside each fold to prevent data leakage. The compute_features_modular() helper below, which follows the pattern used in the paper scripts, wraps CellTFusion() and project_test_factors() for use inside pipeML’s custom fold interface:

compute_features_modular <- function(data, structure = NULL, deconv = NULL,
                                     coldata = NULL, universe = NULL, paths = NULL,
                                     normalized = TRUE, task = "unsupervised",
                                     cancer_type = "skcm", TF.collection = "CollecTRI",
                                     file_name = NULL, final_training = FALSE,
                                     min_targets_size = 5, minMod = 10,
                                     corr_mod = 0.7, corr = 0.7) {
  if (is.null(structure)) {
    # Training mode: run full CellTFusion.
    # `data` has samples in rows, so transpose it to genes x samples.
    structure <- CellTFusion(
      raw.counts       = t(data),
      deconv           = deconv,
      normalized       = normalized,
      coldata          = coldata,
      universe         = universe,
      paths            = paths,
      task             = task,
      cancer_type      = cancer_type,
      TF.collection    = TF.collection,
      min_targets_size = min_targets_size,
      minMod           = minMod,
      corr_mod         = corr_mod,
      corr             = corr,
      file_name        = file_name,
      return           = final_training,
      verbose          = FALSE
    )
    features <- data.frame(structure$Latent_spaces$Z)
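  } else {
    # Projection mode: map test samples into the training latent space.
    # The test-cohort deconvolution is required here.
    if (is.null(deconv)) {
      stop("`deconv` must be supplied when `structure` is provided.")
    }
    features <- data.frame(project_test_factors(structure, deconv))
  }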
  list(features = features, structure = structure)
}

This pattern ensures the TF module structure is learned only from training samples and consistently applied to test samples at each fold.
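
The two-mode contract (learn a structure when none is given, otherwise reuse it) can be illustrated without CellTFusion. In the sketch below, toy_features() is a hypothetical stand-in that uses PCA via prcomp() in place of CellTFusion, and the folds are built with a plain loop rather than pipeML's fold interface. It only shows how the structure learned on the training rows of a fold is reused for the held-out rows.

```r
# Toy featurizer with the same two-mode contract.
# PCA is a stand-in here, not the CellTFusion method.
toy_features <- function(data, structure = NULL, n_comp = 2) {
  if (is.null(structure)) {
    # Training mode: learn centering, scaling, and loadings
    structure <- prcomp(data, center = TRUE, scale. = TRUE)
    scores <- structure$x
  } else {
    # Projection mode: reuse the training centering, scaling, and loadings
    scores <- predict(structure, newdata = data)
  }
  list(
    features  = data.frame(scores[, seq_len(n_comp), drop = FALSE]),
    structure = structure
  )
}

set.seed(42)
toy_data <- matrix(
  rnorm(30 * 5), nrow = 30,
  dimnames = list(paste0("S", 1:30), paste0("g", 1:5))
)

# Assign each of the 30 samples to one of 5 folds (6 samples per fold)
fold_id <- sample(rep(1:5, length.out = nrow(toy_data)))

fold_features <- lapply(1:5, function(k) {
  train <- toy_features(toy_data[fold_id != k, , drop = FALSE])
  test  <- toy_features(toy_data[fold_id == k, , drop = FALSE],
                        structure = train$structure)
  list(train = train$features, test = test$features)
})
```

Each element of fold_features holds training features and held-out features expressed in the same coordinates, which is what a per-fold classifier needs.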