Colour code: Luca’s protocol, and Bastien’s comments and perspectives.
2.2.1 Background Noise Removal and SummarizedExperiment standardised storage
Eliminate barcodes for which the combined counts of the 4 controls are below a given threshold, here 5 (i.e., keep barcodes whose summed control counts exceed thresh_background = 4).
Listing 2.1: Remove background noise.
barcode_files <- list.files("./data/barcode-counts/", pattern = "exp.*\\.csv", full.names = TRUE)
thresh_background <- 4

## ./data/pheno_data_metadata_2025-04-18.csv
pheno_data <- readr::read_csv(paste0("./data/pheno_data_metadata_", today_date, ".csv"))

## filename <- grep("exp200921", barcode_files, value = TRUE)[2]
barcode_counts_aggregated <- purrr::map(barcode_files, function(filename) {
  barcode_counts <- readr::read_csv(filename)
  experience_name <- filename |>
    basename() |>
    tools::file_path_sans_ext()

  ## 1) explicitly change replicates name to enable 1-1 mapping ----
  if (experience_name == "exp200921_dose response osim") {
    barcode_counts <- barcode_counts |>
      dplyr::rename_with(
        .fn = ~ gsub("exp200921_", "exp200921_dose_response_", .x),
        .cols = starts_with("Contro")
      )
  }
  replicates_names <- setdiff(colnames(barcode_counts), "barcode_id")

  ## 2) Remove lowly expressed barcodes ----
  control_index <- pheno_data |>
    dplyr::filter(Batch_ID == experience_name &
                    Compound == "Control" &
                    Replicates_ID %in% replicates_names) |>
    dplyr::pull(Replicates_ID)
  significant_barcodes_index <- which(rowSums(barcode_counts[, control_index]) > thresh_background)
  filtered_barcode_counts <- barcode_counts[significant_barcodes_index, ]

  message(paste("\n\nWe are considering experiment:", experience_name,
                "with", nrow(filtered_barcode_counts), "barcode IDs kept.\n\n"))
  return(filtered_barcode_counts)
}) |>
  purrr::reduce(~ inner_join(.x, .y, by = "barcode_id"))

## Deal with specific duplicated colnames, resulting from control duplicates
## barcode_counts_aggregated <- barcode_counts_aggregated |>
##   dplyr::select(!ends_with(".y")) |>
##   dplyr::rename_with(.fn = ~ gsub("\\.x$", "", .x),
##                      .cols = dplyr::ends_with(".x"))

## 3) keep only replicates present in pheno_data
barcode_counts_aggregated <- barcode_counts_aggregated |>
  select(all_of(c("barcode_id", pheno_data$Replicates_ID)))
After filtering out background noise and merging all experiments on shared barcode IDs, 4,629 unique barcode IDs are kept across 532 samples; see Listing 2.1 for details.
Perspectives: plot the density of the control counts (and justify the choice of threshold), check for the absence of outliers to evaluate the relevance of this threshold, and compare with existing filtering approaches such as HTSFilter.
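As a starting point for that check, a minimal sketch (not part of Luca’s protocol) plotting the density of summed control counts for one raw experiment file, with the background threshold marked; barcode_files, pheno_data and thresh_background are taken from Listing 2.1.

## Sketch: density of summed control counts for one raw experiment, before
## filtering, to visualise where the background threshold falls. Note that the
## exp200921 experiment would first need the column renaming of Listing 2.1.
library(ggplot2)

raw_counts <- readr::read_csv(barcode_files[1])
raw_name <- tools::file_path_sans_ext(basename(barcode_files[1]))
raw_controls <- pheno_data |>
  dplyr::filter(Batch_ID == raw_name,
                Compound == "Control",
                Replicates_ID %in% colnames(raw_counts)) |>
  dplyr::pull(Replicates_ID)

ggplot(data.frame(total = rowSums(raw_counts[, raw_controls])), aes(x = total)) +
  geom_density() +
  scale_x_continuous(trans = "log1p") +
  ## barcodes whose summed control counts do not exceed the dashed line are removed
  geom_vline(xintercept = thresh_background, linetype = "dashed") +
  labs(x = "Summed control counts per barcode", y = "Density",
       title = paste("Control counts,", raw_name))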
2.2.2 Normalisation
Normalise barcode counts so that the total number of counts per sample equals 100,000; this operation is performed in Listing 2.2.
Listing 2.2: Normalise each sample to a total of \(100,000\) counts.
## 1) normalise by total number of counts ----
barcode_counts_normalised <- barcode_counts_aggregated |>
  mutate(across(
    .cols = where(is.numeric),
    .fns = ~ (.x / sum(.x)) * 1e5
  ))
This is close to two existing normalisation methods: Counts Per Million (CPM), which scales raw counts by the total library size and multiplies by \(1,000,000\), and Total Count Scaling (TCS), which scales raw counts by the total number of reads (or mapped reads) in each sample and then multiplies by a fixed number (e.g., 100,000). Perspectives: compare with other normalisation approaches; note that this scaling ignores biological and technical biases; generate MA plots to verify the mean-variance trend after correction; as such, it is not suitable for DEG analyses. Apply the concatenation of all samples before normalising. See Preprocessing and normalisation functions for transcriptomics: a general overview.
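As a quick cross-check of that equivalence, a minimal sketch, assuming the edgeR package is available: edgeR::cpm() applies the same library-size scaling with a factor of \(1,000,000\), so dividing its output by 10 should reproduce Listing 2.2 up to floating-point error.

## Sketch: compare the per-100,000 scaling of Listing 2.2 with edgeR's CPM
counts_mat <- barcode_counts_aggregated |>
  tibble::column_to_rownames("barcode_id") |>
  as.matrix()

cpm_per_1e5 <- edgeR::cpm(counts_mat) / 10   # counts per million, rescaled to per 100,000

all.equal(
  cpm_per_1e5,
  barcode_counts_normalised |>
    tibble::column_to_rownames("barcode_id") |>
    as.matrix()
)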
2.2.3 Prepare SummarizedExperiment objects for normalised data
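A minimal sketch of this step, mirroring the SummarizedExperiment construction used for the discretised data below, and assuming the pheno_data, barcode_metadata and today_date objects from the other listings; the output path is illustrative.

## Sketch: store the normalised counts in a SummarizedExperiment, one column per replicate
barcode_counts_normalised_mat <- barcode_counts_normalised |>
  tibble::column_to_rownames("barcode_id") |>
  as.matrix()

pheno_normalised <- pheno_data |>
  dplyr::filter(Replicates_ID %in% colnames(barcode_counts_normalised_mat)) |>
  tibble::column_to_rownames("Replicates_ID")

se_normalised <- SummarizedExperiment::SummarizedExperiment(
  ## reorder the assay columns so that they match the rownames of the colData
  assays = list(normalised = barcode_counts_normalised_mat[, rownames(pheno_normalised)]),
  rowData = barcode_metadata,
  colData = pheno_normalised
)
saveRDS(se_normalised,
        file = paste0("./results/compounds/se_normalised_per_replicate_", today_date, ".rds"))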
Calculate the mean of the 4 controls, followed by a differentially represented barcode analysis: binarise each replicate, assigning a 1 if the fold change with respect to the control mean is above 3, and 0 otherwise (note that Listing 2.6 implements this as a difference to the control mean).
Listing 2.6: Compute the average value of the control replicates, then binarise each compound replicate; a 1 denotes a significant positive enrichment of a given barcode_id.
## Step 1: Compute control means per barcode_id and Batch ----
threshold_FC <- 3
control_means <- barcode_counts_long |>
  filter(Compound == "Control") |>
  group_by(Batch_ID, barcode_id) |>
  summarise(Control_Mean = mean(Barcode_Counts), .groups = "drop")

## Step 2: Join control mean to full dataset ----
barcode_discretised_replicates <- barcode_counts_long |>
  left_join(control_means, by = c("Batch_ID", "barcode_id"))

## Step 3: Compute the difference to the control mean and discretise ----
barcode_discretised_replicates <- barcode_discretised_replicates |>
  filter(Compound != "Control") |>
  mutate(
    Diff = Barcode_Counts - Control_Mean,
    Barcode_Counts = if_else(Diff > threshold_FC, 1, 0)
  ) |>
  ## Final cleanup
  select(-Diff, -Control_Mean) |>
  dplyr::mutate(Barcode_Counts = as.logical(Barcode_Counts))
Batch effect correction for integrating across batches? // As Vera pointed out, why keep only positive values? Negative ones are also interesting. Why not pair \(p\)-values with fold changes (given the very small sample sizes)? // All these operations can be performed in one run with lm, using a model such as \(\text{Expr} \sim 0 + \text{Gene} + \text{Batch} + \text{Drug}\) with a contr.treatment design matrix. // Perform sensitivity analyses to evaluate the impact of the binarisation threshold criteria on downstream analyses.1
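A minimal sketch of that lm suggestion (not the protocol’s current implementation), run on a small subset of barcodes so that the single lm() call stays tractable; barcode_counts_long is assumed to have the columns used in Listing 2.6.

## Sketch: one lm() run instead of the manual control-mean subtraction.
## With contr.treatment (R's default for unordered factors), the Compound
## coefficients are estimated relative to the reference level "Control".
subset_barcodes <- head(unique(barcode_counts_long$barcode_id), 50)  # illustrative subset

lm_data <- barcode_counts_long |>
  dplyr::filter(barcode_id %in% subset_barcodes) |>
  dplyr::mutate(
    barcode_id = factor(barcode_id),
    Batch_ID   = factor(Batch_ID),
    Compound   = stats::relevel(factor(Compound), ref = "Control")
  )

fit_suggested <- lm(Barcode_Counts ~ 0 + barcode_id + Batch_ID + Compound, data = lm_data)

## per-compound effects (relative to Control), with standard errors and p-values
coefs <- summary(fit_suggested)$coefficients
coefs[grepl("^Compound", rownames(coefs)), ]

For the full set of barcodes, a gene-wise fit (e.g. limma::lmFit on the wide matrix with a ~ 0 + Batch + Drug design) would scale better than a single lm() over all barcodes.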
2.3.1 SummarizedExperiment of binarised markers
SummarizedExperiment at the replicate level (controls being discarded).
## Step 4: Save as a SummarizedExperiment ----
barcode_discretised_replicates_mat <- barcode_discretised_replicates |>
  tidyr::pivot_wider(id_cols = barcode_id,
                     names_from = Replicates_ID,
                     values_from = Barcode_Counts)

pheno_replicates_discretised <- pheno_data |>
  dplyr::filter(Replicates_ID %in% barcode_discretised_replicates$Replicates_ID) |>
  tibble::column_to_rownames("Replicates_ID")

se_discretised_replicates <- SummarizedExperiment::SummarizedExperiment(
  assays = list(binarised = barcode_discretised_replicates_mat |>
                  tibble::column_to_rownames("barcode_id") |>
                  as.matrix()),
  rowData = barcode_metadata,
  colData = pheno_replicates_discretised
)
saveRDS(se_discretised_replicates,
        file = paste0("./results/compounds/se_discretised_per_replicate_", today_date, ".rds"))
Logical AND over replicates: drug by batch (and concentration and time).
Listing 2.7: Aggregate the replicates using an all() operator.
## 2) discretised values at the compound-by-batch level (combining over replicates), with an all() function
barcode_discretised_compound_batch <- barcode_discretised_replicates |>
  tidyr::pivot_wider(id_cols = c(barcode_id),
                     names_from = c(Batch_ID, Compound, Concentrations_ID, Duration_ID),
                     names_sep = ":",
                     values_from = Barcode_Counts,
                     values_fn = all)

pheno_discretised_compound_batch <-
  pheno_compound_batch[setdiff(colnames(barcode_discretised_compound_batch), "barcode_id"), ]

se_compound_batch_discretised <- SummarizedExperiment::SummarizedExperiment(
  assays = list(discretised = barcode_discretised_compound_batch |>
                  tibble::column_to_rownames("barcode_id") |>
                  as.matrix()),
  rowData = barcode_metadata,
  colData = pheno_discretised_compound_batch
)
saveRDS(se_compound_batch_discretised,
        file = paste0("./results/compounds/se_discretised_per_compound_by_batch_", today_date, ".rds"))
Logical AND over all replicates and batches of each compound.
Listing 2.8: Aggregate at the compound level, assigning a 1 if the barcoded cell line was found positively enriched in all replicates and batches of a given compound.
## 3) discretised values at the compound level (combining over replicates and batches), with an all() function
barcode_discretised_compound <- barcode_discretised_replicates |>
  tidyr::pivot_wider(id_cols = c(barcode_id),
                     names_from = c(Compound),
                     names_sep = ":",
                     values_from = Barcode_Counts,
                     values_fn = all)

pheno_discretised_compound <-
  pheno_compound[setdiff(colnames(barcode_discretised_compound), "barcode_id"), ]

se_compound_discretised <- SummarizedExperiment::SummarizedExperiment(
  assays = list(discretised = barcode_discretised_compound |>
                  tibble::column_to_rownames("barcode_id") |>
                  as.matrix()),
  rowData = barcode_metadata,
  colData = pheno_discretised_compound
)
saveRDS(se_compound_discretised,
        file = paste0("./results/compounds/se_discretised_per_compound_", today_date, ".rds"))
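For downstream work, the saved objects can be reloaded and queried through the standard SummarizedExperiment accessors; a brief usage sketch:

## Usage sketch: reload the compound-level object and inspect its pieces
se_compound <- readRDS(paste0("./results/compounds/se_discretised_per_compound_",
                              today_date, ".rds"))
SummarizedExperiment::assay(se_compound, "discretised")[1:5, 1:5]  # first barcodes x compounds
SummarizedExperiment::colData(se_compound)                         # compound-level metadata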
Test the Helmert contrast to evaluate compound dose responses, or time-course replicates. Also, for repeated measures across time, see One Way repeated measure ANOVA in R.↩︎
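A minimal sketch of the Helmert idea, assuming the levels of Concentrations_ID follow increasing dose; the fit pools all compounds and is purely illustrative.

## Sketch: Helmert contrasts on the concentration factor, so that each
## coefficient compares one dose level with the average of the lower levels.
## Reorder the factor levels explicitly if they do not follow increasing dose.
dose_data <- barcode_counts_long |>
  dplyr::filter(Compound != "Control") |>
  dplyr::mutate(dose = factor(Concentrations_ID))

contrasts(dose_data$dose) <- contr.helmert(nlevels(dose_data$dose))

fit_dose <- lm(Barcode_Counts ~ dose, data = dose_data)
summary(fit_dose)$coefficients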