1 Data Management
1.1 Reproducibility
I list below the R packages required to reproduce the analyses.
1.2 Barcode counts cleaning
Below, we convert each
.TABULARfile to its recommended.tsvfile extension, and rename unequivocally the firstcolnameintobarcode_id:
Code
barcode_files <- list.files("./data/barcode-counts/",
pattern = "exp.*\\.tabular",
full.names = TRUE)
## change format + barcode name
barcode_matrices <- lapply(barcode_files, function(old_filename) {
expression_matrix <- readxl::read_tsv(old_filename) |>
rename(barcode_id = 1)
new_filename <- old_filename |> basename() |> tools::file_path_sans_ext()
readr::write_tsv(expression_matrix,
file = file.path("data", "barcode-counts", paste0(new_filename, ".tsv")))
unlink(old_filename, force = TRUE)
})I rename the first colname of each expression file as
barcode_id, storing unequivocal tag.-
List of changes applied for easier mapping between replicates’ barcode profiles and samples’ metadata:
-
exp130921, switch fromOsimer1_001g_exp130921_run101121_28toOsimer1_001m_exp130921_run101121_28, from001gto001m(in-vivo context), and switch fromAzacyt1_1,5u_exp130921_run290921_40toAzacyt1_1.50u_exp130921_run290921_40,1,5uto1.50u1. -
exp200921: switch fromBafilo1_1,2n_exp200921_run111021_45toBafilo1_1.20n_exp200921_run111021_45,1,2nto1.20n. In addition, the 4 control replicates of dose-response batchexp200921_dose_responseare shared with standard experimentexp200921.csv. -
exp151121: switch fromMitomy1_002x_exp151121_run071221_37toMitomy1_0.02u_exp151121_run071221_37,002xto0.02u. -
exp070222: switch fromTunica1_0,5x_exp070222_run010322_05toTunica1_0.50u_exp070222_run010322_05,0,5xto0.50u. -
exp070222t25: more comprehensive, so we keep the oldest one (no replicatecDNAmissing). FromcDNA1_000u_exp281022_run141222_01toContro1_000u_exp281022_run141222_01! What doescDNArefer to as?. -
exp281022t25: one control and 3 Drug Compounds were sampled in T25 flasks.P42andP43are used for evaluating barcode drift, a technical artefact in relation with depth sequencing. Most of the replicates lack of significant effect due to insufficient dose.
-
exp281022gammerefers to the time course experiment.
We report in Table 1.1 a summary of the barcode counts (File location, batch_id, and dimensions).
Code
barcode_files <- list.files("./data/barcode-counts/",
pattern = "exp.*\\.tsv",
full.names = TRUE)
barcode_summaries <- purrr::map(barcode_files, function(filename) {
barcode_counts <- readr::read_tsv(filename, show_col_types = FALSE)
experience_summary <- tibble::tibble(Filename = basename(filename),
`Num Barcode IDs` = nrow(barcode_counts),
`Num Replicates` = ncol(barcode_counts) - 1)
return(experience_summary)
}) |>
purrr::list_rbind()
barcode_summaries <- barcode_summaries |>
dplyr::inner_join(readxl::read_excel("data/Table of compounds_whole_2025-07-01.xlsx",
sheet = "Batch Mapping"),
by = "Filename") |>
dplyr::relocate(Batch_ID, .after = Filename) |>
dplyr::relocate(`Num Barcode IDs`, `Num Replicates`, .after = `Run date`)
flextable::flextable(barcode_summaries) |>
bold(part = "header") |>
flextable::merge_v(j = c("Filename", "Batch_ID", "General Comments Per Batch"),
part = "body")
openxlsx::write.xlsx(x = barcode_summaries, asTable = TRUE,
file = paste0("./data/barcode_counts_summaries_", today_date, ".xlsx"))Filename |
Batch_ID |
Date |
Run date |
Num Barcode IDs |
Num Replicates |
General Comments Per Batch |
|---|---|---|---|---|---|---|
exp010821.tsv |
exp010821 |
010821 |
180821 |
1,168,726 |
22 |
2 Control Time Zeros + 4 replicates of X13271, compared to old |
exp040821.tsv |
exp040821 |
040821 |
180821 |
1,168,726 |
22 |
2 Control Time Zeros + 4 replicates of X13271, compared to old (same remark as exp040821) |
exp070222.tsv |
exp070222 |
070222 |
010322 |
822,697 |
70 |
2 Additional Time Control Zeros |
070222 |
060522 |
822,697 |
70 |
|||
exp130921.tsv |
exp130921 |
130921 |
290921 |
800,949 |
42 |
In new dataset, 2 additional Control Time Zeros and in vivo mixed with standard experiences. |
130921 |
111021 |
800,949 |
42 |
|||
exp130921_in_vivo |
130921 |
101121 |
800,949 |
42 |
||
exp151121.tsv |
exp151121 |
151121 |
071221 |
628,033 |
62 |
2 Additional Time Control Zeros |
151121 |
180122 |
628,033 |
62 |
|||
exp181021.tsv |
exp181021 |
181021 |
101121 |
631,355 |
61 |
|
181021 |
251121 |
631,355 |
61 |
|||
exp200921.tsv |
exp200921 |
200921 |
111021 |
518,133 |
58 |
Merged with dose response. |
exp200921_dose_response |
200921 |
101121 |
518,133 |
58 |
Merged with 200921 standard +2 Additional Time Control Zeros. Does not contain controls, so merged with exp200921!! |
|
exp220322.tsv |
exp220322 |
220322 |
060522 |
1,038,088 |
70 |
2 Additional Time Control Zeros + 2 additional replicates of XAV939 |
220322 |
200522 |
1,038,088 |
70 |
|||
exp271221.tsv |
exp271221 |
271221 |
140122 |
439,396 |
82 |
2 Additional Time Control Zeros |
271221 |
180122 |
439,396 |
82 |
|||
exp281022gammeWithCTR.tsv |
exp281022_time_course |
281022 |
110723 |
509,473 |
56 |
Previously missing 3m_osim4_100n_exp180123_run110723_44 replicate is avalaible again in new dataset -> now 4 replicates // However, all 4 controls, such as 9d_Contro2_000u_exp281022_run110723_02 (from 01 to 04) are missing in new dataset, while present in older dataset // Finally, 4 new 'T-75' controls appear in new dataset, but `CTRL3-T75_000u_exp281022_run110723_24` and `CTRL1-T75_000u_exp281022_run110723_22` only contain null values -> we choose to discard them!! // in contrast with other batches, newest file provides signficant barcode values, that are not null!! (around 1%).
|
071122 |
110723 |
509,473 |
56 |
Previously missing 3m_osim4_100n_exp180123_run110723_44 replicate is avalaible again in new dataset -> now 4 replicates // However, all 4 controls, such as 9d_Contro2_000u_exp281022_run110723_02 (from 01 to 04) are missing in new dataset, while present in older dataset // Finally, 4 new 'T-75' controls appear in new dataset, but `CTRL3-T75_000u_exp281022_run110723_24` and `CTRL1-T75_000u_exp281022_run110723_22` only contain null values -> we choose to discard them!! // in contrast with other batches, newest file provides signficant barcode values, that are not null!! (around 1%).
|
||
181122 |
110723 |
509,473 |
56 |
|||
021222 |
110723 |
509,473 |
56 |
|||
161222 |
110723 |
509,473 |
56 |
|||
180123 |
110723 |
509,473 |
56 |
|||
exp281022t25.tsv |
exp281022_t25 |
281022 |
141222 |
315,642 |
18 |
What's T25? 8 controls, for which reason? One mentioned as cDNA (first 4). Older dataset was more comprehensive, so we keep the oldest one. |
exp300821.tsv |
exp300821 |
300821 |
290921 |
413,335 |
31 |
Additional Sorafenib experience, with 005u concentration, 3X replicates only!! |
From Table 1.1:
-
Number of distinct time points thaws: 16, returned by
{r} length(unique(barcode_summaries$Date)) - Number of runs: 13, based on the total number of Barcode sequencing Runs.
-
Number of batches: 14, based on Batch IDs, and returned by
{r} length(unique(barcode_summaries$Batch_ID)) -
Technical covariates: 25. Is computed as the Cartesian product of batches per run, since both Date of experiment and sequencing could impact the output, as returned by
{r} nrow(barcode_summaries).
1.3 Phenotype metadata cleaning
Most of the R instructions reported in this section aim at rendering the Table of compounds file compliant with the tidy format, see Tip 1.1. These data-wrangling operations are split in 4 steps:
- In Listing 1.1, we homogenise
Concentrations,DateandDurationto ISO standards, while dealing with erroneous missing values generated by fused Excel cells.
Code
pheno_colnames <- c(Old_Pathway = "Pathways", Old_MoA = "...2",
Compound = "Samples", Replicates = "Replicates",
Concentrations_Formatted = "Concentrations", Date = "Experiment date",
`Run date` = "Run date", Duration_ID = "Duration",
Kept = "Kept", Comments = "Comments")
## prune irrelevant colnames ----
samples_metadata <- readxl::read_xlsx(paste0("data/Table of compounds_whole_", today_date, ".xlsx"),
sheet = "Experimental Design",
col_types = c(rep("text", 3), "numeric",
rep("text", 4), "logical", "text")) |>
dplyr::rename(dplyr::all_of(pheno_colnames)) |>
tidyr::fill(Old_Pathway, Old_MoA, .direction = "down") |>
tidyr::fill(Compound, .direction = "down") |>
dplyr::mutate(Old_Pathway = if_else(is.na(Old_Pathway), Old_MoA, Old_Pathway))
# integrate new pathway information from Luca ----
updated_MoA_pathway <- readxl::read_xlsx("data/Table of compounds_whole_2025-07-01.xlsx",
sheet = "Pathway Mapping")
samples_metadata <- samples_metadata |>
dplyr::inner_join(updated_MoA_pathway,
by = join_by(Compound))
## format concentrations, creating both 'Concentrations_ID' and 'Concentrations' in Moles ----
samples_metadata <- samples_metadata |>
dplyr::mutate(Concentrations = if_else(Compound == "Osimertinb+sorafenib", "015 uM", Concentrations_Formatted)) |>
tidyr::separate_wider_delim(Concentrations, delim = " ", names = c("Concentrations Value", "Unit"), cols_remove = FALSE) |>
dplyr::mutate(Unit = case_when(
Unit %in% c("uM", "µM", "µg/ml") ~ "u",
Unit == "nM" ~ "n",
Unit %in% c("mg/kg", "mM") ~ "m",
)) |>
dplyr::mutate(Concentrations=parse_number(`Concentrations Value`, trim_ws = TRUE,
locale = locale(decimal_mark = ",", grouping_mark = "."))) |>
dplyr::mutate(Concentrations = case_when(
Unit == "u" ~ Concentrations *10^-6,
Unit == "n" ~ Concentrations *10^-9,
Unit == "m" ~ Concentrations *10^-3)) |>
dplyr::mutate(`Concentrations Value`= parse_number(`Concentrations Value`, trim_ws = TRUE,
locale = locale(decimal_mark = ",", grouping_mark = ".")) |>
format_concentrations()) |>
tidyr::unite(col ="Concentrations_ID", `Concentrations Value`, Unit, sep = "")
## format Durations, creating both 'Duration_ID' and 'Duration' as an integer value in Days ----
samples_metadata <- samples_metadata |>
## extract last 6 numbers, as the only ones used for identification of samples downstream
dplyr::mutate(Duration = parse_number(Duration_ID, trim_ws = TRUE,
locale = locale(decimal_mark = ".")),
Duration = dplyr::if_else(grepl("m$", Duration_ID),
Duration*30, Duration))
rm(updated_MoA_pathway)- In Listing 1.2, we extract from each
replicate labelpatterns enabling its mapping toTable of compounds, and usessheet:'Drugs Mapping'to map the prefix to its completeDrugBank ID.
Code
batch_mapping <- readxl::read_xlsx("data/Table of compounds_whole_2025-07-01.xlsx",
sheet = "Batch Mapping")
barcode_counts_paths <- batch_mapping |>
dplyr::pull(Filename) |> unique()
## step 2) extract individual replicate IDs directly from sample experiences ----
ID_mapping <- purrr::map(barcode_counts_paths, function(Filename) {
## build path
counts_path <- file.path("data","barcode-counts",
paste0(Filename))
## extract replicate names (reading the header)
replicates_ID <- strsplit(readLines(counts_path, n = 1), split = "\t")[[1]] |>
str_subset("\\S") |> ## remove empty strings
str_subset("barcode_id", negate = TRUE)
## ID: combination of letters, numbers and -, followed by '[1-4]_', starting the sample's name
dataset_ID <- tibble::tibble(Filename = Filename,
Date = stringr::str_extract(replicates_ID,
"(?<=_exp)[[:alnum:]]{6}"),
Replicates_ID = replicates_ID,
`Run date` = stringr::str_extract(replicates_ID,
"(?<=_run)[[:alnum:]]{6}(?=_)"),
Concentrations_ID = stringr::str_extract(replicates_ID,
"(?<=[1-8]_)[[:digit:]]{3}[gmnux]{1}(?=_exp)")) |>
## account for second notation syntax for concentrations, namely for decimal values
dplyr::mutate(Concentrations_ID = if_else(is.na(Concentrations_ID),
stringr::str_extract(replicates_ID,
"(?<=[1-8]_)[0-9]{1}\\.[0-9]{2}[gmnux]{1}(?=_exp)"),
Concentrations_ID))
## Deal with time-course specific nomenclature
if (grepl("281022(.*)gamme", Filename)) {
dataset_ID <- dataset_ID |>
dplyr::mutate(Samples_ID = stringr::str_extract(replicates_ID,
"(?<=_)[[[:alnum:]]\\-]+(?=[1-8]_)"),
Duration_ID=stringr::str_extract(replicates_ID,
"^[[[:alnum:]]\\.]{1,3}[dm](?=_)"))
# deal with T-75 Controls
dataset_ID <- dataset_ID |>
dplyr::mutate(Samples_ID = if_else(grepl("^CTRL[[:digit:]]{1}-T75", Replicates_ID),
"CTRL", Samples_ID),
Duration_ID = if_else(grepl("^CTRL[[:digit:]]{1}-T75", Replicates_ID),
"0d", Duration_ID))
}
else {
dataset_ID <- dataset_ID |>
dplyr::mutate(Samples_ID = stringr::str_extract(replicates_ID,
"^[[[:alnum:]]\\-]+(?=[1-8]{1}_)"),
Duration_ID = if_else(grepl("^Temps0", Replicates_ID),"0d", "9d"))
}
return(dataset_ID)
}) |>
purrr::list_rbind()
## deal with specific p42 and p43
ID_mapping <- ID_mapping |>
dplyr::mutate(Samples_ID = if_else(grepl("^p42", Replicates_ID), "p42", Samples_ID),
Samples_ID = if_else(grepl("^p43", Replicates_ID), "p43", Samples_ID))
## detect bad extraction of patterns samples
# ID_mapping_missing <- ID_mapping |>
# filter(if_any(everything(), is.na))
## step 3) map short compound IDs with full compound names ----
mapping_compounds <- readxl::read_xlsx("data/Table of compounds_whole_2025-07-01.xlsx",
sheet = "Drugs Mapping") |>
select(-Comments)
ID_mapping <- ID_mapping |>
inner_join(mapping_compounds, by = "Samples_ID")
# ID_mapping_test <- ID_mapping |>
# anti_join(batch_mapping,
# by = dplyr::join_by(Filename, Date, `Run date`))
ID_mapping <- ID_mapping |>
dplyr::inner_join(batch_mapping,
by = dplyr::join_by(Filename, Date, `Run date`)) |>
dplyr::select(-`General Comments Per Batch`)
rm(batch_mapping, mapping_compounds)- In Table 1.2 is the final, curated metadata phenotype table, adhering to tidy principles Tip 1.1. This includes:
- Standardisation of Dates to
ISO 1860 Date format.
Code
## step 1) Use international date formats ----
replicates_metadata <- samples_metadata |>
dplyr::inner_join(ID_mapping,
by = dplyr::join_by(Compound, Date, `Run date`,
Duration_ID, Concentrations_ID)) |>
## convert to ISO 1860 Date format
mutate(Date = lubridate::dmy(Date) |> format(),
`Run date` = lubridate::dmy(`Run date`)) |>
# relocate colnames
dplyr::select(Batch_ID, Filename,
Pathway, Old_Pathway, MoA, Old_MoA,
Compound, Samples_ID, Kept,
Date, `Run date`, Duration, Duration_ID,
Concentrations_Formatted, Concentrations_ID, Concentrations,
Replicates, Replicates_ID, Comments)
replicates_metadata_shortened <- replicates_metadata |>
dplyr::select(-Old_Pathway, -Old_MoA, -Samples_ID,
-Duration_ID, -Concentrations_ID, -Replicates)
# replicates_metadata_deduplicated <- replicates_metadata |>
# dplyr::distinct(Replicates_ID, .keep_all = TRUE)
# all.equal(replicates_metadata, replicates_metadata_deduplicated)
## Step 2) save the tidy phenotype metatada table ----
readr::write_csv(replicates_metadata,
file = paste0("data/replicates_barcode_metadata_comprehensive_",
today_date,".csv"))
readr::write_csv(replicates_metadata_shortened,
file = paste0("data/replicates_barcode_metadata_", today_date,".csv"))
flextable::flextable(head(replicates_metadata_shortened)) |>
autofit() |>
bold(part = "header")Batch_ID |
Filename |
Pathway |
MoA |
Compound |
Kept |
Date |
Run date |
Duration |
Concentrations_Formatted |
Concentrations |
Replicates_ID |
Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
exp010821 |
exp010821.tsv |
Control - time zero |
Control - time zero |
Control - time zero |
TRUE |
2021-08-01 |
2021-08-18 |
0 |
000 uM |
0 |
Temps01_000u_exp010821_run180821_01 |
|
exp010821 |
exp010821.tsv |
Control - time zero |
Control - time zero |
Control - time zero |
TRUE |
2021-08-01 |
2021-08-18 |
0 |
000 uM |
0 |
Temps02_000u_exp010821_run180821_02 |
|
exp040821 |
exp040821.tsv |
Control - time zero |
Control - time zero |
Control - time zero |
TRUE |
2021-08-04 |
2021-08-18 |
0 |
000 uM |
0 |
Temps01_000u_exp040821_run180821_23 |
|
exp040821 |
exp040821.tsv |
Control - time zero |
Control - time zero |
Control - time zero |
TRUE |
2021-08-04 |
2021-08-18 |
0 |
000 uM |
0 |
Temps02_000u_exp040821_run180821_24 |
|
exp130921 |
exp130921.tsv |
Control - time zero |
Control - time zero |
Control - time zero |
TRUE |
2021-09-13 |
2021-09-29 |
0 |
000 uM |
0 |
Temps01_000u_exp130921_run290921_48 |
|
exp130921_in_vivo |
exp130921.tsv |
Control - time zero |
Control - time zero |
Control - time zero |
TRUE |
2021-09-13 |
2021-11-10 |
0 |
000 uM |
0 |
Temps02_000u_exp130921_run101121_19 |
In vivo, tumoral samples |
- Each variable has its own column.
- Each observation has its own row – Every row corresponds to one observation.
- Each value has its own cell – Each cell contains a single, unique value.
This format streamlines data wrangling, and generally speaking, data analysis and visualisation. In other words, prefer simpler CSV formats for the experimental design, and avoid cell fusion in Excel documents. Finally, for the formatted tabular representation in documentations, I use flextable.
In-vivo = cells injected in living mice,
controlversusosermintinimb.↩︎