1 Data management: metadata correction and cleansing
1.1 Reproducibility
I list below the R packages required to reproduce the analyses.
1.2 Barcode counts cleaning
Below, we convert each
.TABULAR
file to its recommended.tsv
file extension, and rename unequivocally the firstcolname
intobarcode_id
:
Code
barcode_files <- list.files("./data/barcode-counts/",
pattern = "exp.*\\.tabular",
full.names = TRUE)
## change format + barcode name
barcode_matrices <- lapply(barcode_files, function(old_filename) {
expression_matrix <- readxl::read_tsv(old_filename) |>
rename(barcode_id = 1)
new_filename <- old_filename |> basename() |> tools::file_path_sans_ext()
readr::write_tsv(expression_matrix,
file = file.path("data", "barcode-counts", paste0(new_filename, ".tsv")))
unlink(old_filename, force = TRUE)
})
I rename the first colname of each expression file as
barcode_id
, storing unequivocal tag.-
List of changes applied for easier mapping between replicates’ barcode profiles and samples’ metadata:
-
exp130921
, switch fromOsimer1_001g_exp130921_run101121_28
toOsimer1_001m_exp130921_run101121_28
, from001g
to001m
(in-vivo context), and switch fromAzacyt1_1,5u_exp130921_run290921_40
toAzacyt1_1.50u_exp130921_run290921_40
,1,5u
to1.50u
1. -
exp200921
: switch fromBafilo1_1,2n_exp200921_run111021_45
toBafilo1_1.20n_exp200921_run111021_45
,1,2n
to1.20n
. In addition, the 4 control replicates of dose-response batchexp200921_dose_response
are shared with standard experimentexp200921.csv
. -
exp151121
: switch fromMitomy1_002x_exp151121_run071221_37
toMitomy1_0.02u_exp151121_run071221_37
,002x
to0.02u
. -
exp070222
: switch fromTunica1_0,5x_exp070222_run010322_05
toTunica1_0.50u_exp070222_run010322_05
,0,5x
to0.50u
. -
exp070222t25
: more comprehensive, so we keep the oldest one (no replicatecDNA
missing). FromcDNA1_000u_exp281022_run141222_01
toContro1_000u_exp281022_run141222_01
! What doescDNA
refer to as?. -
exp281022t25
: one control and 3 Drug Compounds were sampled in T25 flasks.P42
andP43
are used for evaluating barcode drift, a technical artefact in relation with depth sequencing. Most of the replicates lack of significant effect due to insufficient dose.
-
exp281022gamme
refers to the time course experiment.
We report in Table 1.1 a summary of the barcode counts (File location, batch_id
, and dimensions).
Code
barcode_files <- list.files("./data/barcode-counts/",
pattern = "exp.*\\.tsv",
full.names = TRUE)
barcode_summaries <- purrr::map(barcode_files, function(filename) {
barcode_counts <- readr::read_tsv(filename, show_col_types = FALSE)
experience_summary <- tibble::tibble(Filename = basename(filename),
`Num Barcode IDs` = nrow(barcode_counts),
`Num Replicates` = ncol(barcode_counts) - 1)
return(experience_summary)
}) |>
purrr::list_rbind()
barcode_summaries <- barcode_summaries |>
dplyr::inner_join(readxl::read_excel("data/Table of compounds_whole_2025-07-01.xlsx",
sheet = "Batch Mapping"),
by = "Filename") |>
dplyr::relocate(Batch_ID, .after = Filename) |>
dplyr::relocate(`Num Barcode IDs`, `Num Replicates`, .after = `Run date`)
flextable::flextable(barcode_summaries) |>
bold(part = "header") |>
flextable::merge_v(j = c("Filename", "Batch_ID", "General Comments Per Batch"),
part = "body")
openxlsx::write.xlsx(x = barcode_summaries, asTable = TRUE,
file = paste0("./data/barcode_counts_summaries_", today_date, ".xlsx"))
Filename |
Batch_ID |
Date |
Run date |
Num Barcode IDs |
Num Replicates |
General Comments Per Batch |
---|---|---|---|---|---|---|
exp010821.tsv |
exp010821 |
010821 |
180821 |
1,168,726 |
22 |
2 Control Time Zeros + 4 replicates of X13271, compared to old |
exp040821.tsv |
exp040821 |
040821 |
180821 |
1,168,726 |
22 |
2 Control Time Zeros + 4 replicates of X13271, compared to old (same remark as exp040821) |
exp070222.tsv |
exp070222 |
070222 |
010322 |
822,697 |
70 |
2 Additional Time Control Zeros |
070222 |
060522 |
822,697 |
70 |
|||
exp130921.tsv |
exp130921 |
130921 |
290921 |
800,949 |
42 |
In new dataset, 2 additional Control Time Zeros and in vivo mixed with standard experiences. |
130921 |
111021 |
800,949 |
42 |
|||
exp130921_in_vivo |
130921 |
101121 |
800,949 |
42 |
||
exp151121.tsv |
exp151121 |
151121 |
071221 |
628,033 |
62 |
2 Additional Time Control Zeros |
151121 |
180122 |
628,033 |
62 |
|||
exp181021.tsv |
exp181021 |
181021 |
101121 |
631,355 |
61 |
|
181021 |
251121 |
631,355 |
61 |
|||
exp200921.tsv |
exp200921 |
200921 |
111021 |
518,133 |
58 |
Merged with dose response. |
exp200921_dose_response |
200921 |
101121 |
518,133 |
58 |
Merged with 200921 standard +2 Additional Time Control Zeros. Does not contain controls, so merged with exp200921!! |
|
exp220322.tsv |
exp220322 |
220322 |
060522 |
1,038,088 |
70 |
2 Additional Time Control Zeros + 2 additional replicates of XAV939 |
220322 |
200522 |
1,038,088 |
70 |
|||
exp271221.tsv |
exp271221 |
271221 |
140122 |
439,396 |
82 |
2 Additional Time Control Zeros |
271221 |
180122 |
439,396 |
82 |
|||
exp281022gammeWithCTR.tsv |
exp281022_time_course |
281022 |
110723 |
509,473 |
56 |
Previously missing 3m_osim4_100n_exp180123_run110723_44 replicate is avalaible again in new dataset -> now 4 replicates // However, all 4 controls, such as 9d_Contro2_000u_exp281022_run110723_02 (from 01 to 04) are missing in new dataset, while present in older dataset // Finally, 4 new 'T-75' controls appear in new dataset, but `CTRL3-T75_000u_exp281022_run110723_24` and `CTRL1-T75_000u_exp281022_run110723_22` only contain null values -> we choose to discard them!! // in contrast with other batches, newest file provides signficant barcode values, that are not null!! (around 1%).
|
071122 |
110723 |
509,473 |
56 |
Previously missing 3m_osim4_100n_exp180123_run110723_44 replicate is avalaible again in new dataset -> now 4 replicates // However, all 4 controls, such as 9d_Contro2_000u_exp281022_run110723_02 (from 01 to 04) are missing in new dataset, while present in older dataset // Finally, 4 new 'T-75' controls appear in new dataset, but `CTRL3-T75_000u_exp281022_run110723_24` and `CTRL1-T75_000u_exp281022_run110723_22` only contain null values -> we choose to discard them!! // in contrast with other batches, newest file provides signficant barcode values, that are not null!! (around 1%).
|
||
181122 |
110723 |
509,473 |
56 |
|||
021222 |
110723 |
509,473 |
56 |
|||
161222 |
110723 |
509,473 |
56 |
|||
180123 |
110723 |
509,473 |
56 |
|||
exp281022t25.tsv |
exp281022_t25 |
281022 |
141222 |
315,642 |
22 |
What's T25? 8 controls, for which reason? One mentioned as cDNA (first 4). Older dataset was more comprehensive, so we keep the oldest one. |
exp300821.tsv |
exp300821 |
300821 |
290921 |
413,335 |
31 |
Additional Sorafenib experience, with 005u concentration, 3X replicates only!! |
From Table 1.1:
-
Number of distinct time points thaws: 16, returned by
{r} length(unique(barcode_summaries$Date))
- Number of runs: 13, based on the total number of Barcode sequencing Runs.
-
Number of batches: 14, based on Batch IDs, and returned by
{r} length(unique(barcode_summaries$Batch_ID))
-
Technical covariates: 25. Is computed as the Cartesian product of batches per run, since both Date of experiment and sequencing could impact the output, as returned by
{r} nrow(barcode_summaries)
.
1.3 DrugSimDB
to map each compound a Pathway and MoA
- Map
DrugBank-ID
to their raw generic drug name, withdrug_id_mapping.tsv
.
Code
# check after Samples_ID, to highlight duplicates
compound_metadata <- readr::read_csv(paste0("./data/replicates_barcode_metadata_comprehensive_",
today_date, ".csv"),
show_col_types = FALSE) |>
dplyr::filter(!Pathway %in% c("Control", "Drug combinations", "Novel compound", "Control - time zero")) |>
dplyr::distinct(Pathway, Compound)
utils::download.file(url = "https://raw.githubusercontent.com/iit-Demokritos/drug_id_mapping/refs/heads/main/drug-mappings.tsv",
destfile = "./data/drugbank/drug-mappings.tsv")
drug_mappings <- readr::read_tsv("./data/drugbank/drug-mappings.tsv")
compound_metadata_mapped <- compound_metadata |>
fuzzyjoin::stringdist_left_join(drug_mappings |>
dplyr::select(drugbankId, name),
by = c(Compound = "name"),
# https://www.rdocumentation.org/packages/stringdist/versions/0.9.15/topics/stringdist-metrics, Damerau-Levenshtein distance
method = "osa",
max_dist = 2,
distance_col = "distance") |>
dplyr::rename(Generic_Name = name, Compound_Luca = Compound) |>
dplyr::relocate(Generic_Name, .after = Compound_Luca)
openxlsx::write.xlsx(compound_metadata_mapped,
file = "./data/drugbank/mapped_drugs_barcode.xlsx",
asTable = TRUE)
Example:
DrugBankID: DB12218
to itsgeneric name: Capivasertib
.-
Retrieve raw drugbank XML, to map each generic drug name to its corresponding
DrugBank-ID
.- Create a free account on
Synapse
, a website dedicated to open-source. - Use
drugbankR
package to 1) convert thexml
file format ofDrugBank
into a tidy dataframe, and 2) useSQLite
syntax to explore it. - (Alternative): Perform API requests directly on
DrugBank
, but it requires to log in as an academic, and might be slower than querying a local database.
- Create a free account on
1.3.1 DrugSimDB
and similarity scores:
Update drug-drug similarity measures based on the latest version of
Drugbank
, a tutorial mixingshell
commands and R scripts.Main Similarity scores resulting from running DD steps of
DrugBank
. Similarity scores cover 6 items:Chemical structure
,Targets
,Gene Ontology
,Cellular component
,Molecular Function
andBiological process of induced pathways
.
Code
library(httr)
library(jsonlite)
# Get InChIKey from PubChem for Levofloxacin (or use webchem)
inchikey <- "GSDSWSVVBLHKDQ-UHFFFAOYSA-N"
# Use UniChem to get DrugBank ID
url <- paste0("https://www.ebi.ac.uk/unichem/rest/inchikey/", inchikey)
res <- fromJSON(url)
url_base <- "https://go.drugbank.com/drugs"
drug_identifier <- "DB00358"
drug_detailled_information <- httr2::request(url_base) |>
httr2::req_url_path_append(drug_identifier) |>
httr2::req_perform() |>
httr2::resp_body_json()
url_base <- paste0("https://go.drugbank.com/drugs", drug_identifier)
res <- jsonlite::fromJSON(url_base)
renv::install("girke-lab/drugbankR")
drugbank_dataframe <- drugbankR::dbxml2df( xmlfile = "data/drugbank/drugbank.xml",
version = "5.1.3")
1.4 Phenotype metadata cleaning
Most of the R instructions reported in this section aim at rendering the Table of compounds
file compliant with the tidy
format, see Tip 1.1. These data-wrangling operations are split in 4 steps:
- In Listing 1.1, we homogenise
Concentrations
,Date
andDuration
to ISO standards, while dealing with erroneous missing values generated by fused Excel cells.
Code
pheno_colnames <- c(Old_Pathway = "Pathways", Old_MoA = "...2",
Compound = "Samples", Replicates = "Replicates",
Concentrations_Formatted = "Concentrations", Date = "Experiment date",
`Run date` = "Run date", Duration_ID = "Duration",
Kept = "Kept", Comments = "Comments")
## prune irrelevant colnames ----
samples_metadata <- readxl::read_xlsx("data/Table of compounds_whole_2025-07-01.xlsx",
sheet = "Experimental Design",
col_types = c(rep("text", 3), "numeric",
rep("text", 4), "logical", "text")) |>
dplyr::rename(dplyr::all_of(pheno_colnames)) |>
tidyr::fill(Old_Pathway, Old_MoA, .direction = "down") |>
tidyr::fill(Compound, .direction = "down") |>
dplyr::mutate(Old_Pathway = if_else(is.na(Old_Pathway), Old_MoA, Old_Pathway))
# integrate new pathway information from Luca ----
updated_MoA_pathway <- readxl::read_xlsx("data/Table of compounds_whole_2025-07-01.xlsx",
sheet = "Pathway Mapping")
samples_metadata <- samples_metadata |>
dplyr::inner_join(updated_MoA_pathway,
by = join_by(Compound))
## format concentrations, creating both 'Concentrations_ID' and 'Concentrations' in Moles ----
samples_metadata <- samples_metadata |>
dplyr::mutate(Concentrations = if_else(Compound == "Osimertinb+sorafenib", "015 uM", Concentrations_Formatted)) |>
tidyr::separate_wider_delim(Concentrations, delim = " ", names = c("Concentrations Value", "Unit"), cols_remove = FALSE) |>
dplyr::mutate(Unit = case_when(
Unit %in% c("uM", "µM", "µg/ml") ~ "u",
Unit == "nM" ~ "n",
Unit %in% c("mg/kg", "mM") ~ "m",
)) |>
dplyr::mutate(Concentrations=parse_number(`Concentrations Value`, trim_ws = TRUE,
locale = locale(decimal_mark = ",", grouping_mark = "."))) |>
dplyr::mutate(Concentrations = case_when(
Unit == "u" ~ Concentrations *10^-6,
Unit == "n" ~ Concentrations *10^-9,
Unit == "m" ~ Concentrations *10^-3)) |>
dplyr::mutate(`Concentrations Value`= parse_number(`Concentrations Value`, trim_ws = TRUE,
locale = locale(decimal_mark = ",", grouping_mark = ".")) |>
format_concentrations()) |>
tidyr::unite(col ="Concentrations_ID", `Concentrations Value`, Unit, sep = "")
## format Durations, creating both 'Duration_ID' and 'Duration' as an integer value in Days ----
samples_metadata <- samples_metadata |>
## extract last 6 numbers, as the only ones used for identification of samples downstream
dplyr::mutate(Duration = parse_number(Duration_ID, trim_ws = TRUE,
locale = locale(decimal_mark = ".")),
Duration = dplyr::if_else(grepl("m$", Duration_ID),
Duration*30, Duration))
rm(updated_MoA_pathway)
- In Listing 1.2, we extract from each
replicate label
patterns enabling its mapping toTable of compounds
, and usessheet:'Drugs Mapping'
to map the prefix to its completeDrugBank ID
.
Code
batch_mapping <- readxl::read_xlsx("data/Table of compounds_whole_2025-07-01.xlsx",
sheet = "Batch Mapping")
barcode_counts_paths <- batch_mapping |>
dplyr::pull(Filename) |> unique()
## step 2) extract individual replicate IDs directly from sample experiences ----
ID_mapping <- purrr::map(barcode_counts_paths, function(Filename) {
## build path
counts_path <- file.path("data","barcode-counts",
paste0(Filename))
## extract replicate names (reading the header)
replicates_ID <- strsplit(readLines(counts_path, n = 1), split = "\t")[[1]] |>
str_subset("\\S") |> ## remove empty strings
str_subset("barcode_id", negate = TRUE)
## ID: combination of letters, numbers and -, followed by '[1-4]_', starting the sample's name
dataset_ID <- tibble::tibble(Filename = Filename,
Date = stringr::str_extract(replicates_ID,
"(?<=_exp)[[:alnum:]]{6}"),
Replicates_ID = replicates_ID,
`Run date` = stringr::str_extract(replicates_ID,
"(?<=_run)[[:alnum:]]{6}(?=_)"),
Concentrations_ID = stringr::str_extract(replicates_ID,
"(?<=[1-8]_)[[:digit:]]{3}[gmnux]{1}(?=_exp)")) |>
## account for second notation syntax for concentrations, namely for decimal values
dplyr::mutate(Concentrations_ID = if_else(is.na(Concentrations_ID),
stringr::str_extract(replicates_ID,
"(?<=[1-8]_)[0-9]{1}\\.[0-9]{2}[gmnux]{1}(?=_exp)"),
Concentrations_ID))
## Deal with time-course specific nomenclature
if (grepl("281022(.*)gamme", Filename)) {
dataset_ID <- dataset_ID |>
dplyr::mutate(Samples_ID = stringr::str_extract(replicates_ID,
"(?<=_)[[[:alnum:]]\\-]+(?=[1-8]_)"),
Duration_ID=stringr::str_extract(replicates_ID,
"^[[[:alnum:]]\\.]{1,3}[dm](?=_)"))
# deal with T-75 Controls
dataset_ID <- dataset_ID |>
dplyr::mutate(Samples_ID = if_else(grepl("^CTRL[[:digit:]]{1}-T75", Replicates_ID), "CTRL", Samples_ID),
Duration_ID = if_else(grepl("^CTRL[[:digit:]]{1}-T75", Replicates_ID), "9d", Duration_ID))
}
else {
dataset_ID <- dataset_ID |>
dplyr::mutate(Samples_ID = stringr::str_extract(replicates_ID,
"^[[[:alnum:]]\\-]+(?=[1-8]{1}_)"),
Duration_ID = if_else(grepl("^Temps0", Replicates_ID),"0d", "9d"))
}
return(dataset_ID)
}) |>
purrr::list_rbind()
## deal with specific p42 and p43
ID_mapping <- ID_mapping |>
dplyr::mutate(Samples_ID = if_else(grepl("^p42", Replicates_ID), "p42", Samples_ID),
Samples_ID = if_else(grepl("^p43", Replicates_ID), "p43", Samples_ID))
## detect bad extraction of patterns samples
# ID_mapping_missing <- ID_mapping |>
# filter(if_any(everything(), is.na))
## step 3) map short compound IDs with full compound names ----
mapping_compounds <- readxl::read_xlsx("data/Table of compounds_whole_2025-07-01.xlsx",
sheet = "Drugs Mapping")
ID_mapping <- ID_mapping |>
inner_join(mapping_compounds, by = "Samples_ID")
ID_mapping <- ID_mapping |>
dplyr::inner_join(batch_mapping,
by = dplyr::join_by(Filename, Date, `Run date`)) |>
dplyr::select(-`General Comments Per Batch`)
rm(batch_mapping, mapping_compounds)
- In Table 1.2 is the final, curated metadata phenotype table, adhering to tidy principles Tip 1.1. This includes:
- Standardisation of Dates to
ISO 1860 Date format
.
Code
## step 1) Use international date formats ----
replicates_metadata <- samples_metadata |>
dplyr::inner_join(ID_mapping,
by = dplyr::join_by(Compound, Date, `Run date`,
Duration_ID, Concentrations_ID)) |>
## convert to ISO 1860 Date format
mutate(Date = lubridate::dmy(Date) |> format(),
`Run date` = lubridate::dmy(`Run date`)) |>
# relocate colnames
dplyr::select(Batch_ID, Filename,
Pathway, Old_Pathway, MoA, Old_MoA,
Compound, Samples_ID, Kept,
Date, `Run date`, Duration, Duration_ID,
Concentrations_Formatted, Concentrations_ID, Concentrations,
Replicates, Replicates_ID, Comments)
replicates_metadata_shortened <- replicates_metadata |>
dplyr::select(-Old_Pathway, -Old_MoA, -Samples_ID,
-Duration_ID, -Concentrations_ID, -Replicates)
# replicates_metadata_deduplicated <- replicates_metadata |>
# dplyr::distinct(Replicates_ID, .keep_all = TRUE)
# all.equal(replicates_metadata, replicates_metadata_deduplicated)
## Step 2) save the tidy phenotype metatada table ----
readr::write_csv(replicates_metadata,
file = paste0("data/replicates_barcode_metadata_comprehensive_",
today_date,".csv"))
readr::write_csv(replicates_metadata_shortened,
file = paste0("data/replicates_barcode_metadata_", today_date,".csv"))
flextable::flextable(head(replicates_metadata_shortened)) |>
autofit() |>
bold(part = "header")
Batch_ID |
Filename |
Pathway |
MoA |
Compound |
Kept |
Date |
Run date |
Duration |
Concentrations_Formatted |
Concentrations |
Replicates_ID |
Comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|
exp010821 |
exp010821.tsv |
Control - time zero |
Control - time zero |
Control - time zero |
TRUE |
2021-08-01 |
2021-08-18 |
0 |
000 uM |
0 |
Temps01_000u_exp010821_run180821_01 |
|
exp010821 |
exp010821.tsv |
Control - time zero |
Control - time zero |
Control - time zero |
TRUE |
2021-08-01 |
2021-08-18 |
0 |
000 uM |
0 |
Temps02_000u_exp010821_run180821_02 |
|
exp040821 |
exp040821.tsv |
Control - time zero |
Control - time zero |
Control - time zero |
TRUE |
2021-08-04 |
2021-08-18 |
0 |
000 uM |
0 |
Temps01_000u_exp040821_run180821_23 |
|
exp040821 |
exp040821.tsv |
Control - time zero |
Control - time zero |
Control - time zero |
TRUE |
2021-08-04 |
2021-08-18 |
0 |
000 uM |
0 |
Temps02_000u_exp040821_run180821_24 |
|
exp130921 |
exp130921.tsv |
Control - time zero |
Control - time zero |
Control - time zero |
TRUE |
2021-09-13 |
2021-09-29 |
0 |
000 uM |
0 |
Temps01_000u_exp130921_run290921_48 |
|
exp130921_in_vivo |
exp130921.tsv |
Control - time zero |
Control - time zero |
Control - time zero |
TRUE |
2021-09-13 |
2021-11-10 |
0 |
000 uM |
0 |
Temps02_000u_exp130921_run101121_19 |
In vivo, tumoral samples |
- Each variable has its own column.
- Each observation has its own row – Every row corresponds to one observation.
- Each value has its own cell – Each cell contains a single, unique value.
This format streamlines data wrangling, and generally speaking, data analysis and visualisation. In other words, prefer simpler CSV
formats for the experimental design, and avoid cell fusion in Excel documents. Finally, for the formatted tabular representation in documentations, I use flextable
.
1.4.1 Optional data quality checks
(Optional) Listing 1.3 (a) and Listing 1.4 (a) are data integrity checks to verify each replicate feature retrieved from its labelling is consistent with Table of compounds
, sheet:'Experimental Design'
.
-
Listing 1.3 (a) identifies all replicates that could not be mapped back to their metadata description stored under
Table of compounds
, `sheet:‘Drugs Mapping’.
Code
replicates_discarded <- dplyr::anti_join(ID_mapping, samples_metadata,
by = join_by(Date, `Run date`, Concentrations_ID,
Duration_ID, Compound))
pheno_data_unmapped <- dplyr::anti_join(samples_metadata,
ID_mapping,
by = join_by(Date, `Run date`, Concentrations_ID,
Duration_ID, Compound))
pheno_data_unmapped |>
arrange(Compound, Date, Concentrations_ID) |>
flextable() |>
bold(part = "header")
anti_join
between metadata table and barcode counts to identify replicates that could not be mapped back, and reciprocally.
-
Table 1.4 is used for checking discrepancies between the number of barcode replicates reported in
Table of compounds
and the number of replicates actually observed in barcode counts.
Code
replicate_inconsistencies <- samples_metadata |>
dplyr::inner_join(ID_mapping,
by = dplyr::join_by(Compound,
Date, `Run date`,
Duration_ID, Concentrations_ID)) |>
group_by(Batch_ID, Compound, Date, Replicates, Concentrations_ID,Duration_ID) |>
summarise(n = n()) |>
filter(n != Replicates)
flextable(replicate_inconsistencies) |>
autofit() |>
bold(part = "header")
In-vivo = cells injected in living mice,
control
versusosermintinimb
.↩︎