1 Data management: metadata correction and cleansing

1.1 Reproducibility

I list below the R packages required to reproduce the analyses.

Code

## data wrangling
library(dplyr)
library(tidyr)
library(stringr)
library(readr)
library(readxl)
library(purrr)

## reporting
library(flextable)

## auxiliary functions
source("R/utils.R")
today_date <- "2025-07-01"
# today_date <- format(Sys.Date(), "%Y-%m-%d")

1.2 Barcode counts cleaning

Original Google Drive repository.
Below, we convert each .TABULAR file to its recommended .tsv file extension, and rename unequivocally the first colname into barcode_id:

Code

barcode_files <- list.files("./data/barcode-counts/", 
                               pattern = "exp.*\\.tabular", 
                               full.names = TRUE)

## change format + barcode name
barcode_matrices <- lapply(barcode_files, function(old_filename) {
  expression_matrix <- readxl::read_tsv(old_filename) |> 
    rename(barcode_id = 1)
  
  new_filename <- old_filename |> basename() |> tools::file_path_sans_ext()
  
  readr::write_tsv(expression_matrix, 
                   file = file.path("data", "barcode-counts", paste0(new_filename, ".tsv")))
  unlink(old_filename, force = TRUE)
  
})

I rename the first colname of each expression file as barcode_id, storing unequivocal tag.
List of changes applied for easier mapping between replicates’ barcode profiles and samples’ metadata:
- exp130921, switch from Osimer1_001g_exp130921_run101121_28 to Osimer1_001m_exp130921_run101121_28, from 001g to 001m (in-vivo context), and switch from Azacyt1_1,5u_exp130921_run290921_40 to Azacyt1_1.50u_exp130921_run290921_40, 1,5u to 1.50u¹.
- exp200921: switch from Bafilo1_1,2n_exp200921_run111021_45 to Bafilo1_1.20n_exp200921_run111021_45, 1,2n to 1.20n. In addition, the 4 control replicates of dose-response batch exp200921_dose_response are shared with standard experiment exp200921.csv.
- exp151121: switch from Mitomy1_002x_exp151121_run071221_37 to Mitomy1_0.02u_exp151121_run071221_37, 002x to 0.02u.
- exp070222: switch from Tunica1_0,5x_exp070222_run010322_05 to Tunica1_0.50u_exp070222_run010322_05, 0,5x to 0.50u.
- exp070222t25: more comprehensive, so we keep the oldest one (no replicate cDNA missing). From cDNA1_000u_exp281022_run141222_01 to Contro1_000u_exp281022_run141222_01! What does cDNA refer to as?.
- exp281022t25: one control and 3 Drug Compounds were sampled in T25 flasks. P42 and P43 are used for evaluating barcode drift, a technical artefact in relation with depth sequencing. Most of the replicates lack of significant effect due to insufficient dose.
exp281022gamme refers to the time course experiment.

We report in Table 1.1 a summary of the barcode counts (File location, batch_id, and dimensions).

Code

barcode_files <- list.files("./data/barcode-counts/",
                            pattern = "exp.*\\.tsv",
                            full.names = TRUE)
barcode_summaries <- purrr::map(barcode_files, function(filename) {
  barcode_counts <- readr::read_tsv(filename, show_col_types = FALSE)
  experience_summary <- tibble::tibble(Filename = basename(filename),
                                       `Num Barcode IDs` = nrow(barcode_counts),
                                       `Num Replicates` = ncol(barcode_counts) - 1)
  return(experience_summary)
}) |>
  purrr::list_rbind()

barcode_summaries <- barcode_summaries |>
  dplyr::inner_join(readxl::read_excel("data/Table of compounds_whole_2025-07-01.xlsx",
    sheet = "Batch Mapping"),
    by = "Filename") |>
  dplyr::relocate(Batch_ID, .after = Filename) |>
  dplyr::relocate(`Num Barcode IDs`, `Num Replicates`, .after = `Run date`)

flextable::flextable(barcode_summaries) |>
  bold(part = "header") |> 
  flextable::merge_v(j = c("Filename", "Batch_ID", "General Comments Per Batch"),
                      part = "body")

openxlsx::write.xlsx(x = barcode_summaries, asTable = TRUE,
                     file = paste0("./data/barcode_counts_summaries_", today_date, ".xlsx"))

Filename	Batch_ID	Date	Run date	Num Barcode IDs	Num Replicates	General Comments Per Batch
exp010821.tsv	exp010821	010821	180821	1,168,726	22	2 Control Time Zeros + 4 replicates of X13271, compared to old
exp040821.tsv	exp040821	040821	180821	1,168,726	22	2 Control Time Zeros + 4 replicates of X13271, compared to old (same remark as exp040821)
exp070222.tsv	exp070222	070222	010322	822,697	70	2 Additional Time Control Zeros
exp070222.tsv	exp070222	070222	060522	822,697	70	2 Additional Time Control Zeros
exp130921.tsv	exp130921	130921	290921	800,949	42	In new dataset, 2 additional Control Time Zeros and in vivo mixed with standard experiences.
	exp130921	130921	111021	800,949	42
	exp130921_in_vivo	130921	101121	800,949	42
exp151121.tsv	exp151121	151121	071221	628,033	62	2 Additional Time Control Zeros
exp151121.tsv	exp151121	151121	180122	628,033	62
exp181021.tsv	exp181021	181021	101121	631,355	61
exp181021.tsv	exp181021	181021	251121	631,355	61
exp200921.tsv	exp200921	200921	111021	518,133	58	Merged with dose response.
exp200921.tsv	exp200921_dose_response	200921	101121	518,133	58	Merged with 200921 standard +2 Additional Time Control Zeros. Does not contain controls, so merged with exp200921!!
exp220322.tsv	exp220322	220322	060522	1,038,088	70	2 Additional Time Control Zeros + 2 additional replicates of XAV939
exp220322.tsv	exp220322	220322	200522	1,038,088	70
exp271221.tsv	exp271221	271221	140122	439,396	82	2 Additional Time Control Zeros
exp271221.tsv	exp271221	271221	180122	439,396	82	2 Additional Time Control Zeros
exp281022gammeWithCTR.tsv	exp281022_time_course	281022	110723	509,473	56	Previously missing 3m_osim4_100n_exp180123_run110723_44 replicate is avalaible again in new dataset -> now 4 replicates // However, all 4 controls, such as 9d_Contro2_000u_exp281022_run110723_02 (from 01 to 04) are missing in new dataset, while present in older dataset // Finally, 4 new 'T-75' controls appear in new dataset, but `CTRL3-T75_000u_exp281022_run110723_24` and `CTRL1-T75_000u_exp281022_run110723_22` only contain null values -> we choose to discard them!! // in contrast with other batches, newest file provides signficant barcode values, that are not null!! (around 1%). Accordingly, plug-in old controls at 9 days in the newest format.
		071122	110723	509,473	56	Previously missing 3m_osim4_100n_exp180123_run110723_44 replicate is avalaible again in new dataset -> now 4 replicates // However, all 4 controls, such as 9d_Contro2_000u_exp281022_run110723_02 (from 01 to 04) are missing in new dataset, while present in older dataset // Finally, 4 new 'T-75' controls appear in new dataset, but `CTRL3-T75_000u_exp281022_run110723_24` and `CTRL1-T75_000u_exp281022_run110723_22` only contain null values -> we choose to discard them!! // in contrast with other batches, newest file provides signficant barcode values, that are not null!! (around 1%). Accordingly, plug-in old controls at 9 days in the newest format.
		181122	110723	509,473	56
		021222	110723	509,473	56
		161222	110723	509,473	56
		180123	110723	509,473	56
exp281022t25.tsv	exp281022_t25	281022	141222	315,642	22	What's T25? 8 controls, for which reason? One mentioned as cDNA (first 4). Older dataset was more comprehensive, so we keep the oldest one.
exp300821.tsv	exp300821	300821	290921	413,335	31	Additional Sorafenib experience, with 005u concentration, 3X replicates only!!

Table 1.1: Main Batch features.

From Table 1.1:

Number of distinct time points thaws: 16, returned by {r} length(unique(barcode_summaries$Date))
Number of runs: 13, based on the total number of Barcode sequencing Runs.
Number of batches: 14, based on Batch IDs, and returned by {r} length(unique(barcode_summaries$Batch_ID))
Technical covariates: 25. Is computed as the Cartesian product of batches per run, since both Date of experiment and sequencing could impact the output, as returned by {r} nrow(barcode_summaries).

1.3 `DrugSimDB` to map each compound a Pathway and MoA

Map DrugBank-ID to their raw generic drug name, with drug_id_mapping.tsv.

Code

# check after Samples_ID, to highlight duplicates

compound_metadata <- readr::read_csv(paste0("./data/replicates_barcode_metadata_comprehensive_",
                                     today_date, ".csv"), 
                              show_col_types = FALSE) |> 
  dplyr::filter(!Pathway %in% c("Control", "Drug combinations", "Novel compound", "Control - time zero")) |> 
  dplyr::distinct(Pathway, Compound)

utils::download.file(url = "https://raw.githubusercontent.com/iit-Demokritos/drug_id_mapping/refs/heads/main/drug-mappings.tsv", 
                     destfile = "./data/drugbank/drug-mappings.tsv")
drug_mappings <- readr::read_tsv("./data/drugbank/drug-mappings.tsv")

compound_metadata_mapped <- compound_metadata |> 
  fuzzyjoin::stringdist_left_join(drug_mappings |> 
                                    dplyr::select(drugbankId, name),
                                  by = c(Compound = "name"),
 # https://www.rdocumentation.org/packages/stringdist/versions/0.9.15/topics/stringdist-metrics, Damerau-Levenshtein distance                             
                                  method = "osa", 
                                  max_dist = 2,
                                  distance_col = "distance") |> 
  dplyr::rename(Generic_Name = name, Compound_Luca = Compound) |> 
  dplyr::relocate(Generic_Name, .after = Compound_Luca)

openxlsx::write.xlsx(compound_metadata_mapped,
                     file = "./data/drugbank/mapped_drugs_barcode.xlsx", 
                     asTable = TRUE)

Example: DrugBankID: DB12218 to its generic name: Capivasertib.
Retrieve raw drugbank XML, to map each generic drug name to its corresponding DrugBank-ID.
1. Create a free account on Synapse, a website dedicated to open-source.
2. Use drugbankR package to 1) convert the xml file format of DrugBank into a tidy dataframe, and 2) use SQLite syntax to explore it.
3. (Alternative): Perform API requests directly on DrugBank, but it requires to log in as an academic, and might be slower than querying a local database.

1.3.1 `DrugSimDB` and similarity scores:

Update drug-drug similarity measures based on the latest version of Drugbank, a tutorial mixing shell commands and R scripts.
Main Similarity scores resulting from running DD steps of DrugBank. Similarity scores cover 6 items: Chemical structure,Targets, Gene Ontology, Cellular component, Molecular Function and Biological process of induced pathways.

Code

library(httr)
library(jsonlite)

# Get InChIKey from PubChem for Levofloxacin (or use webchem)
inchikey <- "GSDSWSVVBLHKDQ-UHFFFAOYSA-N"

# Use UniChem to get DrugBank ID
url <- paste0("https://www.ebi.ac.uk/unichem/rest/inchikey/", inchikey)
res <- fromJSON(url)


url_base <- "https://go.drugbank.com/drugs"
drug_identifier <- "DB00358"
drug_detailled_information <- httr2::request(url_base) |>
  httr2::req_url_path_append(drug_identifier) |>
  httr2::req_perform() |>
  httr2::resp_body_json()

url_base <- paste0("https://go.drugbank.com/drugs", drug_identifier)
res <- jsonlite::fromJSON(url_base)




renv::install("girke-lab/drugbankR")
drugbank_dataframe <- drugbankR::dbxml2df( xmlfile = "data/drugbank/drugbank.xml",
                                           version = "5.1.3")

1.4 Phenotype metadata cleaning

Most of the R instructions reported in this section aim at rendering the Table of compounds file compliant with the tidy format, see Tip 1.1. These data-wrangling operations are split in 4 steps:

In Listing 1.1, we homogenise Concentrations, Date and Duration to ISO standards, while dealing with erroneous missing values generated by fused Excel cells.

Listing 1.1: Coerce original Excel file reporting experimental design to tidy format.

Code

pheno_colnames <- c(Old_Pathway = "Pathways", Old_MoA = "...2", 
                    Compound = "Samples", Replicates = "Replicates",
                    Concentrations_Formatted = "Concentrations", Date = "Experiment date",
                    `Run date` = "Run date", Duration_ID = "Duration", 
                    Kept = "Kept", Comments = "Comments")

## prune irrelevant colnames ----

samples_metadata <- readxl::read_xlsx("data/Table of compounds_whole_2025-07-01.xlsx",
                                sheet = "Experimental Design",
                                col_types = c(rep("text", 3), "numeric", 
                                              rep("text", 4), "logical", "text")) |>
  dplyr::rename(dplyr::all_of(pheno_colnames)) |>
  tidyr::fill(Old_Pathway, Old_MoA, .direction = "down") |> 
  tidyr::fill(Compound, .direction = "down") |>
  dplyr::mutate(Old_Pathway = if_else(is.na(Old_Pathway), Old_MoA, Old_Pathway)) 

# integrate new pathway information from Luca ----
updated_MoA_pathway <- readxl::read_xlsx("data/Table of compounds_whole_2025-07-01.xlsx", 
                                                       sheet = "Pathway Mapping")
samples_metadata <- samples_metadata |> 
  dplyr::inner_join(updated_MoA_pathway,
                   by = join_by(Compound)) 

## format concentrations, creating both 'Concentrations_ID' and 'Concentrations' in Moles ----
samples_metadata <- samples_metadata |> 
  dplyr::mutate(Concentrations = if_else(Compound == "Osimertinb+sorafenib", "015 uM", Concentrations_Formatted)) |>
  tidyr::separate_wider_delim(Concentrations, delim = " ", names = c("Concentrations Value", "Unit"), cols_remove = FALSE) |>
  dplyr::mutate(Unit = case_when(
    Unit %in% c("uM", "µM", "µg/ml") ~ "u",
    Unit == "nM" ~ "n", 
    Unit %in% c("mg/kg", "mM") ~ "m", 
  )) |> 
  dplyr::mutate(Concentrations=parse_number(`Concentrations Value`, trim_ws = TRUE, 
                                            locale = locale(decimal_mark = ",", grouping_mark = "."))) |> 
   dplyr::mutate(Concentrations = case_when(
    Unit == "u" ~ Concentrations *10^-6,
    Unit == "n" ~ Concentrations *10^-9, 
    Unit == "m" ~ Concentrations *10^-3)) |> 
  dplyr::mutate(`Concentrations Value`= parse_number(`Concentrations Value`, trim_ws = TRUE, 
                                            locale = locale(decimal_mark = ",", grouping_mark = ".")) |> 
                  format_concentrations()) |> 
  tidyr::unite(col ="Concentrations_ID", `Concentrations Value`, Unit, sep = "")

## format Durations, creating both 'Duration_ID' and 'Duration' as an integer value in Days ----
samples_metadata <- samples_metadata |> 
  ## extract last 6 numbers, as the only ones used for identification of samples downstream
  dplyr::mutate(Duration = parse_number(Duration_ID, trim_ws = TRUE, 
                                        locale = locale(decimal_mark = ".")), 
                Duration = dplyr::if_else(grepl("m$",  Duration_ID),
                                          Duration*30, Duration)) 

rm(updated_MoA_pathway)

In Listing 1.2, we extract from each replicate label patterns enabling its mapping to Table of compounds, and uses sheet:'Drugs Mapping' to map the prefix to its complete DrugBank ID.

Listing 1.2: Retrive batches of interest.

Code

batch_mapping <- readxl::read_xlsx("data/Table of compounds_whole_2025-07-01.xlsx",
                                       sheet = "Batch Mapping")
barcode_counts_paths <- batch_mapping |>
  dplyr::pull(Filename) |> unique()

## step 2) extract individual replicate IDs directly from sample experiences ----
ID_mapping <- purrr::map(barcode_counts_paths, function(Filename) {
  ## build path
  counts_path <- file.path("data","barcode-counts", 
                               paste0(Filename))
  
  ## extract replicate names (reading the header)
  replicates_ID <- strsplit(readLines(counts_path, n = 1), split = "\t")[[1]] |>
    str_subset("\\S") |> ## remove empty strings
    str_subset("barcode_id", negate = TRUE)

  ## ID: combination of letters, numbers and -, followed by '[1-4]_', starting the sample's name
  dataset_ID <- tibble::tibble(Filename = Filename, 
                               Date = stringr::str_extract(replicates_ID,
                                                           "(?<=_exp)[[:alnum:]]{6}"),
                               Replicates_ID = replicates_ID,
                               `Run date` = stringr::str_extract(replicates_ID,
                                                                 "(?<=_run)[[:alnum:]]{6}(?=_)"),
                               Concentrations_ID = stringr::str_extract(replicates_ID, 
                                                                        "(?<=[1-8]_)[[:digit:]]{3}[gmnux]{1}(?=_exp)"))  |> 
    ## account for second notation syntax for concentrations, namely for decimal values
    dplyr::mutate(Concentrations_ID = if_else(is.na(Concentrations_ID),
                                              stringr::str_extract(replicates_ID,
                                                                   "(?<=[1-8]_)[0-9]{1}\\.[0-9]{2}[gmnux]{1}(?=_exp)"),
                                              Concentrations_ID))
  
  ## Deal with time-course specific nomenclature
  if (grepl("281022(.*)gamme", Filename)) {
    dataset_ID <- dataset_ID |> 
      dplyr::mutate(Samples_ID = stringr::str_extract(replicates_ID,
                                                      "(?<=_)[[[:alnum:]]\\-]+(?=[1-8]_)"), 
                    Duration_ID=stringr::str_extract(replicates_ID,
                                                     "^[[[:alnum:]]\\.]{1,3}[dm](?=_)"))
    # deal with T-75 Controls
    dataset_ID <- dataset_ID |> 
      dplyr::mutate(Samples_ID = if_else(grepl("^CTRL[[:digit:]]{1}-T75", Replicates_ID), "CTRL", Samples_ID), 
                    Duration_ID = if_else(grepl("^CTRL[[:digit:]]{1}-T75", Replicates_ID), "9d", Duration_ID))
  }
  else {
    dataset_ID <- dataset_ID |> 
      dplyr::mutate(Samples_ID = stringr::str_extract(replicates_ID,
                                                      "^[[[:alnum:]]\\-]+(?=[1-8]{1}_)"), 
                    Duration_ID = if_else(grepl("^Temps0", Replicates_ID),"0d", "9d"))
  }
                                 
  return(dataset_ID)
}) |>
  purrr::list_rbind()

## deal with specific p42 and p43
ID_mapping <- ID_mapping |>
  dplyr::mutate(Samples_ID = if_else(grepl("^p42", Replicates_ID), "p42", Samples_ID),
                Samples_ID = if_else(grepl("^p43", Replicates_ID), "p43", Samples_ID))

## detect bad extraction of patterns samples
# ID_mapping_missing <- ID_mapping |>
#   filter(if_any(everything(), is.na))

## step 3) map short compound IDs with full compound names ----
mapping_compounds <- readxl::read_xlsx("data/Table of compounds_whole_2025-07-01.xlsx",
                                sheet = "Drugs Mapping")
ID_mapping <- ID_mapping |> 
  inner_join(mapping_compounds, by = "Samples_ID")

ID_mapping <- ID_mapping |> 
  dplyr::inner_join(batch_mapping, 
                    by = dplyr::join_by(Filename, Date, `Run date`)) |> 
  dplyr::select(-`General Comments Per Batch`)

rm(batch_mapping, mapping_compounds)

In Table 1.2 is the final, curated metadata phenotype table, adhering to tidy principles Tip 1.1. This includes:

Standardisation of Dates to ISO 1860 Date format.

Listing 1.3: Save the final metadata annotations for barcodes.

Code

## step 1) Use international date formats ----
replicates_metadata <- samples_metadata |> 
  dplyr::inner_join(ID_mapping,
                    by = dplyr::join_by(Compound, Date, `Run date`,
                                        Duration_ID, Concentrations_ID)) |> 
  ## convert to ISO 1860 Date format
  mutate(Date = lubridate::dmy(Date) |> format(),
         `Run date` = lubridate::dmy(`Run date`)) |> 
  # relocate colnames
  dplyr::select(Batch_ID, Filename, 
                Pathway, Old_Pathway, MoA, Old_MoA, 
                Compound, Samples_ID, Kept,
                Date, `Run date`, Duration, Duration_ID,
                Concentrations_Formatted, Concentrations_ID, Concentrations, 
                Replicates, Replicates_ID, Comments)

replicates_metadata_shortened <- replicates_metadata |> 
  dplyr::select(-Old_Pathway, -Old_MoA, -Samples_ID, 
                -Duration_ID, -Concentrations_ID, -Replicates)

# replicates_metadata_deduplicated <- replicates_metadata |>
#   dplyr::distinct(Replicates_ID, .keep_all = TRUE)
# all.equal(replicates_metadata, replicates_metadata_deduplicated)


## Step 2) save the tidy phenotype metatada table ----
readr::write_csv(replicates_metadata,
                 file = paste0("data/replicates_barcode_metadata_comprehensive_", 
                               today_date,".csv"))
readr::write_csv(replicates_metadata_shortened,
                 file = paste0("data/replicates_barcode_metadata_", today_date,".csv"))

flextable::flextable(head(replicates_metadata_shortened)) |> 
  autofit() |> 
  bold(part = "header")

Batch_ID	Filename	Pathway	MoA	Compound	Kept	Date	Run date	Replicates_ID	Comments
exp010821	exp010821.tsv	Control - time zero	Control - time zero	Control - time zero	TRUE	2021-08-01	2021-08-18	Temps01_000u_exp010821_run180821_01
exp010821	exp010821.tsv	Control - time zero	Control - time zero	Control - time zero	TRUE	2021-08-01	2021-08-18	Temps02_000u_exp010821_run180821_02
exp040821	exp040821.tsv	Control - time zero	Control - time zero	Control - time zero	TRUE	2021-08-04	2021-08-18	Temps01_000u_exp040821_run180821_23
exp040821	exp040821.tsv	Control - time zero	Control - time zero	Control - time zero	TRUE	2021-08-04	2021-08-18	Temps02_000u_exp040821_run180821_24
exp130921	exp130921.tsv	Control - time zero	Control - time zero	Control - time zero	TRUE	2021-09-13	2021-09-29	Temps01_000u_exp130921_run290921_48
exp130921_in_vivo	exp130921.tsv	Control - time zero	Control - time zero	Control - time zero	TRUE	2021-09-13	2021-11-10	Temps02_000u_exp130921_run101121_19	In vivo, tumoral samples

Table 1.2

Tip 1.1: Tidy Data Format (Key Principles)

Each variable has its own column.
Each observation has its own row – Every row corresponds to one observation.
Each value has its own cell – Each cell contains a single, unique value.

This format streamlines data wrangling, and generally speaking, data analysis and visualisation. In other words, prefer simpler CSV formats for the experimental design, and avoid cell fusion in Excel documents. Finally, for the formatted tabular representation in documentations, I use flextable.

1.4.1 Optional data quality checks

(Optional) Listing 1.3 (a) and Listing 1.4 (a) are data integrity checks to verify each replicate feature retrieved from its labelling is consistent with Table of compounds, sheet:'Experimental Design'.

Listing 1.3 (a) identifies all replicates that could not be mapped back to their metadata description stored under Table of compounds, `sheet:‘Drugs Mapping’.

(a) Extract discarded replicates

Code

replicates_discarded <- dplyr::anti_join(ID_mapping, samples_metadata, 
                                         by = join_by(Date, `Run date`, Concentrations_ID,
                                                      Duration_ID, Compound))

pheno_data_unmapped <- dplyr::anti_join(samples_metadata,
                                        ID_mapping,
                                        by = join_by(Date, `Run date`, Concentrations_ID,
                                                     Duration_ID, Compound))

pheno_data_unmapped |> 
  arrange(Compound, Date, Concentrations_ID) |> 
  flextable() |> 
  bold(part = "header")

Table 1.3: Uses anti_join between metadata table and barcode counts to identify replicates that could not be mapped back, and reciprocally.

Table 1.4 is used for checking discrepancies between the number of barcode replicates reported in Table of compounds and the number of replicates actually observed in barcode counts.

Code

replicate_inconsistencies <- samples_metadata |> 
  dplyr::inner_join(ID_mapping, 
                    by = dplyr::join_by(Compound, 
                                        Date, `Run date`, 
                                        Duration_ID, Concentrations_ID)) |>
  group_by(Batch_ID, Compound, Date, Replicates, Concentrations_ID,Duration_ID) |>
  summarise(n = n()) |>
  filter(n != Replicates)

flextable(replicate_inconsistencies) |>
  autofit() |>
  bold(part = "header")

(a) Secound round of data wrangling quality controls on counts versus phenotype data, focusing on divergent number of replicates.

Table 1.4: Secound round of data wrangling quality controls on counts versus phenotype data, focusing on divergent number of replicates.

In-vivo = cells injected in living mice, control versus osermintinimb.↩︎

1.1 Reproducibility

1.2 Barcode counts cleaning

1.3 DrugSimDB to map each compound a Pathway and MoA

1.3.1 DrugSimDB and similarity scores:

1.4 Phenotype metadata cleaning

1.4.1 Optional data quality checks

1.3 `DrugSimDB` to map each compound a Pathway and MoA

1.3.1 `DrugSimDB` and similarity scores: