--- author: | Christoph Helma^1^, David Vorgrimmler^1^, Denis Gebele^1^, Martin Gütlein^2^, Benoit Schilter^3^, Elena Lo Piparo^3^ title: | Modeling Chronic Toxicity: A comparison of experimental variability with read across predictions include-before: ^1^ in silico toxicology gmbh, Basel, Switzerland\newline^2^ Inst. f. Computer Science, Johannes Gutenberg Universität Mainz, Germany\newline^3^ Chemical Food Safety Group, Nestlé Research Center, Lausanne, Switzerland keywords: (Q)SAR, read-across, LOAEL date: \today abstract: " " documentclass: achemso bibliography: references.bibtex bibliographystyle: achemso output: pdf_document: fig_caption: yes ... Introduction ============ Christoph + Elena + Benoit The quality and reproducibility of (Q)SAR and read-across predictions is a controversial topic in the toxicological risk-assessment community. Although model predictions can be validated with various procedures it is rarely possible to put the results into the context of experimental variability, because replicate experiments are rarely available. With missing information about the variability of experimental toxicity data it is hard to judge the performance of predictive models and it is tempting for model developments to use aggressive model optimisation methods that lead to impressive validation results, but also to overfitted models with little practical relevance. In this study we intent to compare model predictions with experimental variability with chronic oral rat lowest adverse effect levels (LOAEL) as toxicity endpoint. We are using two datasets, one from [@mazzatorta08] (*Mazzatorta* dataset) and one from the Swiss Federal Office of TODO (*Swiss Federal Office* dataset). Elena: do you have a reference and the name of the department? ```{r echo=F} t = read.csv("data/test.csv") ``` `r length(unique(t$SMILES))` compounds are common in both datasets and we use them as a test set in our investigation. For this test set we will - compare the structural diversity of both datasets - compare the LOAEL values in both datasets - build prediction models based on the Mazzatorta, Swiss Federal Office datasets and a combination of both - predict LOAELs of the training set - compare predictions with experimental variability With this investigation we also want to support the idea of reproducible research, by providing all datasets and programs that have been used to generate this manuscript under a TODO license. A self-contained docker image with all program dependencies required for the reproduction of these results is available from TODO. Source code and datasets for the reproduction of this manuscript can be downloaded from the GitHub repository TODO. The lazar framework [@Maunz2013] is also available under a GPL License from https://github.com/opentox/lazar. TODO: github tags Elena: please check if this is publication strategy is ok for the Swiss Federal Office Materials and Methods ===================== Datasets -------- ```{r echo=F} m = read.csv("data/mazzatorta.csv",header=T) s = read.csv("data/swiss.csv",header=T) t = read.csv("data/test.csv",header=T) c = read.csv("data/combined.csv",header=T) ``` ### Mazzatorta dataset The first dataset (*Mazzatorta* dataset for further reference) originates from the publication of [@mazzatorta08]. It contains chronic (> 180 days) lowest observed effect levels (LOAEL) for rats (*Rattus norvegicus*) after oral (gavage, diet, drinking water) administration. The Mazzatorta dataset consists of `r length(m$SMILES)` LOAEL values for `r length(unique(m$SMILES))` unique chemical structures. ### Swiss Federal Office dataset Elena + Swiss Federal Office contribution (input) The Swiss Federal Office dataset consists of `r length(s$SMILES)` LOAEL values for `r length(unique(s$SMILES))` unique chemical structures. ### Preprocessing Chemical structures in both datasets were initially represented as SMILES strings [@doi:10.1021/ci00057a005]. Syntactically incorrect and missing SMILES were generated from other identifiers (e.g names, CAS numbers). Unique smiles from the OpenBabel library [@OBoyle2011] were used for the identification of duplicated structures. Studies with undefined or empty LOAEL entries were removed from the datasets. LOAEL values were converted to mmol/kg_bw/day units. For prediction, validation and visualisation purposes -log10 transformations are used. David: please check if we have missed something ### Derived datasets Two derived datasets were obtained from the original datasets: The *test* dataset contains data of compounds that occur in both datasets. Exact duplications of LOAEL values were removed, because it is very likely that they originate from the same study. The test dataset has `r length(t$SMILES)` LOAEL values for `r length(unique(t$SMILES))` unique chemical structures. The *combined* dataset is the union of the Mazzatorta and the Swiss Federal Office dataset and it is used to build predictive models. Exact LOAEL duplications were removed, as for the test dataset. The combined dataset has `r length(c$SMILES)` LOAEL values for `r length(unique(c$SMILES))` unique chemical structures. Algorithms ---------- In this study we are using the modular lazar (*la*zy *s*tructure *a*ctivity *r*elationships) framework [@Maunz2013] for model development and validation. lazar follows the following basic workflow: For a given chemical structure lazar - searches in a database for similar structures (*neighbors*) with experimental data, - builds a local QSAR model with these neighbors and - uses this model to predict the unknown activity of the query compound. This procedure resembles an automated version of *read across* predictions in toxicology, in machine learning terms it would be classified as a *k-nearest-neighbor* algorithm. Apart from this basic workflow lazar is completely modular and allows the researcher to use any algorithm for similarity searches and local QSAR modelling. Within this study we are using the following algorithms: ### Neighbor identification Similarity calculations are based on MolPrint2D fingerprints [@doi:10.1021/ci034207y] from the OpenBabel chemoinformatics library [@OBoyle2011]. The MolPrint2D fingerprint uses atom environments as molecular representation, which resemble basically the chemical concept of functional groups. For each atom in a molecule it represents the chemical environment using the atom types of connected atoms. MolPrint2D fingerprints are generated dynamically from chemical structures and do not rely on predefined lists of fragments (such as OpenBabel FP3, FP4 or MACCs fingerprints or lists of toxocophores/toxicophobes). This has the advantage the they may capture substructures of toxicological relevance that are not included in other fingerprints. Preliminary experiments have shown that predictions with MolPrint2D fingerprints are indeed more accurate than other OpenBabel fingerprints. From MolPrint2D fingerprints we can construct a feature vector with all atom environments of a compound, which can be used to calculate chemical similarities. [//]: # https://openbabel.org/docs/dev/FileFormats/MolPrint2D_format.html#molprint2d-format The chemical similarity between two compounds A and B is expressed as the proportion between atom environments common in both structures $A \cap B$ and the total number of atom environments $A \cup B$ (Jaccard/Tanimoto index, [@eq:jaccard]). $$ sim = \frac{|A \cap B|}{|A \cup B|} $$ {#eq:jaccard} A threshold of $sim < 0.1$ is used for the identification of neighbors for local QSAR models. Compounds with the same structure as the query structure are eliminated from the neighbors to obtain an unbiased prediction. ### Local QSAR models and predictions Only similar compounds (*neighbors*) are used for local QSAR models. In this investigation we are using a weighted partial least squares regression (PLS) algorithm for the prediction of quantitative properties. First all fingerprint features with identical values across all neighbors are removed. The reamining set of features is used as descriptors for creating a local weighted PLS model with atom environments as descriptors and model similarities as weights. The `plsr` function of the `pls` R package [@pls] is used for this purpose. Finally the local PLS model is applied to predict the activity of the query compound. If PLS modelling or prediction fails, the program resorts to using the weighted mean of the neighbors LOAEL values, where the contribution of each neighbor is weighted by its similarity to the query compound. ### Validation Two types of validations are used within this study: For the comparison of experimental variability with predictive accuracies we are using a test set of compounds that occur in both datasets. The *Mazzatorta*, *Swiss Federal Office* and *combined* datasets are used as training data for read across predictions. In order to obtain unbiased predictions *all* information from the test compound is removed from the training set prior to predictions. This procedure is hardcoded into the prediction algorithm in order to prevent validation errors. TODO: treatment of duplicates In addition traditional 10-fold crossvalidation results are provided. Christoph: check if these specifications have changed at submission Results ======= ### Dataset comparison Christoph + Elena The main objective of this section is to compare the content of both databases in terms of structural composition and LOAEL values, to estimate the experimental variability of LOAEL values and to establish a baseline for evaluating prediction performance. #### Applicability domain ##### Ches-Mapper analysis Martin CheS-Mapper (Chemical Space Mapping and Visualization in 3D, http://ches-mapper.org/, @Gütlein2012) can be used to analyze the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. CheS-Mapper embeds a dataset into 3D space, such that compounds with similar feature values are close to each other. The following two screenshots visualise the comparison. The datasets are embeded into 3D Space based on structural fragments from three Smart list (OpenBabel FP3, OpenBabel FP4 and OpenBabel MACCS). ##### Distribution of functional groups Christoph [@fig:fg] shows the frequency of selected functional groups in both datasets. A complete table for 138 functional groups from OpenBabel FP4 fingerprints can be found in the appendix. ![Frequency of functional groups.](figure/functional-groups.pdf){#fig:fg} ### Experimental variability versus prediction uncertainty ```{r echo=FALSE} source('rmse.R') ``` Christoph Duplicated LOAEL values can be found in both datasets and there is a substantial overlap of compounds, with LOAEL values in both datasets. ##### Intra dataset variability TODO: read data from files The Mazzatorta dataset has 562 LOAEL values with 439 unique structures, the Swiss Federal Office dataset has 493 rat LOAEL values with 381 unique structures. [@fig:intra] shows the intra-dataset variability, where each vertical line represents a single compound and each dot represents an individual LOAEL value. The experimental variance of LOAEL values is similar in both datasets (p-value: 0.48). [//]: # p-value: 0.4750771581019402 [//]: # ![Intra dataset variability: Each vertical line represents a compound, dots are individual LOAEL values.](loael-dataset-comparison-all-compounds.pdf){#fig:intra} ##### Inter dataset variability [@fig:inter] shows the same situation for the combination of the Mazzatorta and Swiss Federal Office datasets. Obviously the experimental variability is larger than for individual datasets. [//]: # ![Inter dataset variability](loael-dataset-comparison-common-compounds.pdf){#fig:inter} ##### LOAEL correlation between datasets [@fig:corr] depicts the correlation between LOAEL data from both datasets (using means for multiple measurements). Identical values were removed from analysis. [//]: # MAE: 0.801626064534318 [//]: # with identical values ```{r echo=F} data <- read.csv("data/common-median.csv",header=T) cor <- cor.test(-log(data$mazzatorta),-log(data$swiss)) median.r.square <- round(cor(-log(data$mazzatorta),-log(data$swiss),use='complete')^2,2) median.rmse <- round(sqrt(mean((-log(data$mazzatorta)+log(data$swiss))^2)),2) ``` Correlation analysis shows a significant correlation (p-value < 2.2e-16) with r\^2: `r round(median.r.square,2)`, RMSE: `r round(median.rmse,2)` ### Local QSAR models Christoph In order to compare the perfomance of in silico models with experimental variability we are using compounds that occur in both datasets as a test set (`r system("./nr_compounds.sh data/common-test.csv",TRUE)` compounds, `r system("./nr_measurements.sh data/common-test.csv",TRUE)` measurements). The Mazzatorta, the Swiss Federal Office dataset and a combined dataset were used as training data. Predictions for the test set compounds were made after eliminating all information from the test compound from the corresponding training dataset. [@tbl:common-pred] summarizes the results: ![Comparison of experimental with predicted LOAEL values, each vertical line represents a compound.](figure/test-prediction.pdf){#fig:comp} ```{r echo=F} source("test-correlation.R") ``` Training data | $r^2$ | RMSE --------------|---------------------------|------------------------- Experimental | `r median.r.square` | `r median.rmse` Mazzatorta | `r mazzatorta.r_square` | `r mazzatorta.rmse` Swiss Federal Office |`r swiss.r_square` | `r swiss.rmse` Combined | `r combined.r_square` | `r combined.rmse` : Comparison of model predictions with experimental variability. {#tbl:common-pred} ```{r echo=F} source("crossvalidation.R") ``` Traditional 10-fold cross-validation results are summarised in [@tbl:cv]: Training dataset | $r^2$ | RMSE -----------------|-------|------ Mazzatorta | `r round(cv.mazzatorta.r_square,2)` | `r round(cv.mazzatorta.rmse,2)` Swiss Federal Office | `r round(cv.swiss.r_square,2)` | `r round(cv.swiss.rmse,2)` Combined | `r round(cv.combined.r_square,2)` | `r round(cv.combined.rmse,2)` : 10-fold crossvalidation results {#tbl:cv} ![Correlation of experimental with predicted LOAEL values (test set)](figure/test-correlation.pdf){} ![Correlation of experimental with predicted LOAEL values (10-fold crossvalidation)](figure/crossvalidation.pdf){} Discussion ========== ### Elena + Benoit ### Summary ======= References ==========