author    Christoph Helma <helma@in-silico.ch>  2017-12-15 12:07:26 +0100
committer Christoph Helma <helma@in-silico.ch>  2017-12-15 12:07:26 +0100
commit    d9e71e134ad35c315cf5df27bf9dca7423c08e85 (patch)
tree      0c91fe78e25b494de877321244597cd838690d16
parent    d5867a7819b47cbed8e820c4d8bfbe0e44fcaf55 (diff)
benoit text integration
-rw-r--r--  loael.Rmd  |  208
-rw-r--r--  loael.md   |  457
-rw-r--r--  loael.pdf  |  bin 348856 -> 356203 bytes
3 files changed, 480 insertions, 185 deletions
diff --git a/loael.Rmd b/loael.Rmd
index 6def111..8916685 100644
--- a/loael.Rmd
+++ b/loael.Rmd
@@ -25,6 +25,11 @@ header-includes:
```{r echo=F}
rsquare <- function(x,y) { cor(x,y,use='complete')^2 }
rmse <- function(x,y) { sqrt(mean((x-y)^2,na.rm=TRUE)) }
+
+m = read.csv("data/mazzatorta_log10.csv",header=T)
+s = read.csv("data/swiss_log10.csv",header=T)
+t = read.csv("data/test_log10.csv",header=T)
+c = read.csv("data/training_log10.csv",header=T)
```
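The two helper statistics defined in the R chunk above (squared Pearson correlation on complete cases, and root mean squared error with NA removal) are used for all correlation tables below. As an illustrative sketch only, not part of the manuscript's R code, the same statistics can be written in Python like this:

```python
import math

def rsquare(x, y):
    """Squared Pearson correlation over complete (non-missing) pairs,
    mirroring R's cor(x, y, use='complete')^2."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    n = len(pairs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in pairs)
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return (cov / (sx * sy)) ** 2

def rmse(x, y):
    """Root mean squared error over complete pairs,
    mirroring sqrt(mean((x - y)^2, na.rm=TRUE))."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))
```

Pairs with a missing value in either vector are dropped before computing either statistic, matching the `use='complete'` / `na.rm=TRUE` options in the R definitions.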
Introduction
@@ -346,7 +351,7 @@ baseline for evaluating prediction performance.
fg = read.csv('data/functional-groups.csv',head=F)
```
-In order to compare the structural diversity of both datasets we have evaluated the
+In order to compare the structural diversity of both datasets we evaluated the
frequency of functional groups from the OpenBabel FP4 fingerprint. [@fig:fg]
shows the frequency of functional groups in both datasets. `r length(fg$V1)`
functional groups with a frequency > 25 are depicted; the complete table for
@@ -366,7 +371,9 @@ used with different kinds of features. We have investigated structural as well
as physico-chemical properties and concluded that both datasets are very
similar, both in terms of chemical structures and physico-chemical properties.
-The only statistically significant difference between both datasets, is that the Nestlé database contains more small compounds (61 structures with less than 11 atoms) than the Swiss dataset (19 small structures, p-value 3.7E-7).
+The only statistically significant difference between both datasets is that
+the Nestlé database contains more small compounds (61 structures with fewer
+than 11 atoms) than the FSVO database (19 small structures, p-value 3.7E-7).
<!--
[@fig:ches-mapper-pc] shows an embedding that is based on physico-chemical (PC)
@@ -393,14 +400,14 @@ MolPrint2D features that are utilized for model building in this work.
### Experimental variability versus prediction uncertainty
Duplicated LOAEL values can be found in both datasets and there is
-a substantial number of `r length(unique(t$SMILES))` compounds occurring in
-both datasets. These duplicates allow us to estimate the variability of
+a substantial number of `r length(unique(t$SMILES))` compounds with more than
+one LOAEL. These chemicals allow us to estimate the variability of
experimental results within individual datasets and between datasets. Data with
*identical* values (at five significant digits) in both datasets were excluded
from variability analysis, because it is likely that they originate from the
same experiments.
-##### Intra dataset variability
+##### Intra database variability
```{r echo=F}
m.dupsmi <- unique(m$SMILES[duplicated(m$SMILES)])
@@ -432,15 +439,14 @@ c.mg = read.csv("data/all_mg_dup.csv",header=T)
c.mg$sd <- ave(c.mg$LOAEL,c.mg$SMILES,FUN=sd)
```
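The `ave(..., FUN=sd)` calls above attach, to every row, the sample standard deviation of all LOAEL values sharing the same SMILES. A minimal Python sketch of that grouping logic (an illustrative helper, not the manuscript's R code; note R's `sd` uses the n-1 denominator):

```python
import statistics
from collections import defaultdict

def sd_by_group(values, groups):
    """For each row, the sample standard deviation (n-1 denominator, as in
    R's sd) of all values in the same group, broadcast back row-wise like
    R's ave(values, groups, FUN=sd)."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    sd = {g: statistics.stdev(vs) if len(vs) > 1 else float("nan")
          for g, vs in by_group.items()}
    return [sd[g] for g in groups]
```

The mean of these per-compound standard deviations (over compounds with multiple measurements) is the intra-database variability reported in the following paragraphs.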
-The Nestlé database has `r length(m$SMILES)` LOAEL values for
-`r length(levels(m$SMILES))` unique structures, `r m.dupnr`
-compounds have multiple measurements with a mean standard deviation (-log10 transformed values) of
-`r round(mean(m.dup$sd),2)`
-(`r round(mean(10^(-1*m.mg$sd)),2)` mg/kg_bw/day,
-`r round(mean(10^(-1*m.dup$sd)),2)` mmol/kg_bw/day)
+The Nestlé database has `r length(m$SMILES)` LOAEL values for `r
+length(levels(m$SMILES))` unique structures; `r m.dupnr` compounds have
+multiple measurements with a mean standard deviation (-log10 transformed
+values) of `r round(mean(m.dup$sd),2)` (`r round(mean(10^(-1*m.mg$sd)),2)`
+mg/kg_bw/day, `r round(mean(10^(-1*m.dup$sd)),2)` mmol/kg_bw/day)
(@mazzatorta08, [@fig:intra]).
-The Swiss Federal Office dataset has `r length(s$SMILES)` rat LOAEL values for
+The FSVO database has `r length(s$SMILES)` rat LOAEL values for
`r length(levels(s$SMILES))` unique structures; `r s.dupnr` compounds have
multiple measurements with a mean standard deviation (-log10 transformed values) of
`r round(mean(s.dup$sd),2)`
@@ -458,9 +464,11 @@ The combined test set has a mean standard deviation (-log10 transformed values)
![Distribution and variability of LOAEL values in both datasets. Each vertical line represents a compound, dots are individual LOAEL values.](figures/dataset-variability.pdf){#fig:intra}
-##### Inter dataset variability
+##### Inter database variability
-[@fig:comp] shows the experimental LOAEL variability of compounds occurring in both datasets (i.e. the *test* dataset) colored in red (experimental). This is the baseline reference for the comparison with predicted values.
+[@fig:comp] shows the experimental LOAEL variability of compounds occurring in
+both datasets (i.e. the *test* dataset) colored in red (experimental). This is
+the baseline reference for the comparison with predicted values.
```{r echo=F}
data <- read.csv("data/median-correlation.csv",header=T)
@@ -470,15 +478,17 @@ median.r.square <- round(rsquare(data$mazzatorta,data$swiss),2)
median.rmse <- round(rmse(data$mazzatorta,data$swiss),2)
```
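Because both databases contain duplicated measurements, the correlation statistics below are computed on one median LOAEL per compound. A sketch of that aggregation step (illustrative Python, not the manuscript's R code):

```python
import statistics
from collections import defaultdict

def median_per_compound(records):
    """Collapse duplicated (SMILES, LOAEL) measurements to a single
    median value per compound before computing correlations."""
    by_smiles = defaultdict(list)
    for smiles, loael in records:
        by_smiles[smiles].append(loael)
    return {s: statistics.median(v) for s, v in by_smiles.items()}
```

As the surrounding text stresses, collapsing duplicates to medians hides part of the experimental variability, so the resulting r² and RMSE describe central tendencies rather than the full spread.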
-[@fig:datacorr] depicts the correlation between LOAEL values from both datasets. As
-both datasets contain duplicates we are using medians for the correlation plot
-and statistics. Please note that the aggregation of duplicated measurements
-into a single median value hides a substantial portion of the experimental
-variability. Correlation analysis shows a significant (p-value < 2.2e-16)
-correlation between the experimental data in both datasets with r\^2:
-`r round(median.r.square,2)`, RMSE: `r round(median.rmse,2)`
+[@fig:datacorr] depicts the correlation between LOAEL values from both
+datasets. As both datasets contain duplicates, medians were used for the
+correlation plot and statistics. It should be kept in mind that the aggregation of duplicated
+measurements into a single median value hides a substantial portion of the
+experimental variability. Correlation analysis shows a significant (p-value < 2.2e-16)
+correlation between the experimental data in both datasets with r\^2: `r
+round(median.r.square,2)`, RMSE: `r round(median.rmse,2)`.
-![Correlation of median LOAEL values from Mazzatorta and Swiss datasets. Data with identical values in both datasets was removed from analysis.](figures/median-correlation.pdf){#fig:datacorr}
+![Correlation of median LOAEL values from Nestlé and FSVO databases. Data with
+ identical values in both databases was removed from
+ analysis.](figures/median-correlation.pdf){#fig:datacorr}
### Local QSAR models
@@ -497,12 +507,14 @@ incorrect_predictions = length(misclassifications$SMILES)
correct_predictions = length(training$SMILES)-incorrect_predictions
```
-In order to compare the performance of in silico read across models with experimental
-variability we are using compounds that occur in both datasets as a test set
-(`r length(t$SMILES)` measurements, `r length(unique(t$SMILES))` compounds).
-`lazar` read across predictions
-were obtained for `r length(unique(t$SMILES))` compounds, `r length(unique(t$SMILES)) - length(training$SMILES)`
-predictions failed, because no similar compounds were found in the training data (i.e. they were not covered by the applicability domain of the training data).
+In order to compare the performance of *in silico* read across models with
+experimental variability we used compounds that occur in both datasets as
+a test set (`r length(t$SMILES)` measurements, `r length(unique(t$SMILES))`
+compounds). `lazar` read across predictions were obtained for `r
+length(unique(t$SMILES))` compounds; `r length(unique(t$SMILES)) - length(training$SMILES)`
+predictions failed because no similar compounds were found in the training
+data (i.e. they were not covered by the applicability domain of the training
+data).
Experimental data and 95\% prediction intervals overlapped in
`r round(100*correct_predictions/length(training$SMILES))`\% of the test examples.
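The overlap statistic above asks, for each predicted compound, whether any experimental value falls inside the 95% prediction interval. A minimal sketch of such a check (hypothetical helper names; intervals assumed to be given as (low, high) bounds, not taken from the `lazar` code):

```python
def interval_covers_any(low, high, measurements):
    """True if at least one experimental measurement lies within the
    [low, high] prediction interval."""
    return any(low <= m <= high for m in measurements)

def overlap_fraction(intervals, measurements_by_compound):
    """Fraction of predicted compounds whose prediction interval
    overlaps with at least one experimental value."""
    hits = sum(interval_covers_any(lo, hi, measurements_by_compound[c])
               for c, (lo, hi) in intervals.items())
    return hits / len(intervals)
```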
@@ -514,9 +526,14 @@ Experimental data and 95\% prediction intervals did not overlap in `r incorrect_
`r length(which(sign(misclassifications$Distance) == -1))` predictions too low (after -log10 transformation).
-->
-[@fig:comp] shows a comparison of predicted with experimental values:
+[@fig:comp] shows a comparison of predicted with experimental values. Most
+predicted values were located within the experimental variability.
+
-![Comparison of experimental with predicted LOAEL values. Each vertical line represents a compound, dots are individual measurements (blue), predictions (green) or predictions far from the applicability domain, i.e. with warnings (red).](figures/test-prediction.pdf){#fig:comp}
+![Comparison of experimental with predicted LOAEL values. Each vertical line
+represents a compound, dots are individual measurements (blue), predictions
+(green) or predictions far from the applicability domain, i.e. with warnings
+(red).](figures/test-prediction.pdf){#fig:comp}
Correlation analysis was performed between individual predictions and the
median of experimental data. All correlations are statistically highly
@@ -526,14 +543,17 @@ multiple measurements into a single median value hides experimental variability.
Comparison | $r^2$ | RMSE | Nr. predicted
--------------|---------------------------|---------|---------------
-Mazzatorta vs. Swiss dataset | `r median.r.square` | `r median.rmse`
+Nestlé vs. FSVO database | `r median.r.square` | `r median.rmse`
AD close predictions vs. test median | `r nowarnings.r_square` | `r nowarnings.rmse` | `r length(nowarnings$LOAEL_predicted)`/`r length(unique(t$SMILES))`
AD distant predictions vs. test median | `r warnings.r_square` | `r warnings.rmse` | `r length(warnings$LOAEL_predicted)`/`r length(unique(t$SMILES))`
All predictions vs. test median | `r training.r_square` | `r training.rmse` | `r length(training$LOAEL_predicted)`/`r length(unique(t$SMILES))`
: Comparison of model predictions with experimental variability. {#tbl:common-pred}
-![Correlation of experimental with predicted LOAEL values (test set). Green dots indicate predictions close to the applicability domain (i.e. without warnings), red dots indicate predictions far from the applicability domain (i.e. with warnings).](figures/prediction-test-correlation.pdf){#fig:corr}
+![Correlation of experimental with predicted LOAEL values (test set). Green
+dots indicate predictions close to the applicability domain (i.e. without
+warnings), red dots indicate predictions far from the applicability domain
+(i.e. with warnings).](figures/prediction-test-correlation.pdf){#fig:corr}
```{r echo=F}
t0all = read.csv("data/training_log10-cv-0.csv",header=T)
@@ -567,9 +587,11 @@ cv.t2nowarnings.r_square = round(rsquare(t2nowarnings$LOAEL_measured_median,t2no
cv.t2nowarnings.rmse = round(rmse(t2nowarnings$LOAEL_measured_median,t2nowarnings$LOAEL_predicted),2)
```
-For a further assessment of model performance three independent
-10-fold cross-validations were performed. Results are summarised in [@tbl:cv] and [@fig:cv].
-All correlations of predicted with experimental values are statistically highly significant with a p-value < 2.2e-16.
+For a further assessment of model performance three independent 10-fold
+cross-validations were performed. Results are summarised in [@tbl:cv] and
+[@fig:cv]. All correlations of predicted with experimental values are
+statistically highly significant with a p-value < 2.2e-16. This holds for
+compounds both close to and more distant from the applicability domain.
Predictions | $r^2$ | RMSE | Nr. predicted
--|-------|------|----------------
@@ -598,12 +620,115 @@ All | `r round(cv.t2all.r_square,2)` | `r round(cv.t2all.rmse,2)` | `r length(u
![](figures/crossvalidation2.pdf){#fig:cv2 height=30%}
-Correlation of predicted vs. measured values for three independent crossvalidations with *MP2D* fingerprint descriptors and local *random forest* models
+Correlation of predicted vs. measured values for three independent
+cross-validations with MP2D fingerprint descriptors and local random forest
+models.
</div>
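Each of the three independent 10-fold cross-validations above partitions the training compounds into ten folds, predicts each fold from models built on the other nine, and pools the predictions. A generic sketch of the fold bookkeeping (illustrative only, not the `lazar` implementation):

```python
import random

def ten_fold_splits(items, seed):
    """Shuffle item indices and yield (train, test) index lists for one
    10-fold cross-validation run; different seeds give independent runs."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]
    for k in range(10):
        test = folds[k]
        train = [i for i in idx if i not in set(test)]
        yield train, test
```

Every item appears in exactly one test fold per run, so pooling the ten test folds reconstructs one prediction per item.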
Discussion
==========
+It is currently acknowledged that there is a strong need for
+toxicological information on the many thousands of chemicals to which
+humans may be exposed through food. These include, for example, many
+chemicals in commerce, which could potentially find their way into food
+(Stanton and Kruszewski, 2016; Fowler et al., 2011), but also substances
+migrating from food contact materials (Grob et al., 2006), chemicals
+generated during food processing (Cottererill et al., 2008),
+environmental contaminants as well as inherent plant toxicants (Schilter
+et al., 2014b). For the vast majority of these chemicals, no
+toxicological data is available and consequently insight into their
+potential health risks is very difficult to obtain. It is recognized
+that testing all of them in standard animal studies is neither feasible
+from a resource perspective nor desirable because of the ethical issues
+associated with animal experimentation. In addition, for many of these
+chemicals, the risk may be very low and testing may therefore be
+irrelevant. In this context, identifying the chemicals of most concern,
+on which the limited resources available should be focused, is
+essential, and computational toxicology is thought to play an important
+role in this task.
+
+In order to establish the level of safety concern of food chemicals not
+yet characterized toxicologically, a methodology mimicking the process
+of chemical risk assessment, and supported by computational toxicology,
+was proposed (Schilter et al., 2014a). It is based on the calculation of
+margins of exposure (MoE) between predicted values of toxicity and
+exposure estimates. The level of safety concern of a chemical is then
+determined by the size of the MoE and its suitability to cover the
+uncertainties of the assessment. To be applicable, such an approach
+requires quantitative predictions of toxicological endpoints relevant
+for risk assessment. The present work focuses on the prediction of
+chronic toxicity, a major and often pivotal endpoint of the
+toxicological databases used for hazard identification and
+characterization of food chemicals.
+
+In a previous study, automated read-across like models for predicting
+carcinogenic potency were developed. In these models, substances in the
+training dataset similar to the query compounds are automatically
+identified and used to derive a quantitative TD50 value. The errors
+observed in these models were within the published estimation of
+experimental variability (Lo Piparo et al., 2014). In the present
+study, a similar approach was applied to build models generating
+quantitative predictions of long-term toxicity. Two databases compiling
+chronic oral rat lowest adverse effect levels (LOAEL) as endpoint were
+available from different sources. Our investigations clearly indicated
+that the Nestlé and FSVO databases are very similar in terms of chemical
+structures and properties as well as the distribution of experimental
+LOAEL values. The only significant difference we observed was that the
+Nestlé database contains a larger number of small molecules than the
+FSVO database. For this reason we pooled both databases into a single
+training dataset for read across predictions.
+
+An early review of the databases revealed that 155 out of the 671
+chemicals available in the training datasets had at least two
+independent studies/LOAELs. These studies were exploited to generate
+information on the reproducibility of chronic animal studies and were
+used to evaluate the prediction performance of the models in the context
+of experimental variability. Considerable variability in the
+experimental data was observed. Differences in study design, including
+dose selection, dose spacing and route of administration, are likely
+explanations of this experimental variability. High experimental
+variability has an impact on model building and on model validation.
+First, it influences model quality by introducing noise into the
+training data; second, it influences accuracy estimates, because
+predictions have to be compared against noisy data where "true"
+experimental values are unknown. This will become obvious in the next
+section, where the comparison of predictions with experimental data is
+discussed.
+
+The data obtained in the present study indicate that `lazar` generates
+reliable predictions for compounds within the applicability domain of
+the training data (i.e. predictions without warnings, which indicates a
+sufficient number of neighbors with similarity > 0.5 to create local
+random forest models). Correlation analysis shows that errors
+($\text{RMSE}$) and explained variance ($r^{2}$) are comparable to the
+experimental variability of the training data.
+
+Predictions with a warning (neighbor similarity < 0.5 and > 0.2, or
+weighted average predictions) are more uncertain. They still show a
+strong correlation with experimental data, but the errors are larger
+than for compounds within the applicability domain. Expected errors are
+displayed as 95% prediction intervals, which cover 100% of the
+experimental data. The main advantage of lowering the similarity
+threshold is that it makes it possible to predict a much larger number
+of substances than with more rigorous applicability domain criteria. As
+each of these predictions could be problematic, they are flagged with a
+warning to alert risk assessors that further inspection is required.
+This can be done in the graphical interface
+(<https://lazar.in-silico.ch>), which provides intuitive means of
+inspecting the rationales and data used for read across predictions.
+
+Finally, there is a substantial number of chemicals (37) for which no
+predictions can be made, because no similar compounds are available in
+the training data. These compounds clearly fall beyond the
+applicability domain of the training dataset, and in such cases
+predictions should not be used. In order to expand the applicability
+domain, the possibility of designing models based on shorter,
+less-than-chronic studies should be explored. It is likely that more
+substances, reflecting a wider chemical domain, would be available for
+such endpoints. Predicting such shorter-duration endpoints would also
+be valuable for chronic toxicity, since evidence suggests that exposure
+duration has little impact on the levels of NOAELs/LOAELs (Zarn et al.,
+2011, 2013).
+
+<!--
Elena + Benoit
### Dataset comparison
@@ -646,6 +771,7 @@ Finally there is a substantial number of compounds
(`r length(unique(t$SMILES))-length(training$LOAEL_predicted)`),
where no predictions can be made, because there are no similar compounds in the training data. These compounds clearly fall beyond the applicability domain of the training dataset
and in such cases it is preferable to avoid predictions instead of random guessing.
+-->
Elena: Should we add a GUI screenshot?
@@ -690,10 +816,18 @@ with an experimental median of `r med` and a prediction interval of `r pred` +/-
Summary
=======
+In conclusion, we could
+demonstrate that `lazar` predictions within the applicability domain of
+the training data have the same variability as the experimental training
+data. In such cases experimental investigations can be substituted with
+*in silico* predictions. Predictions with a lower similarity threshold can
+still give usable results, but the errors to be expected are higher and
+a manual inspection of prediction results is highly recommended.
+
+<!--
We could demonstrate that `lazar` predictions within the applicability domain of the training data have the same variability as the experimental training data. In such cases experimental investigations can be substituted with in silico predictions.
Predictions with a lower similarity threshold can still give usable results, but the errors to be expected are higher and a manual inspection of prediction results is highly recommended.
-<!--
- beware of over-optimisations and the race for "better" validation results
- reproducible research
-->
diff --git a/loael.md b/loael.md
index c2bfb5b..60ec21d 100644
--- a/loael.md
+++ b/loael.md
@@ -1,9 +1,9 @@
---
author: |
- Christoph Helma^1^, David Vorgrimmler^1^, Denis Gebele^1^, Martin Gütlein^2^, Benoit Schilter^3^, Elena Lo Piparo^3^
+ Christoph Helma^1^, David Vorgrimmler^1^, Denis Gebele^1^, Martin Gütlein^2^, Benoit Schilter^3^, Elena Lo Piparo^3^
title: |
Modeling Chronic Toxicity: A comparison of experimental variability with read across predictions
-include-before: ^1^ in silico toxicology gmbh, Basel, Switzerland\newline^2^ Inst. f. Computer Science, Johannes Gutenberg Universität Mainz, Germany\newline^3^ Chemical Food Safety Group, Nestlé Research Center, Lausanne, Switzerland
+include-before: ^1^ in silico toxicology gmbh, Basel, Switzerland\newline^2^ Inst. f. Computer Science, Johannes Gutenberg Universität Mainz, Germany\newline^3^ Chemical Food Safety Group, Nestlé Research Center, Lausanne, Switzerland
keywords: (Q)SAR, read-across, LOAEL
date: \today
abstract: " "
@@ -27,58 +27,82 @@ header-includes:
Introduction
============
-Elena + Benoit
-
-The quality and reproducibility of (Q)SAR and read-across predictions is a
-controversial topic in the toxicological risk-assessment community. Although
-model predictions can be validated with various procedures it is rarely
-possible to put the results into the context of experimental variability,
-because replicate experiments are usually not available.
-
-With missing information about the variability of experimental toxicity data it
-is hard to judge the performance of predictive models objectively and it is tempting for
-model developers to use aggressive model optimisation methods that lead to
-impressive validation results, but also to overfitted models with little
-practical relevance.
-
-In this study we intent to compare model predictions with experimental
-variability with chronic oral rat lowest adverse effect levels (LOAEL) as
-toxicity endpoint. We are using two datasets, one from [@mazzatorta08]
-(*Mazzatorta* dataset) and one from the Swiss Federal Office of TODO (*Swiss
-Federal Office* dataset).
-
-Elena: do you have a reference and the name of the department?
-
-
-
-155 compounds are common in both datasets and we use
-them as a *test* set in our investigation. For the Mazzatorta and Swiss Federal Office datasets we will
-
-- compare the structural diversity of both datasets
-- compare the LOAEL values in both datasets
-- build prediction models
-- predict LOAELs of the test set
-- compare predictions with experimental variability
-
-With this investigation we also want to support the idea of reproducible
-research, by providing all datasets and programs that have been used to
-generate this manuscript under
-GPL3 licenses.
-
-A self-contained docker image with all programs, libraries and data required for the
-reproduction of these results is available from <https://hub.docker.com/r/insilicotox/loael-paper/>.
+Relying on standard animal toxicological testing for chemical hazard
+identification and characterization is increasingly questioned on both
+scientific and ethical grounds. In addition, it appears obvious that
+from a resource perspective, the capacity of standard toxicology to
+address the safety of thousands of untested chemicals (Fowler et al.,
+2011) to which humans may be exposed is very limited. It has also been
+recognized that getting rapid insight into the toxicity of chemicals in
+case of emergency safety incidents, or for early prioritization in
+research and development (safety by design), is a big challenge, mainly
+because of the time and cost constraints associated with the generation
+of relevant animal data. In this context, alternative approaches to
+obtain timely and fit-for-purpose toxicological information are being
+developed. Amongst others, non-testing, structure-activity based *in
+silico* toxicology methods (also called computational toxicology) are
+considered highly promising. Importantly, they are raising more and
+more interest and gaining increased acceptance in various regulatory
+(e.g. ECHA, 2008; EFSA, 2016, 2014; Health Canada, 2016; OECD, 2015)
+and industrial (e.g. Stanton and Kruszewski, 2016; Lo Piparo et al.,
+2011) frameworks.
+
+For a long time already, computational methods have been an integral
+part of pharmaceutical discovery pipelines, while in chemical food
+safety their actual potentials emerged only recently (Lo Piparo et al.,
+2011). In this latter field, an application considered critical is the
+establishment of levels of safety concern in order to rapidly and
+efficiently manage toxicologically uncharacterized chemicals identified
+in food. This requires a risk-based approach to benchmark exposure with
+a quantitative value of toxicity relevant for risk assessment (Schilter
+et al., 2014a). Since most of the time chemical food safety deals with
+life-long exposures to relatively low levels of chemicals, and because
+long-term toxicity studies are often the most sensitive in food
+toxicology databases, predicting chronic toxicity is of prime
+importance. Up to now, read across and quantitative structure-activity
+relationship (QSAR) have been the most used *in silico* approaches to
+obtain quantitative predictions of chronic toxicity.
+
+The quality and reproducibility of (Q)SAR and read-across predictions
+has been a continuous and controversial topic in the toxicological
+risk-assessment community. Although model predictions can be validated
+with various procedures, putting the results into the context of
+experimental variability has rarely been done or attempted. With missing
+information about the variability of experimental toxicity data it is
+hard to judge the performance of predictive models objectively and it is
+tempting for model developers to use aggressive model optimisation
+methods that lead to impressive validation results, but also to
+overfitted models with little practical relevance.
+
+In the present study, automatic read-across like models were built to
+generate quantitative predictions of long-term toxicity. Two databases
+compiling chronic oral rat lowest adverse effect levels (LOAEL) as
+endpoint were used. An early review of the databases revealed that many
+chemicals had at least two independent studies/LOAELs. These studies
+were exploited to generate information on the reproducibility of chronic
+animal studies and were used to evaluate prediction performance of the
+models in the context of experimental variability.
+
+An important limitation often raised for computational toxicology is the
+lack of transparency on published models and consequently on the
+difficulty for the scientific community to reproduce and apply them. To
+overcome these issues, all databases and programs that have been used to
+generate this manuscript are made available under GPL3 licenses.
+
+A self-contained docker image with all programs, libraries and data
+required for the reproduction of these results is available from
+<https://hub.docker.com/r/insilicotox/loael-paper/>.
Source code and datasets for the reproduction of this manuscript can be
-downloaded from the GitHub repository <https://github.com/opentox/loael-paper>. The lazar framework [@Maunz2013] is
-also available under a GPL3 License from <https://github.com/opentox/lazar>.
+downloaded from the GitHub repository
+<https://github.com/opentox/loael-paper>. The lazar framework [@Maunz2013]
+is also available under a GPL3 License from
+<https://github.com/opentox/lazar>.
A graphical webinterface for `lazar` model predictions and validation results
is publicly accessible at <https://lazar.in-silico.ch>, models presented in
this manuscript will be included in future versions. Source code for the GUI
can be obtained from <https://github.com/opentox/lazar-gui>.
-Elena: please check if this is publication strategy is ok for the Swiss Federal Office
-
Materials and Methods
=====================
@@ -89,35 +113,39 @@ and datasets, links to source code and data sources are included in the text.
Datasets
--------
-### Mazzatorta dataset
+### Nestlé database
-The first dataset (*Mazzatorta* dataset for further reference) originates from
-the publication of [@mazzatorta08]. It contains chronic (> 180 days) lowest
+The first database (Nestlé database for further reference) originates
+from the publication of [@mazzatorta08]. It contains chronic (> 180 days) lowest
observed effect levels (LOAEL) for rats (*Rattus norvegicus*) after oral
-(gavage, diet, drinking water) administration. The Mazzatorta dataset consists
+(gavage, diet, drinking water) administration. The Nestlé database consists
of 567 LOAEL values for 445 unique
chemical structures.
-The Mazzatorta dataset can be obtained from the following GitHub links: [original data](https://github.com/opentox/loael-paper/blob/submission/data/LOAEL_mg_corrected_smiles_mmol.csv),
+The Nestlé database can be obtained from the following GitHub links: [original data](https://github.com/opentox/loael-paper/blob/submission/data/LOAEL_mg_corrected_smiles_mmol.csv),
[unique smiles](https://github.com/opentox/loael-paper/blob/submission/data/mazzatorta.csv),
[-log10 transformed LOAEL](https://github.com/opentox/loael-paper/blob/submission/data/mazzatorta_log10.csv).
-### Swiss Federal Office dataset
+### Swiss Food Safety and Veterinary Office (FSVO) database
-Elena + Swiss Federal Office contribution (input)
+Publicly available data from pesticide evaluations of chronic rat
+toxicity studies from the European Food Safety Authority (EFSA) (EFSA,
+2014), the Joint FAO/WHO Meeting on Pesticide Residues (JMPR) (WHO,
+2011) and the US EPA (US EPA, 2011) were compiled to form the
+FSVO-database. Only studies providing both an experimental NOAEL and an
+experimental LOAEL were included. The LOAELs were taken as they were
+reported in the evaluations. Further details on the database are
+described elsewhere (Zarn et al., 2011; Zarn et al., 2013). The
+FSVO-database consists of 493 rat LOAEL values for 381 unique chemical
+structures. It can be obtained from the following GitHub links:
-The original Swiss Federal Office dataset has chronic toxicity data for rats,
-mice and multi generation effects. For the purpose of this study only rat LOAEL
-data with oral administration was used. This leads to the *Swiss Federal
-Office* dataset with 493 rat LOAEL values for
-381 unique chemical structures.
-The Swiss dataset can be obtained from the following GitHub links: [original data](https://github.com/opentox/loael-paper/blob/submission/data/NOAEL-LOAEL_SMILES_rat_chron.csv),
+[original data](https://github.com/opentox/loael-paper/blob/submission/data/NOAEL-LOAEL_SMILES_rat_chron.csv),
[unique smiles and mmol/kg_bw/day units](https://github.com/opentox/loael-paper/blob/submission/data/swiss.csv),
[-log10 transformed LOAEL](https://github.com/opentox/loael-paper/blob/submission/data/swiss_log10.csv).
### Preprocessing
Chemical structures (represented as SMILES [@doi:10.1021/ci00057a005]) in both
-datasets were checked for correctness. Syntactically incorrect and missing
+datasets were checked for correctness. Syntactically incorrect or missing
SMILES were generated from other identifiers (e.g. names, CAS numbers). Unique
SMILES from the OpenBabel library [@OBoyle2011] were used for the
identification of duplicated structures.
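The LOAEL files used throughout (e.g. `mazzatorta_log10.csv`) store doses as -log10 of the value in mmol/kg_bw/day, which is why the variability figures above report both mg and mmol units. A hedged sketch of that transformation (the mg-to-mmol conversion via molecular weight is an assumed convention inferred from the file names and units, not stated code):

```python
import math

def loael_to_neg_log10(dose_mg_per_kg, molecular_weight):
    """Convert a LOAEL in mg/kg_bw/day to the -log10(mmol/kg_bw/day)
    scale (assumed convention for the *_log10.csv files)."""
    dose_mmol = dose_mg_per_kg / molecular_weight  # mg -> mmol via g/mol
    return -math.log10(dose_mmol)
```

On this scale, higher values correspond to lower (more potent) effect levels; a dose of exactly 1 mmol/kg_bw/day maps to 0.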
@@ -129,26 +157,25 @@ significant digits. For prediction, validation and visualisation purposes
### Derived datasets
-Two derived datasets were obtained from the original datasets:
+Two derived datasets were obtained from the original databases:
-The [*test* dataset](https://github.com/opentox/loael-paper/blob/submission/data/test_log10.csv)
-contains data from compounds that occur in both datasets.
-LOAEL values equal at five significant digits were considered as duplicates
-originating from the same study/publication and only one instance was kept in
-the test dataset. The test dataset has
-375 LOAEL values for 155 unique
-chemical structures and was used for
+The [*test*
+dataset](https://github.com/opentox/loael-paper/blob/submission/data/test_log10.csv)
+contains data from compounds that occur in both databases. LOAEL values equal
+at five significant digits were considered as duplicates originating from the
+same study/publication and only one instance was kept in the test dataset. The
+test dataset has 375 LOAEL values for `r length(unique(t$SMILES))` unique
+chemical structures and was used for
- evaluating experimental variability
-- comparing model predictions with experimental variaility.
-
-The [*training* dataset](https://github.com/opentox/loael-paper/blob/submission/data/training_log10.csv)
-is the union of the Mazzatorta and the Swiss Federal
-Office dataset and it is used to build predictive models. LOAEL duplicates were
-removed using the same criteria as for the test dataset. The
-training dataset
-has 998 LOAEL values for 671 unique
-chemical structures.
+- comparing model predictions with experimental variability.
+
+The [*training*
+dataset](https://github.com/opentox/loael-paper/blob/submission/data/training_log10.csv)
+is the union of the Nestlé and the FSVO databases and it was used to build
+predictive models. LOAEL duplicates were removed using the same criteria as for
+the test dataset. The training dataset has 998 LOAEL values
+for 671 unique chemical structures.
Algorithms
----------
@@ -210,7 +237,11 @@ threshold) and the number of predictable compounds (low threshold). As it is in
many practical cases desirable to make predictions even in the absence of
closely related neighbors, we follow a tiered approach:
-First a similarity threshold of 0.5 is used to collect neighbors, to create a local QSAR model and to make a prediction for the query compound. If any of this steps fail, the procedure is repeated with a similarity threshold of 0.2 and the prediction is flagged with a warning that it might be out of the applicability domain of the training data.
+First a similarity threshold of 0.5 is used to collect neighbors, to create
+a local QSAR model and to make a prediction for the query compound. If any of
+these steps fails, the procedure is repeated with a similarity threshold of 0.2
+and the prediction is flagged with a warning that it might be out of the
+applicability domain of the training data.
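The tiered procedure can be sketched in Ruby (lazar's implementation language). The names and data layout below are illustrative assumptions, not the actual lazar API; similarity is shown as a Tanimoto coefficient on fingerprint sets, and a similarity-weighted mean stands in for the local QSAR model:

```ruby
# Illustrative sketch of the tiered neighbor selection (assumed names, not
# the lazar API). Similarity is a Tanimoto coefficient on fingerprint sets;
# a similarity-weighted mean stands in for the local QSAR model.

def tanimoto(fp_a, fp_b)
  common = (fp_a & fp_b).size.to_f
  common / (fp_a.size + fp_b.size - common)
end

def weighted_mean(neighbors)
  neighbors.sum { |n| n[:sim] * n[:loael] } / neighbors.sum { |n| n[:sim] }
end

def predict_tiered(query_fp, training)
  [0.5, 0.2].each do |threshold| # first tier 0.5, fallback tier 0.2
    neighbors = training
      .map { |c| { sim: tanimoto(query_fp, c[:fp]), loael: c[:loael] } }
      .select { |n| n[:sim] >= threshold }
    next if neighbors.empty?
    return { value: weighted_mean(neighbors),
             # fallback tier predictions are flagged as possibly outside the AD
             warning: threshold < 0.5 ? "outside applicability domain?" : nil }
  end
  nil # no neighbors even at 0.2: no prediction is made
end
```

If the first tier yields no neighbors, the second tier reuses the same machinery with the relaxed threshold and attaches the warning described above.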
Compounds with the same structure as the query structure are automatically
[eliminated from neighbors](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/model.rb#L180-L257)
@@ -220,27 +251,27 @@ duplicates.
### Local QSAR models and predictions
Only similar compounds (*neighbors*) above the threshold are used for local
-QSAR models. In this investigation we are using
-[weighted random forests regression (RF)](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/caret.rb#L7-L78)
-for the prediction of quantitative
-properties. First all uninformative fingerprints (i.e. features with identical
-values across all neighbors) are removed. The remaining set of features is
-used as descriptors for creating a local weighted RF model with atom
-environments as descriptors and model similarities as weights. The RF method
-from the `caret` R package [@Kuhn08] is used for this purpose. Models are
-trained with the default `caret` settings, optimizing the number of RF
-components by bootstrap resampling.
-
-Finally the local RF model is applied to
-[predict the activity](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/model.rb#L194-L272)
-of the query
-compound. The RMSE of bootstrapped local model predictions is used to construct 95\%
-prediction intervals at 1.96*RMSE.
-
-If RF modelling or prediction fails, the program resorts to using the
-[weighted mean](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/regression.rb#L6-L16)
+QSAR models. In this investigation we are using [weighted random forests
+regression
+(RF)](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/caret.rb#L7-L78)
+for the prediction of quantitative properties. First all uninformative
+fingerprints (i.e. features with identical values across all neighbors) are
+removed. The remaining set of features is used as descriptors for creating
+a local weighted RF model with atom environments as descriptors and model
+similarities as weights. The RF method from the `caret` R package [@Kuhn08] is
+used for this purpose. Models are trained with the default `caret` settings,
+optimizing the number of RF components by bootstrap resampling.
+
+Finally the local RF model is applied to [predict the
+activity](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/model.rb#L194-L272)
+of the query compound. The RMSE of bootstrapped local model predictions is used
+to construct 95\% prediction intervals at 1.96*RMSE.
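The interval construction amounts to the following sketch (assumed helper names, not lazar's API): the 95% prediction interval is centred on the point prediction and spans 1.96 times the RMSE of the bootstrapped local model predictions.

```ruby
# Sketch of the 95% prediction interval at 1.96 * RMSE (assumed names).

def rmse(predicted, observed)
  Math.sqrt(predicted.zip(observed).sum { |p, o| (p - o)**2 } / predicted.size.to_f)
end

def prediction_interval(point_prediction, bootstrap_predictions, observed)
  margin = 1.96 * rmse(bootstrap_predictions, observed)
  (point_prediction - margin)..(point_prediction + margin)
end
```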
+
+If RF modelling or prediction fails, the program resorts to using the [weighted
+mean](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/regression.rb#L6-L16)
of the neighbors' LOAEL values, where the contribution of each neighbor is
-weighted by its similarity to the query compound. In this case the prediction is also flagged with a warning.
+weighted by its similarity to the query compound. In this case the prediction
+is also flagged with a warning.
### Applicability domain
@@ -262,20 +293,21 @@ interval associated with each prediction.
### Validation
For the comparison of experimental variability with predictive accuracies we
-are using a test set of compounds that occur in both datasets. Unbiased read
-across predictions are obtained from the *training* dataset, by
-[removing *all* information](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/model.rb#L234-L238)
-from the test compound from the training set prior to predictions.
-This procedure is hardcoded into the prediction algorithm in order to prevent
+are using a test set of compounds that occur in both databases. Unbiased read
+across predictions are obtained from the *training* dataset, by [removing *all*
+information](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/model.rb#L234-L238)
+from the test compound from the training set prior to predictions. This
+procedure is hardcoded into the prediction algorithm in order to prevent
validation errors. As we have only a single test set no model or parameter
optimisations were performed in order to avoid overfitting a single dataset.
-Results from 3 repeated
-[10-fold crossvalidations](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/crossvalidation.rb#L85-L93)
-with independent training/test
-set splits are provided as additional information to the test set results.
+Results from 3 repeated [10-fold
+crossvalidations](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/crossvalidation.rb#L85-L93)
+with independent training/test set splits are provided as additional
+information to the test set results.
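A 10-fold split with independent training/test sets can be sketched as follows (illustrative only — the actual implementation is in lazar's crossvalidation.rb linked above; the fixed seed is an assumption for reproducibility of the sketch):

```ruby
# Illustrative 10-fold crossvalidation split with independent
# training/test sets (assumed data layout, not lazar's code).

def ten_fold_splits(compounds, seed: 42)
  shuffled = compounds.shuffle(random: Random.new(seed)) # reproducible shuffle
  folds = shuffled.each_slice((shuffled.size / 10.0).ceil).to_a
  folds.each_index.map do |i|
    { test: folds[i], # each fold serves once as the test set
      training: folds.reject.with_index { |_, j| j == i }.flatten(1) }
  end
end
```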
-The final model for production purposes was trained with all available LOAEL data (Mazzatorta and Swiss Federal Office datasets combined).
+The final model for production purposes was trained with all available LOAEL
+data (Nestlé and FSVO databases combined).
## Availability
@@ -309,7 +341,7 @@ baseline for evaluating prediction performance.
-In order to compare the structural diversity of both datasets we have evaluated the
+In order to compare the structural diversity of both datasets we evaluated the
frequency of functional groups from the OpenBabel FP4 fingerprint. [@fig:fg]
shows the frequency of functional groups in both datasets. 139
functional groups with a frequency > 25 are depicted, the complete table for
@@ -329,7 +361,9 @@ used with different kinds of features. We have investigated structural as well
as physico-chemical properties and concluded that both datasets are very
similar, both in terms of chemical structures and physico-chemical properties.
-The only statistically significant difference between both datasets, is that the Mazzatorta dataset contains more small compounds (61 structures with less than 11 atoms) than the Swiss dataset (19 small structures, p-value 3.7E-7).
+The only statistically significant difference between both datasets is that
+the Nestlé database contains more small compounds (61 structures with fewer
+than 11 atoms) than the FSVO-database (19 small structures, p-value 3.7E-7).
<!--
[@fig:ches-mapper-pc] shows an embedding that is based on physico-chemical (PC)
@@ -342,10 +376,10 @@ Martin: please explain light colors at bottom of histograms
In this example, CheS-Mapper applied a principal components analysis to map
compounds according to their physico-chemical (PC) feature values into 3D
space. Both datasets have in general very similar PC feature values. As an
-exception, the Mazzatorta dataset includes most of the tiny compound
+exception, the Nestlé database includes most of the tiny compound
structures: we have selected the 78 smallest compounds (with 10 atoms and less,
marked with a blue box in the screen-shot) and found that 61 of these compounds
-occur in the Mazzatorta dataset, whereas only 19 are contained in the Swiss
+occur in the Nestlé database, whereas only 19 are contained in the Swiss
dataset (p-value 3.7E-7).
This result was confirmed for structural features (fingerprints) including
@@ -356,26 +390,25 @@ MolPrint2D features that are utilized for model building in this work.
### Experimental variability versus prediction uncertainty
Duplicated LOAEL values can be found in both datasets and there is
-a substantial number of 155 compounds occurring in
-both datasets. These duplicates allow us to estimate the variability of
+a substantial number (155) of compounds with more than
+one LOAEL. These chemicals allow us to estimate the variability of
experimental results within individual datasets and between datasets. Data with
*identical* values (at five significant digits) in both datasets were excluded
from variability analysis, because it is likely that they originate from the
same experiments.
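The five-significant-digit duplicate criterion can be sketched as follows (illustrative helpers, not the actual preprocessing code):

```ruby
# Sketch of the duplicate criterion: two LOAELs are treated as duplicates
# originating from the same study when they agree at five significant digits.

def signif(x, digits = 5)
  return 0.0 if x.zero?
  scale = 10.0**(digits - Math.log10(x.abs).ceil)
  (x * scale).round / scale # round to the requested significant digits
end

def duplicate?(loael_a, loael_b)
  signif(loael_a) == signif(loael_b)
end
```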
-##### Intra dataset variability
+##### Intra database variability
-The Mazzatorta dataset has 567 LOAEL values for
-445 unique structures, 93
-compounds have multiple measurements with a mean standard deviation (-log10 transformed values) of
-0.32
-(0.56 mg/kg_bw/day,
-0.56 mmol/kg_bw/day)
+The Nestlé database has 567 LOAEL values for `r length(levels(m$SMILES))`
+unique structures; 93 compounds have multiple measurements with a mean
+standard deviation (-log10 transformed values) of 0.32 (0.56 mg/kg_bw/day,
+0.56 mmol/kg_bw/day)
(@mazzatorta08, [@fig:intra]).
-The Swiss Federal Office dataset has 493 rat LOAEL values for
+The FSVO database has 493 rat LOAEL values for
381 unique structures; 91 compounds have
multiple measurements with a mean standard deviation (-log10 transformed values) of
0.29
@@ -393,32 +426,38 @@ The combined test set has a mean standard deviation (-log10 transformed values)
![Distribution and variability of LOAEL values in both datasets. Each vertical line represents a compound, dots are individual LOAEL values.](figures/dataset-variability.pdf){#fig:intra}
-##### Inter dataset variability
+##### Inter database variability
-[@fig:comp] shows the experimental LOAEL variability of compounds occurring in both datasets (i.e. the *test* dataset) colored in red (experimental). This is the baseline reference for the comparison with predicted values.
+[@fig:comp] shows the experimental LOAEL variability of compounds occurring in
+both datasets (i.e. the *test* dataset) colored in red (experimental). This is
+the baseline reference for the comparison with predicted values.
-[@fig:datacorr] depicts the correlation between LOAEL values from both datasets. As
-both datasets contain duplicates we are using medians for the correlation plot
-and statistics. Please note that the aggregation of duplicated measurements
-into a single median value hides a substantial portion of the experimental
-variability. Correlation analysis shows a significant (p-value < 2.2e-16)
-correlation between the experimental data in both datasets with r\^2:
-0.52, RMSE: 0.59
+[@fig:datacorr] depicts the correlation between LOAEL values from both
+datasets. As both datasets contain duplicates medians were used for the
+correlation plot and statistics. It should be kept in mind that the aggregation of duplicated
+measurements into a single median value hides a substantial portion of the
+experimental variability. Correlation analysis shows a significant (p-value < 2.2e-16)
+correlation between the experimental data in both datasets with r\^2:
+`r round(median.r.square,2)`, RMSE: 0.59.
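The aggregation step can be sketched as follows (assumed data layout: SMILES mapped to lists of -log10 LOAEL measurements; not the paper's R code):

```ruby
# Collapse duplicated measurements per structure to their median before
# computing correlation statistics (illustrative sketch).

def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

def median_per_compound(measurements) # { smiles => [loael, ...] }
  measurements.transform_values { |loaels| median(loaels) }
end
```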
-![Correlation of median LOAEL values from Mazzatorta and Swiss datasets. Data with identical values in both datasets was removed from analysis.](figures/median-correlation.pdf){#fig:datacorr}
+![Correlation of median LOAEL values from Nestlé and FSVO databases. Data with
+ identical values in both databases was removed from
+ analysis.](figures/median-correlation.pdf){#fig:datacorr}
### Local QSAR models
-In order to compare the performance of in silico read across models with experimental
-variability we are using compounds that occur in both datasets as a test set
-(375 measurements, 155 compounds).
-`lazar` read across predictions
-were obtained for 155 compounds, 37
-predictions failed, because no similar compounds were found in the training data (i.e. they were not covered by the applicability domain of the training data).
+In order to compare the performance of *in silico* read across models with
+experimental variability we are using compounds that occur in both datasets as
+a test set (375 measurements, 155
+compounds). `lazar` read across predictions were obtained for
+`r length(unique(t$SMILES))` compounds; 37 predictions failed because no
+similar compounds were found in the training data (i.e. they were not covered
+by the applicability domain of the training data).
Experimental data and 95\% prediction intervals overlapped in
100\% of the test examples.
@@ -430,9 +469,14 @@ Experimental data and 95\% prediction intervals did not overlap in 0 cases
0 predictions too low (after -log10 transformation).
-->
-[@fig:comp] shows a comparison of predicted with experimental values:
+[@fig:comp] shows a comparison of predicted with experimental values. Most
+predicted values were located within the experimental variability.
+
-![Comparison of experimental with predicted LOAEL values. Each vertical line represents a compound, dots are individual measurements (blue), predictions (green) or predictions far from the applicability domain, i.e. with warnings (red).](figures/test-prediction.pdf){#fig:comp}
+![Comparison of experimental with predicted LOAEL values. Each vertical line
+represents a compound, dots are individual measurements (blue), predictions
+(green) or predictions far from the applicability domain, i.e. with warnings
+(red).](figures/test-prediction.pdf){#fig:comp}
Correlation analysis was performed between individual predictions and the
median of experimental data. All correlations are statistically highly
@@ -442,20 +486,25 @@ multiple measurements into a single median value hides experimental variability.
Comparison | $r^2$ | RMSE | Nr. predicted
--------------|---------------------------|---------|---------------
-Mazzatorta vs. Swiss dataset | 0.52 | 0.59
+Nestlé vs. FSVO database | 0.52 | 0.59
AD close predictions vs. test median | 0.48 | 0.56 | 34/155
AD distant predictions vs. test median | 0.38 | 0.68 | 84/155
All predictions vs. test median | 0.4 | 0.65 | 118/155
: Comparison of model predictions with experimental variability. {#tbl:common-pred}
-![Correlation of experimental with predicted LOAEL values (test set). Green dots indicate predictions close to the applicability domain (i.e. without warnings), red dots indicate predictions far from the applicability domain (i.e. with warnings).](figures/prediction-test-correlation.pdf){#fig:corr}
+![Correlation of experimental with predicted LOAEL values (test set). Green
+dots indicate predictions close to the applicability domain (i.e. without
+warnings), red dots indicate predictions far from the applicability domain
+(i.e. with warnings).](figures/prediction-test-correlation.pdf){#fig:corr}
-For a further assessment of model performance three independent
-10-fold cross-validations were performed. Results are summarised in [@tbl:cv] and [@fig:cv].
-All correlations of predicted with experimental values are statistically highly significant with a p-value < 2.2e-16.
+For a further assessment of model performance three independent 10-fold
+cross-validations were performed. Results are summarised in [@tbl:cv] and
+[@fig:cv]. All correlations of predicted with experimental values are
+statistically highly significant with a p-value < 2.2e-16. This is observed
+for compounds both close to and more distant from the applicability domain.
Predictions | $r^2$ | RMSE | Nr. predicted
--|-------|------|----------------
@@ -484,17 +533,120 @@ All | 0.45 | 0.77 | 477/671
![](figures/crossvalidation2.pdf){#fig:cv2 height=30%}
-Correlation of predicted vs. measured values for three independent crossvalidations with *MP2D* fingerprint descriptors and local *random forest* models
+Correlation of predicted vs. measured values for three independent
+crossvalidations with MP2D fingerprint descriptors and local random forest
+models.
</div>
Discussion
==========
+It is currently acknowledged that there is a strong need for
+toxicological information on the many thousands of chemicals to
+which humans may be exposed through food. These include, for example, many
+chemicals in commerce, which could potentially find their way into food
+(Stanton and Kruszewski, 2016; Fowler et al., 2011), but also substances
+migrating from food contact materials (Grob et al., 2006), chemicals
+generated during food processing (Cottererill et al., 2008), environmental
+contaminants as well as inherent plant toxicants (Schilter et al.,
+2014b). For the vast majority of these chemicals no toxicological data
+are available, and consequently insight into their potential health risks
+is very difficult to obtain. It is recognized that testing all of them in
+standard animal studies is neither feasible from a resource perspective
+nor desirable because of the ethical issues associated with animal
+experimentation. In addition, for many of these chemicals the risk may be
+very low, so that testing may actually be irrelevant. In this context,
+identifying the chemicals of most concern, on which the limited resources
+available should be focused, is essential, and computational toxicology is
+expected to play an important role in this task.
+
+In order to establish the level of safety concern of toxicologically
+uncharacterized food chemicals, a methodology mimicking the process
+of chemical risk assessment, and supported by computational toxicology,
+was proposed (Schilter et al., 2014a). It is based on the calculation of
+margins of exposure (MoE) between predicted values of toxicity and
+exposure estimates. The level of safety concern of a chemical is then
+determined by the size of the MoE and its suitability to cover the
+uncertainties of the assessment. To be applicable, such an approach
+requires quantitative predictions of toxicological endpoints relevant
+for risk assessment. The present work focuses on the prediction of chronic
+toxicity, a major and often pivotal endpoint of the toxicological databases
+used for hazard identification and characterization of food chemicals.
+
+In a previous study, automated read-across like models for predicting
+carcinogenic potency were developed. In these models, substances in the
+training dataset similar to the query compounds are automatically
+identified and used to derive a quantitative TD50 value. The errors
+observed in these models were within the published estimation of
+experimental variability (Lo Piparo, et al., 2014). In the present
+study, a similar approach was applied to build models generating
+quantitative predictions of long-term toxicity. Two databases compiling
+chronic oral rat lowest adverse effect levels (LOAEL) as endpoint were
+available from different sources. Our investigations clearly indicated that
+the Nestlé and FSVO databases are very similar in terms of chemical
+structures and properties as well as the distribution of experimental LOAEL
+values. The only significant difference that we observed was that the
+Nestlé database contains a larger number of small molecules than the FSVO
+database. For this reason we pooled both databases into a single training
+dataset for read across predictions.
+
+An early review of the databases revealed that 155 out of the 671
+chemicals available in the training dataset had at least two
+independent studies/LOAELs. These studies were exploited to generate
+information on the reproducibility of chronic animal studies and were
+used to evaluate the prediction performance of the models in the context
+of experimental variability. Considerable variability in the experimental
+data was observed. Study design differences, including dose selection,
+dose spacing and route of administration, are likely explanations of the
+experimental variability. High experimental variability has an impact on
+model building and on model validation. First, it influences model
+quality by introducing noise into the training data; second, it
+influences accuracy estimates because predictions have to be compared
+against noisy data where "true" experimental values are unknown. This
+becomes obvious in the next section, where the comparison of predictions
+with experimental data is discussed.
+
+The data obtained in the present study indicate
+that `lazar` generates reliable predictions for compounds within the
+applicability domain of the training data (i.e. predictions without
+warnings, which indicates a sufficient number of neighbors with
+similarity > 0.5 to create local random forest models). Correlation
+analysis shows that errors ($\text{RMSE}$) and explained variance
+($r^{2}$) are comparable to the experimental variability of the training
+data.
+
+Predictions with a warning (neighbor similarity < 0.5 and > 0.2, or
+weighted average predictions) are more uncertain. They still
+show a strong correlation with experimental data, but the errors are
+larger than for compounds within the applicability domain. Expected
+errors are displayed as 95% prediction intervals, which cover 100% of
+the experimental data. The main advantage of lowering the similarity
+threshold is that it allows a much larger number of substances to be
+predicted than with more rigorous applicability domain criteria. As
+each of these predictions could be problematic, they are flagged with a
+warning to alert risk assessors that further inspection is required.
+This can be done in the graphical interface
+(<https://lazar.in-silico.ch>), which provides intuitive means of
+inspecting the rationales and data used for read across predictions.
+
+Finally, there is a substantial number of chemicals (37) for which no
+predictions can be made, because no similar compounds are available in
+the training data. These compounds clearly fall beyond the
+applicability domain of the training dataset, and in such cases
+predictions should not be used. In order to expand the domain of
+applicability, the possibility of designing models based on shorter,
+less-than-chronic studies should be investigated. It is likely that more
+substances reflecting a wider chemical domain would be available.
+Predicting such shorter duration endpoints would also be valuable for
+chronic toxicity, since evidence suggests that exposure duration has
+little impact on the levels of NOAELs/LOAELs (Zarn et al., 2011, 2013).
+
+<!--
Elena + Benoit
### Dataset comparison
-Our investigations clearly indicate that the Mazzatorta and Swiss Federal Office datasets are very similar in terms of chemical structures and properties and the distribution of experimental LOAEL values. The only significant difference that we have observed was that the Mazzatorta dataset has larger amount of small molecules, than the Swiss Federal Office dataset. For this reason we have pooled both dataset into a single training dataset for read across predictions.
+Our investigations clearly indicate that the Mazzatorta and Swiss Federal Office datasets are very similar in terms of chemical structures and properties and the distribution of experimental LOAEL values. The only significant difference that we have observed was that the Nestlé database has larger amount of small molecules, than the Swiss Federal Office dataset. For this reason we have pooled both dataset into a single training dataset for read across predictions.
[@fig:intra] and [@fig:corr] and [@tbl:common-pred] show however considerable
variability in the experimental data. High experimental variability has an
@@ -532,6 +684,7 @@ Finally there is a substantial number of compounds
(37),
where no predictions can be made, because there are no similar compounds in the training data. These compounds clearly fall beyond the applicability domain of the training dataset
and in such cases it is preferable to avoid predictions instead of random guessing.
+-->
Elena: Should we add a GUI screenshot?
@@ -564,10 +717,18 @@ with an experimental median of 1.91 and a prediction interval of 3.48 +/- 1.57.
Summary
=======
+In conclusion, we demonstrated that `lazar` predictions within the
+applicability domain of
+the training data have the same variability as the experimental training
+data. In such cases experimental investigations can be substituted with
+*in silico* predictions. Predictions with a lower similarity threshold can
+still give usable results, but the errors to be expected are higher and
+a manual inspection of prediction results is highly recommended.
+
+<!--
We could demonstrate that `lazar` predictions within the applicability domain of the training data have the same variability as the experimental training data. In such cases experimental investigations can be substituted with in silico predictions.
Predictions with a lower similarity threshold can still give usable results, but the errors to be expected are higher and a manual inspection of prediction results is highly recommended.
-<!--
- beware of over-optimisations and the race for "better" validation results
- reproducible research
-->
diff --git a/loael.pdf b/loael.pdf
index b58594a..a9cc71e 100644
--- a/loael.pdf
+++ b/loael.pdf
Binary files differ