From 1aa8093ea8f182ec7cc9aae626f494a1e14c8c84 Mon Sep 17 00:00:00 2001 From: Christoph Helma Date: Tue, 13 Mar 2018 15:06:05 +0100 Subject: text revisions --- Makefile | 3 + loael.Rmd | 65 ++++++++++++---------- loael.md | 186 ++++++++++++++++---------------------------------------------- loael.pdf | Bin 683927 -> 684984 bytes loael.tex | 78 +++++++++++++++----------- 5 files changed, 132 insertions(+), 200 deletions(-) diff --git a/Makefile b/Makefile index 4da3cc1..9a154ae 100644 --- a/Makefile +++ b/Makefile @@ -18,6 +18,9 @@ loael.md: loael.Rmd $(figures) $(datasets) $(validations) loael.docx: loael.md pandoc -s --bibliography=references.bibtex --latex-engine=pdflatex --filter pandoc-crossref --filter pandoc-citeproc -o loael.docx loael.md +loael.txt: loael.md + pandoc -s --bibliography=references.bibtex --latex-engine=pdflatex --filter pandoc-crossref --filter pandoc-citeproc -o loael.txt loael.md + # Figures figures/functional-groups.pdf: data/functional-groups-reduced4R.csv diff --git a/loael.Rmd b/loael.Rmd index c39a3f7..190a10f 100644 --- a/loael.Rmd +++ b/loael.Rmd @@ -14,10 +14,11 @@ keywords: (Q)SAR, read-across, LOAEL, experimental variability date: \today abstract: | This study compares the accuracy of (Q)SAR/read-across predictions with the - experimental variability of chronic LOAEL values from *in vivo* experiments. - We could demonstrate that predictions of the `lazar` algrorithm within - the applicability domain of the training data have the same variability as - the experimental training data. Predictions with a lower similarity threshold + experimental variability of chronic lowest-observed-adverse-effect levels + (LOAELs) from *in vivo* experiments. We could demonstrate that predictions of + the lazy structure-activity relationships (`lazar`) algorithm within the + applicability domain of the training data have the same variability as the + experimental training data. Predictions with a lower similarity threshold (i.e. 
a larger distance from the applicability domain) are also significantly better than random guessing, but the errors to be expected are higher and a manual inspection of prediction results is highly recommended. @@ -96,10 +97,12 @@ methods that lead to impressive validation results, but also to overfitted models with little practical relevance. In the present study, automatic read-across like models were built to generate -quantitative predictions of long-term toxicity. Two databases compiling chronic -oral rat Lowest Adverse Effect Levels (LOAEL) as endpoint were used. An early -review of the databases revealed that many chemicals had at least two -independent studies/LOAELs. These studies were exploited to generate +quantitative predictions of long-term toxicity. The aim of the work was not to +predict the nature of the toxicological effects of chemicals, but to obtain +quantitative values which could be compared to exposure. Two databases +compiling chronic oral rat Lowest Adverse Effect Levels (LOAEL) as endpoint +were used. An early review of the databases revealed that many chemicals had at +least two independent studies/LOAELs. These studies were exploited to generate information on the reproducibility of chronic animal studies and were used to evaluate prediction performance of the models in the context of experimental variability. @@ -228,9 +231,7 @@ MolPrint2D fingerprints are generated dynamically from chemical structures and do not rely on predefined lists of fragments (such as OpenBabel FP3, FP4 or MACCs fingerprints or lists of toxocophores/toxicophobes). This has the advantage that they may capture substructures of toxicological relevance that -are not included in other fingerprints. Unpublished experiments have shown -that predictions with MolPrint2D fingerprints are indeed more accurate than -other OpenBabel fingerprints. +are not included in other fingerprints. 
From MolPrint2D fingerprints we can construct a feature vector with all atom environments of a compound, which can be used to calculate chemical @@ -254,6 +255,7 @@ closely related neighbors, we follow a tiered approach: - If any of these steps fails, the procedure is repeated with a similarity threshold of 0.2 and the prediction is flagged with a warning that it might be out of the applicability domain of the training data. +- Similarity thresholds of 0.5 and 0.2 are the default values chosen by the software developers and remained unchanged during the course of these experiments. Compounds with the same structure as the query structure are automatically [eliminated from neighbors](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/model.rb#L180-L257) @@ -276,7 +278,7 @@ optimizing the number of RF components by bootstrap resampling. Finally the local RF model is applied to [predict the activity](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/model.rb#L194-L272) -of the query compound. The RMSE of bootstrapped local model predictions is used +of the query compound. The root-mean-square error (RMSE) of bootstrapped local model predictions is used to construct 95\% prediction intervals at 1.96*RMSE. The width of the prediction interval indicates the expected prediction accuracy. The "true" value of a prediction should be with 95\% probability within the prediction interval. If RF modelling or prediction fails, the program resorts to using the [weighted @@ -624,17 +626,17 @@ limited resource available should focused is essential and computational toxicology is thought to play an important role for that. In order to establish the level of safety concern of food chemicals -toxicologically not characterized, a methodology mimicking the process -of chemical risk assessment, and supported by computational toxicology, -was proposed [@Schilter2014]. 
It is based on the calculation of -margins of exposure (MoE) between predicted values of toxicity and -exposure estimates. The level of safety concern of a chemical is then +toxicologically not characterized, a methodology mimicking the process of +chemical risk assessment, and supported by computational toxicology, was +proposed [@Schilter2014]. It is based on the calculation of margins of exposure +(MoE), that is, the ratio between the predicted chronic toxicity value (LOAEL) +and the exposure estimate. The level of safety concern of a chemical is then determined by the size of the MoE and its suitability to cover the -uncertainties of the assessment. To be applicable, such an approach -requires quantitative predictions of toxicological endpoints relevant -for risk assessment. The present work focuses on the prediction of chronic -toxicity, a major and often pivotal endpoint of toxicological databases -used for hazard identification and characterization of food chemicals. +uncertainties of the assessment. To be applicable, such an approach requires +quantitative predictions of toxicological endpoints relevant for risk +assessment. The present work focuses on the prediction of chronic toxicity, +a major and often pivotal endpoint of toxicological databases used for hazard +identification and characterization of food chemicals. In a previous study, automated read-across like models for predicting carcinogenic potency were developed. In these models, substances in the @@ -734,13 +736,18 @@ where no predictions can be made, because there are no similar compounds in the Summary ======= -In conclusion, we could -demonstrate that `lazar` predictions within the applicability domain of -the training data have the same variability as the experimental training -data. In such cases experimental investigations can be substituted with -*in silico* predictions. 
Predictions with a lower similarity threshold can -still give usable results, but the errors to be expected are higher and -a manual inspection of prediction results is highly recommended. +In conclusion, we could demonstrate that `lazar` predictions within the +applicability domain of the training data have the same variability as the +experimental training data. In such cases experimental investigations can be +substituted with *in silico* predictions. Predictions with a lower similarity +threshold can still give usable results, but the errors to be expected are +higher and a manual inspection of prediction results is highly recommended. +In any case, our suggested workflow always includes visual inspection of the +chemical structures of the neighbors selected by the model. This either +strengthens confidence in the prediction (if the input structure looks very +similar to the neighbors selected to build the model) or leads to the +conclusion that read-across with the most similar compound of the database +should be used instead (in case the database does not contain enough similar +compounds to build a model). References ========== diff --git a/loael.md b/loael.md index 8d68575..0b22ee9 100644 --- a/loael.md +++ b/loael.md @@ -14,10 +14,11 @@ keywords: (Q)SAR, read-across, LOAEL, experimental variability date: \today abstract: | This study compares the accuracy of (Q)SAR/read-across predictions with the - experimental variability of chronic LOAEL values from *in vivo* experiments. - We could demonstrate that predictions of the `lazar` algrorithm within - the applicability domain of the training data have the same variability as - the experimental training data. Predictions with a lower similarity threshold + experimental variability of chronic lowest-observed-adverse-effect levels + (LOAELs) from *in vivo* experiments. 
We could demonstrate that predictions of + the lazy structure-activity relationships (`lazar`) algorithm within the + applicability domain of the training data have the same variability as the + experimental training data. Predictions with a lower similarity threshold (i.e. a larger distance from the applicability domain) are also significantly better than random guessing, but the errors to be expected are higher and a manual inspection of prediction results is highly recommended. @@ -87,44 +88,27 @@ tempting for model developers to use aggressive model optimisation methods that lead to impressive validation results, but also to overfitted models with little practical relevance. -In the present study, automatic read-across like models were built to -generate quantitative predictions of long-term toxicity. Two databases -compiling chronic oral rat Lowest Adverse Effect Levels (LOAEL) as -endpoint were used. An early review of the databases revealed that many -chemicals had at least two independent studies/LOAELs. These studies -were exploited to generate information on the reproducibility of chronic -animal studies and were used to evaluate prediction performance of the -models in the context of experimental variability. +In the present study, automatic read-across like models were built to generate +quantitative predictions of long-term toxicity. The aim of the work was not to +predict the nature of the toxicological effects of chemicals, but to obtain +quantitative values which could be compared to exposure. Two databases +compiling chronic oral rat Lowest Adverse Effect Levels (LOAEL) as endpoint +were used. An early review of the databases revealed that many chemicals had at +least two independent studies/LOAELs. These studies were exploited to generate +information on the reproducibility of chronic animal studies and were used to +evaluate prediction performance of the models in the context of experimental +variability. 
An important limitation often raised for computational toxicology is the lack of transparency on published models and consequently on the difficulty for the scientific community to reproduce and apply them. To overcome these issues, -source code for all programs and libraries and the data that have been used to generate this -manuscript are made available under GPL3 licenses. Data and compiled -programs with all dependencies for the reproduction of results in this manuscript are available as -a self-contained docker image. All data, tables and figures in this manuscript -was generated directly from experimental results using the `R` package `knitR`. - - - +source code for all programs and libraries and the data that have been used to +generate this manuscript are made available under GPL3 licenses. Data and +compiled programs with all dependencies for the reproduction of results in this +manuscript are available as a self-contained docker image. All data, tables and +figures in this manuscript were generated directly from experimental results +using the `R` package `knitR`. + Materials and Methods ===================== @@ -239,9 +223,7 @@ MolPrint2D fingerprints are generated dynamically from chemical structures and do not rely on predefined lists of fragments (such as OpenBabel FP3, FP4 or MACCs fingerprints or lists of toxocophores/toxicophobes). This has the advantage that they may capture substructures of toxicological relevance that -are not included in other fingerprints. Unpublished experiments have shown -that predictions with MolPrint2D fingerprints are indeed more accurate than -other OpenBabel fingerprints. +are not included in other fingerprints. 
From MolPrint2D fingerprints we can construct a feature vector with all atom environments of a compound, which can be used to calculate chemical @@ -265,6 +247,7 @@ closely related neighbors, we follow a tiered approach: - If any of these steps fails, the procedure is repeated with a similarity threshold of 0.2 and the prediction is flagged with a warning that it might be out of the applicability domain of the training data. +- Similarity thresholds of 0.5 and 0.2 are the default values chosen by the software developers and remained unchanged during the course of these experiments. Compounds with the same structure as the query structure are automatically [eliminated from neighbors](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/model.rb#L180-L257) @@ -287,7 +270,7 @@ optimizing the number of RF components by bootstrap resampling. Finally the local RF model is applied to [predict the activity](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/model.rb#L194-L272) -of the query compound. The RMSE of bootstrapped local model predictions is used +of the query compound. The root-mean-square error (RMSE) of bootstrapped local model predictions is used to construct 95\% prediction intervals at 1.96*RMSE. The width of the prediction interval indicates the expected prediction accuracy. The "true" value of a prediction should be with 95\% probability within the prediction interval. If RF modelling or prediction fails, the program resorts to using the [weighted @@ -389,28 +372,6 @@ The only statistically significant difference between both databases is that the Nestlé database contains more small compounds (61 structures with less than 11 non-hydrogen atoms) than the FSVO-database (19 small structures, chi-square test: p-value 3.7E-7). - - - ### Experimental variability versus prediction uncertainty Duplicated LOAEL values can be found in both databases and there is @@ -491,17 +452,9 @@ data). 
In 100\% of the test examples experimental LOAEL values were located within the 95\% prediction intervals. - - [@fig:comp] shows a comparison of predicted with experimental values. Most predicted values were located within the experimental variability. - ![Comparison of experimental with predicted LOAEL values. Each vertical line represents a compound, dots are individual measurements (blue), predictions (green) or predictions far from the applicability domain, i.e. with warnings @@ -551,10 +504,6 @@ All | 0.45 | 0.77 | 477/671 : Results from 3 independent 10-fold crossvalidations {#tbl:cv} - -
![](figures/crossvalidation0.pdf){#fig:cv0 height=30%} @@ -590,17 +539,17 @@ limited resource available should focused is essential and computational toxicology is thought to play an important role for that. In order to establish the level of safety concern of food chemicals -toxicologically not characterized, a methodology mimicking the process -of chemical risk assessment, and supported by computational toxicology, -was proposed [@Schilter2014]. It is based on the calculation of -margins of exposure (MoE) between predicted values of toxicity and -exposure estimates. The level of safety concern of a chemical is then +toxicologically not characterized, a methodology mimicking the process of +chemical risk assessment, and supported by computational toxicology, was +proposed [@Schilter2014]. It is based on the calculation of margins of exposure +(MoE), that is, the ratio between the predicted chronic toxicity value (LOAEL) +and the exposure estimate. The level of safety concern of a chemical is then determined by the size of the MoE and its suitability to cover the -uncertainties of the assessment. To be applicable, such an approach -requires quantitative predictions of toxicological endpoints relevant -for risk assessment. The present work focuses on the prediction of chronic -toxicity, a major and often pivotal endpoint of toxicological databases -used for hazard identification and characterization of food chemicals. +uncertainties of the assessment. To be applicable, such an approach requires +quantitative predictions of toxicological endpoints relevant for risk +assessment. The present work focuses on the prediction of chronic toxicity, +a major and often pivotal endpoint of toxicological databases used for hazard +identification and characterization of food chemicals. In a previous study, automated read-across like models for predicting carcinogenic potency were developed. 
In these models, substances in the @@ -668,25 +617,6 @@ shorter duration endpoints would also be valuable for chronic toxicy since evidence suggest that exposure duration has little impact on the levels of NOAELs/LOAELs [@Zarn2011, @Zarn2013]. - - ### `lazar` predictions [@tbl:common-pred], [@tbl:cv], [@fig:comp], [@fig:corr] and [@fig:cv] clearly @@ -716,43 +646,21 @@ Finally there is a substantial number of compounds where no predictions can be made, because there are no similar compounds in the training data. These compounds clearly fall beyond the applicability domain of the training dataset and in such cases it is preferable to avoid predictions instead of random guessing. - - Summary ======= -In conclusion, we could -demonstrate that `lazar` predictions within the applicability domain of -the training data have the same variability as the experimental training -data. In such cases experimental investigations can be substituted with -*in silico* predictions. Predictions with a lower similarity threshold can -still give usable results, but the errors to be expected are higher and -a manual inspection of prediction results is highly recommended. +In conclusion, we could demonstrate that `lazar` predictions within the +applicability domain of the training data have the same variability as the +experimental training data. In such cases experimental investigations can be +substituted with *in silico* predictions. Predictions with a lower similarity +threshold can still give usable results, but the errors to be expected are +higher and a manual inspection of prediction results is highly recommended. +In any case, our suggested workflow always includes visual inspection of the +chemical structures of the neighbors selected by the model. 
This either +strengthens confidence in the prediction (if the input structure looks very similar +to the neighbors selected to build the model) or leads to the conclusion +that read-across with the most similar compound of the database should be +used instead (in case the database does not contain enough similar compounds +to build a model). References ========== diff --git a/loael.pdf b/loael.pdf index 3effcef..3b966b5 100644 Binary files a/loael.pdf and b/loael.pdf differ diff --git a/loael.tex b/loael.tex index f9ab237..19b9895 100644 --- a/loael.tex +++ b/loael.tex @@ -100,14 +100,15 @@ \maketitle \begin{abstract} This study compares the accuracy of (Q)SAR/read-across predictions with -the experimental variability of chronic LOAEL values from \emph{in vivo} -experiments. We could demonstrate that predictions of the \texttt{lazar} -algrorithm within the applicability domain of the training data have the -same variability as the experimental training data. Predictions with a -lower similarity threshold (i.e.~a larger distance from the -applicability domain) are also significantly better than random -guessing, but the errors to be expected are higher and a manual -inspection of prediction results is highly recommended. +the experimental variability of chronic lowest-observed-adverse-effect +levels (LOAELs) from \emph{in vivo} experiments. We could demonstrate +that predictions of the lazy structure-activity relationships +(\texttt{lazar}) algorithm within the applicability domain of the +training data have the same variability as the experimental training +data. Predictions with a lower similarity threshold (i.e.~a larger +distance from the applicability domain) are also significantly better +than random guessing, but the errors to be expected are higher and a +manual inspection of prediction results is highly recommended. 
\end{abstract} \textsuperscript{1} in silico toxicology gmbh, Basel, @@ -166,13 +167,16 @@ methods that lead to impressive validation results, but also to overfitted models with little practical relevance. In the present study, automatic read-across like models were built to -generate quantitative predictions of long-term toxicity. Two databases -compiling chronic oral rat Lowest Adverse Effect Levels (LOAEL) as -endpoint were used. An early review of the databases revealed that many -chemicals had at least two independent studies/LOAELs. These studies -were exploited to generate information on the reproducibility of chronic -animal studies and were used to evaluate prediction performance of the -models in the context of experimental variability. +generate quantitative predictions of long-term toxicity. The aim of the +work was not to predict the nature of the toxicological effects of +chemicals, but to obtain quantitative values which could be compared to +exposure. Two databases compiling chronic oral rat Lowest Adverse Effect +Levels (LOAEL) as endpoint were used. An early review of the databases +revealed that many chemicals had at least two independent +studies/LOAELs. These studies were exploited to generate information on +the reproducibility of chronic animal studies and were used to evaluate +prediction performance of the models in the context of experimental +variability. An important limitation often raised for computational toxicology is the lack of transparency on published models and consequently on the @@ -334,8 +338,6 @@ structures and do not rely on predefined lists of fragments (such as OpenBabel FP3, FP4 or MACCs fingerprints or lists of toxocophores/toxicophobes). This has the advantage that they may capture substructures of toxicological relevance that are not included in other -fingerprints. Unpublished experiments have shown that predictions with -MolPrint2D fingerprints are indeed more accurate than other OpenBabel fingerprints. 
From MolPrint2D fingerprints we can construct a feature vector with all @@ -367,6 +369,10 @@ absence of closely related neighbors, we follow a tiered approach: similarity threshold of 0.2 and the prediction is flagged with a warning that it might be out of the applicability domain of the training data. +\item + Similarity thresholds of 0.5 and 0.2 are the default values chosen by + the software developers and remained unchanged during the course of + these experiments. \end{itemize} Compounds with the same structure as the query structure are @@ -393,11 +399,12 @@ resampling. Finally the local RF model is applied to \href{https://github.com/opentox/lazar/blob/loael-paper.submission/lib/model.rb\#L194-L272}{predict -the activity} of the query compound. The RMSE of bootstrapped local -model predictions is used to construct 95\% prediction intervals at -1.96*RMSE. The width of the prediction interval indicates the expected -prediction accuracy. The ``true'' value of a prediction should be with -95\% probability within the prediction interval. +the activity} of the query compound. The root-mean-square error (RMSE) +of bootstrapped local model predictions is used to construct 95\% +prediction intervals at 1.96*RMSE. The width of the prediction interval +indicates the expected prediction accuracy. The ``true'' value of a +prediction should be with 95\% probability within the prediction +interval. If RF modelling or prediction fails, the program resorts to using the \href{https://github.com/opentox/lazar/blob/loael-paper.submission/lib/regression.rb\#L6-L16}{weighted @@ -724,15 +731,15 @@ In order to establish the level of safety concern of food chemicals toxicologically not characterized, a methodology mimicking the process of chemical risk assessment, and supported by computational toxicology, was proposed (Schilter et al. 2014). It is based on the calculation of -margins of exposure (MoE) between predicted values of toxicity and -exposure estimates. 
The level of safety concern of a chemical is then -determined by the size of the MoE and its suitability to cover the -uncertainties of the assessment. To be applicable, such an approach -requires quantitative predictions of toxicological endpoints relevant -for risk assessment. The present work focuses on the prediction of -chronic toxicity, a major and often pivotal endpoint of toxicological -databases used for hazard identification and characterization of food -chemicals. +margins of exposure (MoE), that is, the ratio between the predicted +chronic toxicity value (LOAEL) and the exposure estimate. The level of +safety concern of a chemical is then determined by the size of the MoE +and its suitability to cover the uncertainties of the assessment. To be +applicable, such an approach requires quantitative predictions of +toxicological endpoints relevant for risk assessment. The present work +focuses on the prediction of chronic toxicity, a major and often pivotal +endpoint of toxicological databases used for hazard identification and +characterization of food chemicals. In a previous study, automated read-across like models for predicting carcinogenic potency were developed. In these models, substances in the @@ -845,7 +852,14 @@ variability as the experimental training data. In such cases experimental investigations can be substituted with \emph{in silico} predictions. Predictions with a lower similarity threshold can still give usable results, but the errors to be expected are higher and a -manual inspection of prediction results is highly recommended. +manual inspection of prediction results is highly recommended. In any +case, our suggested workflow always includes visual inspection of the +chemical structures of the neighbors selected by the model. 
This either +strengthens confidence in the prediction (if the input structure looks +very similar to the neighbors selected to build the model) or leads to +the conclusion that read-across with the most similar +compound of the database should be used instead (in case the database +does not contain enough similar compounds to build a model). \section*{References}\label{references} \addcontentsline{toc}{section}{References} -- cgit v1.2.3
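Note for reviewers: the arithmetic behind the two quantities this patch rephrases — the 95\% prediction interval at 1.96*RMSE and the margin of exposure (MoE) as the ratio of predicted LOAEL to exposure — can be sketched as below. This is an illustrative Ruby sketch only, not code from the `lazar` repository; all function names and numbers are hypothetical.

```ruby
# Illustrative sketch (hypothetical names, not lazar code).

# Root-mean-square error of bootstrapped local model predictions
def rmse(predictions, observations)
  squared_errors = predictions.zip(observations).map { |p, o| (p - o)**2 }
  Math.sqrt(squared_errors.sum / squared_errors.size.to_f)
end

# 95% prediction interval centered on the predicted value,
# with half-width 1.96 * RMSE as described in the manuscript
def prediction_interval(prediction, rmse)
  [prediction - 1.96 * rmse, prediction + 1.96 * rmse]
end

# Margin of exposure: ratio of the predicted chronic toxicity value
# (LOAEL) to the exposure estimate, both in the same units
def margin_of_exposure(predicted_loael, exposure)
  predicted_loael / exposure
end

# Example with made-up numbers
err = rmse([2.1, 3.0, 1.4], [2.0, 3.3, 1.2])
low, high = prediction_interval(2.5, err)
moe = margin_of_exposure(10.0, 0.5) # => 20.0
```

The "true" value is then expected to lie between `low` and `high` with 95\% probability, and the size of `moe` (together with the uncertainties it must cover) determines the level of safety concern.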