From 160d9d4489a4d9ebed72db46c5bbe94f1d8131bc Mon Sep 17 00:00:00 2001 From: Christoph Helma Date: Tue, 2 Jan 2018 16:44:15 +0100 Subject: typos and inconsistencies fixed --- loael.Rmd | 44 +++++++++++++++++--------------- loael.md | 44 +++++++++++++++++--------------- loael.pdf | Bin 428000 -> 433140 bytes loael.tex | 86 ++++++++++++++++++++++++++++++++++++++++++++++---------------- 4 files changed, 112 insertions(+), 62 deletions(-) diff --git a/loael.Rmd b/loael.Rmd index 3c4ae5e..47ff2c6 100644 --- a/loael.Rmd +++ b/loael.Rmd @@ -1,5 +1,5 @@ --- -title: 'Modeling Chronic Toxicity: A comparison of experimental variability with read across predictions' +title: 'Modeling Chronic Toxicity: A comparison of experimental variability with (Q)SAR/read-across predictions' author: - Christoph Helma^1^ - David Vorgrimmler^1^ @@ -13,7 +13,7 @@ date: \today abstract: | This study compares the accuracy of (Q)SAR/read-across predictions with the experimental variability of chronic LOAEL values from *in vivo* experiments. - We could demonstrate that predictions of the `lazar` lazar algrorithm within + We could demonstrate that predictions of the `lazar` algrorithm within the applicability domain of the training data have the same variability as the experimental training data. Predictions with a lower similarity threshold (i.e. a larger distance from the applicability domain) are also significantly @@ -79,8 +79,8 @@ Since most of the time chemical food safety deals with life-long exposures to relatively low levels of chemicals, and because long-term toxicity studies are often the most sensitive in food toxicology databases, predicting chronic toxicity is of prime -importance. Up to now, read across and quantitative structure-activity -relationship (QSAR) have been the most used *in silico* approaches to +importance. 
Up to now, read-across and Quantitative Structure Activity +Relationships (QSAR) have been the most used *in silico* approaches to obtain quantitative predictions of chronic toxicity. The quality and reproducibility of (Q)SAR and read-across predictions @@ -96,7 +96,7 @@ overfitted models with little practical relevance. In the present study, automatic read-across like models were built to generate quantitative predictions of long-term toxicity. Two databases -compiling chronic oral rat lowest adverse effect levels (LOAEL) as +compiling chronic oral rat Lowest Adverse Effect Levels (LOAEL) as endpoint were used. An early review of the databases revealed that many chemicals had at least two independent studies/LOAELs. These studies were exploited to generate information on the reproducibility of chronic @@ -180,8 +180,8 @@ SMILES were generated from other identifiers (e.g names, CAS numbers). Unique smiles from the OpenBabel library [@OBoyle2011] were used for the identification of duplicated structures. -Studies with undefined or empty LOAEL entries were removed from the databases -LOAEL values were converted to mmol/kg_bw/day units and rounded to five +Studies with undefined or empty LOAEL entries were removed from the databases. +LOAEL values were converted to mmol/kg bw/day units and rounded to five significant digits. For prediction, validation and visualisation purposes -log10 transformations are used. @@ -229,7 +229,7 @@ a *k-nearest-neighbor* algorithm. Apart from this basic workflow lazar is completely modular and allows the researcher to use any algorithm for similarity searches and local QSAR -modelling. Within this study we are using the following algorithms: +modelling. Algorithms used within this study are described in the following sections. ### Neighbor identification @@ -245,7 +245,7 @@ of connected atoms. 
MolPrint2D fingerprints are generated dynamically from chemical structures and do not rely on predefined lists of fragments (such as OpenBabel FP3, FP4 or MACCs fingerprints or lists of toxocophores/toxicophobes). This has the -advantage the they may capture substructures of toxicological relevance that +advantage that they may capture substructures of toxicological relevance that are not included in other fingerprints. Unpublished experiments have shown that predictions with MolPrint2D fingerprints are indeed more accurate than other OpenBabel fingerprints. @@ -267,11 +267,11 @@ threshold) and the number of predictable compounds (low threshold). As it is in many practical cases desirable to make predictions even in the absence of closely related neighbors, we follow a tiered approach: -First a similarity threshold of 0.5 is used to collect neighbors, to create -a local QSAR model and to make a prediction for the query compound. If any of -this steps fail, the procedure is repeated with a similarity threshold of 0.2 -and the prediction is flagged with a warning that it might be out of the -applicability domain of the training data. +- First a similarity threshold of 0.5 is used to collect neighbors, to create + a local QSAR model and to make a prediction for the query compound. +- If any of these steps fails, the procedure is repeated with a similarity + threshold of 0.2 and the prediction is flagged with a warning that it might + be out of the applicability domain of the training data. Compounds with the same structure as the query structure are automatically [eliminated from neighbors](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/model.rb#L180-L257) @@ -316,7 +316,7 @@ distant from the applicability domain. Quantitative applicability domain information can be obtained from the similarities of individual neighbors. 
Local regression models consider neighbor similarities to the query compound, -by weighting the contribution of each neighbor is by its similarity. The +by weighting the contribution of each neighbor is by similarity. The variability of local model predictions is reflected in the 95\% prediction interval associated with each prediction. @@ -375,7 +375,7 @@ fg = read.csv('data/functional-groups.csv',head=F) In order to compare the structural diversity of both databases we evaluated the frequency of functional groups from the OpenBabel FP4 fingerprint. [@fig:fg] -shows the frequency of functional groups in both databases `r length(fg$V1)` +shows the frequency of functional groups in both databases. `r length(fg$V1)` functional groups with a frequency > 25 are depicted, the complete table for all functional groups can be found in the supplemental material at [GitHub](https://github.com/opentox/loael-paper/blob/submission/data/functional-groups.csv). @@ -535,7 +535,7 @@ correct_predictions = length(training$SMILES)-incorrect_predictions ``` In order to compare the performance of *in silico* read across models with -experimental variability we are using compounds with multiple measurements as +experimental variability we used compounds with multiple measurements as a test set (`r length(t$SMILES)` measurements, `r length(unique(t$SMILES))` compounds). `lazar` read across predictions were obtained for `r length(unique(t$SMILES))` compounds, `r length(unique(t$SMILES)) - length(training$SMILES)` @@ -543,8 +543,8 @@ predictions failed, because no similar compounds were found in the training data (i.e. they were not covered by the applicability domain of the training data). -Experimental data and 95\% prediction intervals overlapped in -`r round(100*correct_predictions/length(training$SMILES))`\% of the test examples. 
+In `r round(100*correct_predictions/length(training$SMILES))`\% of the test examples +experimental LOAEL values were located within the 95\% prediction intervals. ### `lazar` predictions diff --git a/loael.md b/loael.md index 0a1397b..ec3f743 100644 --- a/loael.md +++ b/loael.md @@ -1,5 +1,5 @@ --- -title: 'Modeling Chronic Toxicity: A comparison of experimental variability with read across predictions' +title: 'Modeling Chronic Toxicity: A comparison of experimental variability with (Q)SAR/read-across predictions' author: - Christoph Helma^1^ - David Vorgrimmler^1^ @@ -13,7 +13,7 @@ date: \today abstract: | This study compares the accuracy of (Q)SAR/read-across predictions with the experimental variability of chronic LOAEL values from *in vivo* experiments. - We could demonstrate that predictions of the `lazar` lazar algrorithm within + We could demonstrate that predictions of the `lazar` algrorithm within the applicability domain of the training data have the same variability as the experimental training data. Predictions with a lower similarity threshold (i.e. a larger distance from the applicability domain) are also significantly @@ -71,8 +71,8 @@ Since most of the time chemical food safety deals with life-long exposures to relatively low levels of chemicals, and because long-term toxicity studies are often the most sensitive in food toxicology databases, predicting chronic toxicity is of prime -importance. Up to now, read across and quantitative structure-activity -relationship (QSAR) have been the most used *in silico* approaches to +importance. Up to now, read-across and Quantitative Structure Activity +Relationships (QSAR) have been the most used *in silico* approaches to obtain quantitative predictions of chronic toxicity. The quality and reproducibility of (Q)SAR and read-across predictions @@ -88,7 +88,7 @@ overfitted models with little practical relevance. 
In the present study, automatic read-across like models were built to generate quantitative predictions of long-term toxicity. Two databases -compiling chronic oral rat lowest adverse effect levels (LOAEL) as +compiling chronic oral rat Lowest Adverse Effect Levels (LOAEL) as endpoint were used. An early review of the databases revealed that many chemicals had at least two independent studies/LOAELs. These studies were exploited to generate information on the reproducibility of chronic @@ -172,8 +172,8 @@ SMILES were generated from other identifiers (e.g names, CAS numbers). Unique smiles from the OpenBabel library [@OBoyle2011] were used for the identification of duplicated structures. -Studies with undefined or empty LOAEL entries were removed from the databases -LOAEL values were converted to mmol/kg_bw/day units and rounded to five +Studies with undefined or empty LOAEL entries were removed from the databases. +LOAEL values were converted to mmol/kg bw/day units and rounded to five significant digits. For prediction, validation and visualisation purposes -log10 transformations are used. @@ -221,7 +221,7 @@ a *k-nearest-neighbor* algorithm. Apart from this basic workflow lazar is completely modular and allows the researcher to use any algorithm for similarity searches and local QSAR -modelling. Within this study we are using the following algorithms: +modelling. Algorithms used within this study are described in the following sections. ### Neighbor identification @@ -237,7 +237,7 @@ of connected atoms. MolPrint2D fingerprints are generated dynamically from chemical structures and do not rely on predefined lists of fragments (such as OpenBabel FP3, FP4 or MACCs fingerprints or lists of toxocophores/toxicophobes). This has the -advantage the they may capture substructures of toxicological relevance that +advantage that they may capture substructures of toxicological relevance that are not included in other fingerprints. 
Unpublished experiments have shown that predictions with MolPrint2D fingerprints are indeed more accurate than other OpenBabel fingerprints. @@ -259,11 +259,11 @@ threshold) and the number of predictable compounds (low threshold). As it is in many practical cases desirable to make predictions even in the absence of closely related neighbors, we follow a tiered approach: -First a similarity threshold of 0.5 is used to collect neighbors, to create -a local QSAR model and to make a prediction for the query compound. If any of -this steps fail, the procedure is repeated with a similarity threshold of 0.2 -and the prediction is flagged with a warning that it might be out of the -applicability domain of the training data. +- First a similarity threshold of 0.5 is used to collect neighbors, to create + a local QSAR model and to make a prediction for the query compound. +- If any of these steps fails, the procedure is repeated with a similarity + threshold of 0.2 and the prediction is flagged with a warning that it might + be out of the applicability domain of the training data. Compounds with the same structure as the query structure are automatically [eliminated from neighbors](https://github.com/opentox/lazar/blob/loael-paper.submission/lib/model.rb#L180-L257) @@ -308,7 +308,7 @@ distant from the applicability domain. Quantitative applicability domain information can be obtained from the similarities of individual neighbors. Local regression models consider neighbor similarities to the query compound, -by weighting the contribution of each neighbor is by its similarity. The +by weighting the contribution of each neighbor is by similarity. The variability of local model predictions is reflected in the 95\% prediction interval associated with each prediction. @@ -365,7 +365,7 @@ baseline for evaluating prediction performance. In order to compare the structural diversity of both databases we evaluated the frequency of functional groups from the OpenBabel FP4 fingerprint. 
[@fig:fg] -shows the frequency of functional groups in both databases 139 +shows the frequency of functional groups in both databases. 139 functional groups with a frequency > 25 are depicted, the complete table for all functional groups can be found in the supplemental material at [GitHub](https://github.com/opentox/loael-paper/blob/submission/data/functional-groups.csv). @@ -478,7 +478,7 @@ correlation between the experimental data in both databases with r\^2: In order to compare the performance of *in silico* read across models with -experimental variability we are using compounds with multiple measurements as +experimental variability we used compounds with multiple measurements as a test set (375 measurements, 155 compounds). `lazar` read across predictions were obtained for 155 compounds, 37 @@ -486,8 +486,8 @@ predictions failed, because no similar compounds were found in the training data (i.e. they were not covered by the applicability domain of the training data). -Experimental data and 95\% prediction intervals overlapped in -100\% of the test examples. +In 100\% of the test examples +experimental LOAEL values were located within the 95\% prediction intervals. 
### `lazar` predictions diff --git a/loael.pdf b/loael.pdf index b1b46fe..f4c1351 100644 Binary files a/loael.pdf and b/loael.pdf differ diff --git a/loael.tex b/loael.tex index b82a370..6b9a2fb 100644 --- a/loael.tex +++ b/loael.tex @@ -24,7 +24,7 @@ \usepackage[unicode=true]{hyperref} \PassOptionsToPackage{usenames,dvipsnames}{color} % color is loaded by hyperref \hypersetup{ - pdftitle={Modeling Chronic Toxicity: A comparison of experimental variability with read across predictions}, + pdftitle={Modeling Chronic Toxicity: A comparison of experimental variability with (Q)SAR/read-across predictions}, pdfauthor={Christoph Helma1; David Vorgrimmler1; Denis Gebele1; Martin Gütlein2; Benoit Schilter3; Elena Lo Piparo3}, pdfkeywords={(Q)SAR, read-across, LOAEL, experimental variability}, colorlinks=true, @@ -92,7 +92,7 @@ \newcommand*\listoflistings{\listof{codelisting}{List of Listings}} \title{Modeling Chronic Toxicity: A comparison of experimental variability with -read across predictions} +(Q)SAR/read-across predictions} \author{Christoph Helma\textsuperscript{1} \and David Vorgrimmler\textsuperscript{1} \and Denis Gebele\textsuperscript{1} \and Martin Gütlein\textsuperscript{2} \and Benoit Schilter\textsuperscript{3} \and Elena Lo Piparo\textsuperscript{3}} \date{\today} @@ -102,9 +102,9 @@ read across predictions} This study compares the accuracy of (Q)SAR/read-across predictions with the experimental variability of chronic LOAEL values from \emph{in vivo} experiments. We could demonstrate that predictions of the \texttt{lazar} -lazar algrorithm within the applicability domain of the training data -have the same variability as the experimental training data. Predictions -with a lower similarity threshold (i.e.~a larger distance from the +algrorithm within the applicability domain of the training data have the +same variability as the experimental training data. 
Predictions with a +lower similarity threshold (i.e.~a larger distance from the applicability domain) are also significantly better than random guessing, but the errors to be expected are higher and a manual inspection of prediction results is highly recommended. @@ -150,8 +150,8 @@ et al. 2014). Since most of the time chemical food safety deals with life-long exposures to relatively low levels of chemicals, and because long-term toxicity studies are often the most sensitive in food toxicology databases, predicting chronic toxicity is of prime -importance. Up to now, read across and quantitative structure-activity -relationship (QSAR) have been the most used \emph{in silico} approaches +importance. Up to now, read-across and Quantitative Structure Activity +Relationships (QSAR) have been the most used \emph{in silico} approaches to obtain quantitative predictions of chronic toxicity. The quality and reproducibility of (Q)SAR and read-across predictions @@ -167,7 +167,7 @@ overfitted models with little practical relevance. In the present study, automatic read-across like models were built to generate quantitative predictions of long-term toxicity. Two databases -compiling chronic oral rat lowest adverse effect levels (LOAEL) as +compiling chronic oral rat Lowest Adverse Effect Levels (LOAEL) as endpoint were used. An early review of the databases revealed that many chemicals had at least two independent studies/LOAELs. These studies were exploited to generate information on the reproducibility of chronic @@ -255,7 +255,7 @@ numbers). Unique smiles from the OpenBabel library (OBoyle et al. 2011) were used for the identification of duplicated structures. Studies with undefined or empty LOAEL entries were removed from the -databases LOAEL values were converted to mmol/kg\_bw/day units and +databases. LOAEL values were converted to mmol/kg bw/day units and rounded to five significant digits. 
For prediction, validation and visualisation purposes -log10 transformations are used. @@ -316,7 +316,8 @@ classified as a \emph{k-nearest-neighbor} algorithm. Apart from this basic workflow lazar is completely modular and allows the researcher to use any algorithm for similarity searches and local -QSAR modelling. Within this study we are using the following algorithms: +QSAR modelling. Algorithms used within this study are described in the +following sections. \subsubsection{Neighbor identification}\label{neighbor-identification} @@ -333,7 +334,7 @@ chemical environment using the atom types of connected atoms. MolPrint2D fingerprints are generated dynamically from chemical structures and do not rely on predefined lists of fragments (such as OpenBabel FP3, FP4 or MACCs fingerprints or lists of -toxocophores/toxicophobes). This has the advantage the they may capture +toxocophores/toxicophobes). This has the advantage that they may capture substructures of toxicological relevance that are not included in other fingerprints. Unpublished experiments have shown that predictions with MolPrint2D fingerprints are indeed more accurate than other OpenBabel @@ -357,11 +358,18 @@ threshold) and the number of predictable compounds (low threshold). As it is in many practical cases desirable to make predictions even in the absence of closely related neighbors, we follow a tiered approach: -First a similarity threshold of 0.5 is used to collect neighbors, to -create a local QSAR model and to make a prediction for the query -compound. If any of this steps fail, the procedure is repeated with a -similarity threshold of 0.2 and the prediction is flagged with a warning -that it might be out of the applicability domain of the training data. +\begin{itemize} +\tightlist +\item + First a similarity threshold of 0.5 is used to collect neighbors, to + create a local QSAR model and to make a prediction for the query + compound. 
+\item + If any of these steps fails, the procedure is repeated with a + similarity threshold of 0.2 and the prediction is flagged with a + warning that it might be out of the applicability domain of the + training data. +\end{itemize} Compounds with the same structure as the query structure are automatically @@ -413,7 +421,7 @@ domain. Quantitative applicability domain information can be obtained from the similarities of individual neighbors. Local regression models consider neighbor similarities to the query -compound, by weighting the contribution of each neighbor is by its +compound, by weighting the contribution of each neighbor is by similarity. The variability of local model predictions is reflected in the 95\% prediction interval associated with each prediction. @@ -471,7 +479,7 @@ baseline for evaluating prediction performance. In order to compare the structural diversity of both databases we evaluated the frequency of functional groups from the OpenBabel FP4 fingerprint. Figure~\ref{fig:fg} shows the frequency of functional -groups in both databases 139 functional groups with a frequency +groups in both databases. 139 functional groups with a frequency \textgreater{} 25 are depicted, the complete table for all functional groups can be found in the supplemental material at \href{https://github.com/opentox/loael-paper/blob/submission/data/functional-groups.csv}{GitHub}. @@ -574,15 +582,15 @@ analysis.}\label{fig:datacorr} \subsubsection{Local QSAR models}\label{local-qsar-models} In order to compare the performance of \emph{in silico} read across -models with experimental variability we are using compounds with -multiple measurements as a test set (375 measurements, 155 compounds). +models with experimental variability we used compounds with multiple +measurements as a test set (375 measurements, 155 compounds). 
\texttt{lazar} read across predictions were obtained for 155 compounds, 37 predictions failed, because no similar compounds were found in the training data (i.e.~they were not covered by the applicability domain of the training data). -Experimental data and 95\% prediction intervals overlapped in 100\% of -the test examples. +In 100\% of the test examples experimental LOAEL values were located +within the 95\% prediction intervals. Figure~\ref{fig:comp} shows a comparison of predicted with experimental values. Most predicted values were located within the experimental @@ -787,6 +795,40 @@ since evidence suggest that exposure duration has little impact on the levels of NOAELs/LOAELs (Zarn, Engeli, and Schlatter 2011, Zarn, Engeli, and Schlatter (2013)). +\subsubsection{\texorpdfstring{\texttt{lazar} +predictions}{lazar predictions}}\label{lazar-predictions} + +Table~\ref{tbl:common-pred}, Table~\ref{tbl:cv}, Figure~\ref{fig:comp}, +Figure~\ref{fig:corr} and Figure~\ref{fig:cv} clearly indicate that +\texttt{lazar} generates reliable predictions for compounds within the +applicability domain of the training data (i.e.~predictions without +warnings, which indicates a sufficient number of neighbors with +similarity \textgreater{} 0.5 to create local random forest models). +Correlation analysis (Table~\ref{tbl:common-pred}, Table~\ref{tbl:cv}) +shows, that errors (\(RMSE\)) and explained variance (\(r^2\)) are +comparable to experimental variability of the training data. + +Predictions with a warning (neighbor similarity \textless{} 0.5 and +\textgreater{} 0.2 or weighted average predictions) are a grey zone. +They still show a strong correlation with experimental data, but the +errors are larger than for compounds within the applicability domain +(Table~\ref{tbl:common-pred}, Table~\ref{tbl:cv}). Expected errors are +displayed as 95\% prediction intervals, which covers 100\% of the +experimental data. 
The main advantage of lowering the similarity +threshold is that it allows to predict a much larger number of +substances than with more rigorous applicability domain criteria. As +each of this prediction could be problematic, they are flagged with a +warning to alert risk assessors that further inspection is required. +This can be done in the graphical interface +(\url{https://lazar.in-silico.ch}) which provides intuitive means of +inspecting the rationales and data used for read across predictions. + +Finally there is a substantial number of compounds (37), where no +predictions can be made, because there are no similar compounds in the +training data. These compounds clearly fall beyond the applicability +domain of the training dataset and in such cases it is preferable to +avoid predictions instead of random guessing. --\textgreater{} + TODO: GUI screenshot \section{Summary}\label{summary} -- cgit v1.2.3
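Reviewer note: the tiered similarity-threshold procedure that this patch reformats into a bullet list (first 0.5, then 0.2 with a warning) can be illustrated with a minimal sketch. It is not the `lazar` implementation. The set-based Tanimoto similarity, the similarity-weighted average used as a stand-in for the local QSAR model, and the names `tanimoto`, `predict_at` and `tiered_predict` are all assumptions made for illustration only.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity of two feature sets (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_at(query, training, threshold, min_neighbors=1):
    """Similarity-weighted average over neighbors above `threshold`.

    Stand-in for a local QSAR model. Returns None when too few
    neighbors are found, i.e. when the step fails.
    """
    neighbors = []
    for fingerprint, value in training:
        if fingerprint == query:
            continue  # identical structures are eliminated from the neighbors
        similarity = tanimoto(query, fingerprint)
        if similarity > threshold:
            neighbors.append((similarity, value))
    if len(neighbors) < min_neighbors:
        return None
    total = sum(similarity for similarity, _ in neighbors)
    return sum(similarity * value for similarity, value in neighbors) / total


def tiered_predict(query, training):
    """Try the strict threshold first, then the relaxed one with a warning."""
    tiers = (
        (0.5, None),
        (0.2, "prediction may be out of the applicability domain"),
    )
    for threshold, warning in tiers:
        value = predict_at(query, training, threshold)
        if value is not None:
            return {"value": value, "threshold": threshold, "warning": warning}
    return None  # no similar compounds: no prediction is made


# Hypothetical toy data: (feature set, -log10 LOAEL) pairs.
training = [
    (frozenset("abcd"), 1.0),
    (frozenset("abce"), 3.0),
    (frozenset("axyz"), 5.0),
]
strict = tiered_predict(frozenset("abcd"), training)
relaxed = tiered_predict(frozenset("abpq"), training)
failed = tiered_predict(frozenset("mnop"), training)
```

In this toy data the first query finds a neighbor above 0.5, the second only succeeds at the relaxed threshold and therefore carries a warning, and the third has no neighbors at all, which corresponds to the failed predictions discussed in the patched text.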