(source code for the manuscript and validation experiments)
Docker image
~ (container with manuscript, validation experiments, `lazar` libraries and third party dependencies)
Results
=======
### Dataset comparison
The main objective of this section is to compare the content of both
databases in terms of structural composition and LOAEL values, to
estimate the experimental variability of LOAEL values and to establish a
baseline for evaluating prediction performance.
##### Structural diversity
To compare the structural diversity of both datasets we evaluated the
frequency of functional groups from the OpenBabel FP4 fingerprint. [@fig:fg]
shows the frequency of functional groups in both datasets. The figure depicts the 139
functional groups with a frequency > 25; the complete table for
all functional groups can be found in the supplemental
material at [GitHub](https://github.com/opentox/loael-paper/blob/submission/data/functional-groups.csv).
![Frequency of functional groups.](figures/functional-groups.pdf){#fig:fg}
This result was confirmed with a visual inspection using the
[CheS-Mapper](http://ches-mapper.org) (Chemical Space Mapping and
Visualization in 3D, @Guetlein2012)
tool.
CheS-Mapper can be used to analyze the relationship between the
structure of chemical compounds, their physico-chemical properties, and
biological or toxic effects. It depicts closely related (similar) compounds in 3D space and can be used with different kinds of features.
We have investigated structural as well as physico-chemical properties and
concluded that both datasets are very similar, both in terms of
chemical structures and physico-chemical properties.
The only statistically significant difference between the datasets is that the Mazzatorta dataset contains more small compounds (61 structures with fewer than 11 atoms) than the Swiss dataset (19 small structures, p-value 3.7E-7).
### Experimental variability versus prediction uncertainty
Duplicated LOAEL values can be found in both datasets, and a
substantial number of compounds (155) occur in both
datasets. These duplicates allow us to estimate the variability of
experimental results within individual datasets and between datasets.
Data with *identical* values (at five significant digits) in both datasets were excluded from the variability analysis, because it is likely that they originate from the same experiments.
##### Intra dataset variability
The Mazzatorta dataset has 567 LOAEL values for
445 unique structures, 93
compounds have multiple measurements with a mean standard deviation of
0.56 mmol/kg_bw/day (0.32 log10 units @mazzatorta08, [@fig:intra]).
The Swiss Federal Office dataset has 493 rat LOAEL values for
381 unique structures, 91 compounds have
multiple measurements with a mean standard deviation of
0.59 mmol/kg_bw/day (0.29 log10 units).
Standard deviations of both datasets do not show
a statistically significant difference with a p-value (t-test) of 0.21.
The combined test set has a mean standard deviation of 0.55 mmol/kg_bw/day (0.33 log10 units).
![Distribution and variability of LOAEL values in both datasets. Each vertical line represents a compound, dots are individual LOAEL values.](figures/dataset-variability.pdf){#fig:intra}
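The mean standard deviation of replicated measurements can be computed along the following lines. This is a minimal sketch, not the code used for the manuscript: the function name `mean_replicate_sd`, the input layout (a mapping from compound identifier to LOAEL values in mmol/kg_bw/day) and the toy values are assumptions made for illustration.

```python
import math
import statistics


def mean_replicate_sd(loaels_by_compound):
    """Mean sample standard deviation of log10 LOAEL values.

    loaels_by_compound: dict mapping a compound identifier to a list of
    LOAEL values (hypothetical input layout). Only compounds with at
    least two measurements contribute to the mean.
    """
    sds = []
    for values in loaels_by_compound.values():
        if len(values) < 2:
            continue  # a single measurement carries no variability information
        logs = [math.log10(v) for v in values]
        sds.append(statistics.stdev(logs))
    return statistics.mean(sds)


# Toy data (invented): "B" has a single value and is ignored.
toy = {"A": [0.1, 1.0], "B": [0.5], "C": [2.0, 2.0, 2.0]}
result = mean_replicate_sd(toy)
```

The standard deviation does not depend on the sign of the log transformation, so the sketch gives the same result for log10 and -log10 values.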
##### Inter dataset variability
[@fig:comp] shows the experimental LOAEL variability of compounds occurring in both datasets (i.e. the *test* dataset) colored in red (experimental). This is the baseline reference for the comparison with predicted values.
##### LOAEL correlation between datasets
[@fig:datacorr] depicts the correlation between LOAEL values from both datasets. As
both datasets contain duplicates we are using medians for the correlation plot
and statistics. Please note that the aggregation of duplicated measurements
into a single median value hides a substantial portion of the experimental
variability. Correlation analysis shows a significant (p-value < 2.2e-16)
correlation between the experimental data in both datasets
($r^2$: 0.52, RMSE: 0.59).
![Correlation of median LOAEL values from Mazzatorta and Swiss datasets. Data with identical values in both datasets was removed from analysis.](figures/median-correlation.pdf){#fig:datacorr}
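The aggregation into medians and the two summary statistics can be sketched as follows. The function name `median_agreement`, the dictionary layout and the toy values are assumptions, and the manuscript's own scripts may differ in detail (for example in how identical values are filtered beforehand).

```python
import math
import statistics


def median_agreement(dataset_a, dataset_b):
    """Return (r_squared, rmse, n) for median log10 LOAELs of shared compounds.

    dataset_a, dataset_b: dicts mapping a compound identifier to a list
    of LOAEL values (hypothetical input layout).
    """
    shared = sorted(set(dataset_a) & set(dataset_b))
    x = [statistics.median([math.log10(v) for v in dataset_a[c]]) for c in shared]
    y = [statistics.median([math.log10(v) for v in dataset_b[c]]) for c in shared]
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r_squared = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(shared))
    return r_squared, rmse, len(shared)


# Toy data (invented): only A, B and C occur in both dictionaries.
a = {"A": [1.0], "B": [10.0, 1000.0], "C": [1000.0], "D": [5.0]}
b = {"A": [10.0], "B": [100.0], "C": [10000.0], "E": [1.0]}
r2, rmse, n = median_agreement(a, b)
```

Because each compound contributes a single median per dataset, the spread of its individual measurements does not enter either statistic.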
### Local QSAR models
To compare the performance of in silico read across models with experimental
variability we use compounds that occur in both datasets as a test set
(375 measurements, 155 compounds).
`lazar` read across predictions
were attempted for all 155 compounds; 37
predictions failed because no similar compounds were found in the training data (i.e. they were not covered by the applicability domain of the training data).
Experimental data and 95\% prediction intervals overlapped in
100\% of the test examples.
[@fig:comp] shows a comparison of predicted with experimental values:
![Comparison of experimental with predicted LOAEL values. Each vertical line represents a compound, dots are individual measurements (red), predictions (green) or predictions with warnings (blue).](figures/test-prediction.pdf){#fig:comp}
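One way to quantify the overlap between experimental data and prediction intervals is the fraction of measurements that fall inside the interval of their compound. The sketch below assumes this per-measurement definition; the function name `interval_coverage`, the input layout and the toy values are hypothetical, and the manuscript may count overlap per compound instead.

```python
def interval_coverage(measurements, intervals):
    """Fraction of measurements inside the prediction interval of their compound.

    measurements: dict compound -> list of measured values (log10 units)
    intervals: dict compound -> (lower, upper) prediction interval bounds
    Compounds without a prediction are skipped.
    """
    inside = 0
    total = 0
    for compound, values in measurements.items():
        if compound not in intervals:
            continue  # no prediction available for this compound
        lower, upper = intervals[compound]
        for value in values:
            total += 1
            if lower <= value <= upper:
                inside += 1
    return inside / total if total else float("nan")


# Toy data (invented): compound "C" has no prediction.
measured = {"A": [1.0, 2.5], "B": [0.2], "C": [3.0]}
predicted = {"A": (0.5, 2.0), "B": (0.0, 1.0)}
coverage = interval_coverage(measured, predicted)
```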
Correlation analysis was performed between individual predictions and the
median of experimental data. All correlations are statistically highly
significant with a p-value < 2.2e-16. These results are presented in
[@fig:corr] and [@tbl:common-pred]. Please bear in mind that the aggregation of
multiple measurements into a single median value hides experimental variability.
Comparison | $r^2$ | RMSE | Nr. predicted
--------------|---------------------------|---------|---------------
Mazzatorta vs. Swiss dataset | 0.52 | 0.59
Predictions without warnings vs. test median | 0.48 | 0.56 | 34/155
Predictions with warnings vs. test median | 0.38 | 0.68 | 84/155
All predictions vs. test median | 0.4 | 0.65 | 118/155
: Comparison of model predictions with experimental variability. {#tbl:common-pred}
![Correlation of experimental with predicted LOAEL values (test set)](figures/prediction-test-correlation.pdf){#fig:corr}
For a further assessment of model performance three independent
10-fold cross-validations were performed. Results are summarised in [@tbl:cv] and [@fig:cv].
All correlations of predicted with experimental values are statistically highly significant with a p-value < 2.2e-16.
Predictions | $r^2$ | RMSE | Nr. predicted
--|-------|------|----------------
No warnings | 0.61 | 0.58 | 102/671
Warnings | 0.45 | 0.78 | 374/671
All | 0.47 | 0.74 | 476/671
| | |
No warnings | 0.59 | 0.6 | 101/671
Warnings | 0.45 | 0.77 | 376/671
All | 0.47 | 0.74 | 477/671
| | |
No warnings | 0.59 | 0.57 | 93/671
Warnings | 0.43 | 0.81 | 384/671
All | 0.45 | 0.77 | 477/671
: Results from 3 independent 10-fold crossvalidations {#tbl:cv}
![](figures/crossvalidation0.pdf){#fig:cv0 height=30%}
![](figures/crossvalidation1.pdf){#fig:cv1 height=30%}
![](figures/crossvalidation2.pdf){#fig:cv2 height=30%}
Correlation of predicted vs. measured values for three independent crossvalidations with *MP2D* fingerprint descriptors and local *random forest* models.
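The general idea of repeated 10-fold cross-validation can be sketched as follows. This is an illustration only: the helper `kfold_indices` and the seeding scheme are assumptions, and the `lazar` validation code may assign compounds to folds differently.

```python
import random


def kfold_indices(n_items, k=10, seed=0):
    """Split range(n_items) into k disjoint test folds after a seeded shuffle."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Every k-th index goes to the same fold, so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


# Three independent repetitions differ only in the shuffle seed.
# 671 is the denominator reported in the cross-validation table.
repetitions = [kfold_indices(671, k=10, seed=s) for s in range(3)]
```

In each repetition every item serves exactly once as a test case, while the model is built from the remaining nine folds.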
Discussion
==========
### Dataset comparison
Our investigations clearly indicate that the Mazzatorta and Swiss Federal Office datasets are very similar in terms of chemical structures and properties and the distribution of experimental LOAEL values. The only significant difference that we have observed is that the Mazzatorta dataset contains a larger number of small molecules than the Swiss Federal Office dataset. For this reason we have pooled both datasets into a single training dataset for read across predictions.
However, [@fig:intra], [@fig:corr] and [@tbl:common-pred] show considerable
variability in the experimental data. High experimental variability has an
impact on model building and on model validation. First, it influences model
quality by introducing noise into the training data. Second, it influences
accuracy estimates, because predictions have to be compared against noisy data
where "true" experimental values are unknown. This will become obvious in the
next section, where we compare predictions with experimental data.
### `lazar` predictions
[@tbl:common-pred], [@tbl:cv], [@fig:comp], [@fig:corr] and [@fig:cv] clearly
indicate that `lazar` generates reliable predictions for compounds within the
applicability domain of the training data (i.e. predictions without warnings,
which indicates a sufficient number of neighbors with similarity > 0.5 to
create local random forest models). Correlation analysis ([@tbl:common-pred],
[@tbl:cv]) shows that errors ($RMSE$) and explained variance ($r^2$) are
comparable to the experimental variability of the training data.
Predictions with a warning (neighbor similarity < 0.5 and > 0.2 or weighted
average predictions) are a grey zone. They still show a strong correlation with
experimental data, but the errors are larger than for compounds within the
applicability domain ([@tbl:common-pred], [@tbl:cv]). Expected errors are
displayed as 95\% prediction intervals, which cover
100\% of the experimental
data. The main advantage of lowering the similarity threshold is that it allows
the prediction of a much larger number of substances than more rigorous
applicability domain criteria. As each of these predictions could be problematic,
they are flagged with a warning to alert risk assessors that further inspection
is required. This can be done in the graphical interface
() which provides intuitive means of inspecting the
rationales and data used for read across predictions.
Finally, there is a substantial number of compounds
(37)
for which no predictions can be made, because there are no similar compounds in the training data. These compounds clearly fall outside the applicability domain of the training dataset,
and in such cases it is preferable to avoid predictions instead of random guessing.
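The three outcomes discussed in this section (prediction without warning, prediction with warning, no prediction) can be summarised in a small sketch. It is not the `lazar` implementation: the Tanimoto similarity on feature sets, the function names and the `min_neighbors` parameter are assumptions, and only the thresholds 0.5 and 0.2 are taken from the text. The weighted average case that also triggers a warning is not modelled.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two fingerprint feature sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_neighbors(query, training, min_neighbors=2, strict=0.5, relaxed=0.2):
    """Return (neighbors, status); status is 'ok', 'warning' or 'no prediction'.

    training: dict name -> fingerprint feature set (hypothetical layout)
    min_neighbors: hypothetical minimum number of close neighbors
    """
    sims = [(name, tanimoto(query, fp)) for name, fp in training.items()]
    close = [(n, s) for n, s in sims if s > strict]
    if len(close) >= min_neighbors:
        return close, "ok"  # within the applicability domain
    loose = [(n, s) for n, s in sims if s > relaxed]
    if loose:
        return loose, "warning"  # grey zone, needs manual inspection
    return [], "no prediction"  # no similar compounds in the training data


# Toy fingerprints (invented feature identifiers).
training = {"t1": {1, 2, 3, 4}, "t2": {1, 2, 3, 5}, "t3": {7, 8, 9}}
status_ok = select_neighbors({1, 2, 3}, training)[1]
status_warn = select_neighbors({1, 2, 10, 11}, training)[1]
status_none = select_neighbors({20, 21}, training)[1]
```

Returning an explicit status instead of a silent fallback keeps the distinction between the three cases visible to the person inspecting a prediction.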
Summary
=======
We have shown that `lazar` predictions within the applicability domain of the training data have the same variability as the experimental training data. In such cases experimental investigations can be substituted with in silico predictions.
Predictions with a lower similarity threshold can still give usable results, but the errors to be expected are higher and a manual inspection of prediction results is highly recommended.
References
==========