repeated crossvalidations

author: Christoph Helma <helma@in-silico.ch> 2016-03-02 11:20:26 +0100
committer: Christoph Helma <helma@in-silico.ch> 2016-03-02 11:20:26 +0100
commit: d3071896a7116670756199f0df7c2a618de2aea3 (patch)
tree: 2cf71d47232c08da2973452950e1f969c733478a /loael.Rmd
parent: 7424234dbf1d7ebdb7a15adaec71c8b6fb53890f (diff)
1 files changed, 49 insertions, 31 deletions
diff --git a/loael.Rmd b/loael.Rmd
index 5686698..8fb301f 100644
--- a/loael.Rmd
+++ b/loael.Rmd
@@ -41,7 +41,7 @@ Elena: do you have a reference and the name of the department?
 m = read.csv("data/mazzatorta.csv",header=T)
 s = read.csv("data/swiss.csv",header=T)
 t = read.csv("data/test.csv",header=T)
-c = read.csv("data/combined.csv",header=T)
+c = read.csv("data/training.csv",header=T)
 ```
 
 `r length(unique(t$SMILES))` compounds are common in both datasets and we use them as a test set in our investigation. For this test set we will
@@ -81,38 +81,42 @@ chemical structures.
 
 Elena + Swiss Federal Office contribution (input)
 
-The Swiss Federal Office dataset consists of `r length(s$SMILES)` LOAEL values
-for `r length(unique(s$SMILES))` unique chemical structures.
+The original Swiss Federal Office dataset has chronic toxicity data for rats,
+mice and multi generation effects. For the purpose of this study only rat LOAEL
+data was used. This leads to the *Swiss Federal Office*  dataset with `r length(s$SMILES)` rat LOAEL
+values for `r length(unique(s$SMILES))` unique chemical structures.
 
 ### Preprocessing
 
-Chemical structures in both datasets were initially represented as SMILES
-strings [@doi:10.1021/ci00057a005]. Syntactically incorrect and missing SMILES
-were generated from other identifiers (e.g names, CAS numbers). Unique smiles
-from the OpenBabel library [@OBoyle2011] were used for the identification of
-duplicated structures. 
+Chemical structures (represented as SMILES [@doi:10.1021/ci00057a005]) in both
+datasets were checked for correctness, syntactically incorrect and missing
+SMILES were generated from other identifiers (e.g names, CAS numbers). Unique
+smiles from the OpenBabel library [@OBoyle2011] were used for the
+identification of duplicated structures. 
 
 Studies with undefined or empty LOAEL entries were removed from the datasets.
-LOAEL values were converted to mmol/kg_bw/day units. For prediction, validation
-and visualisation purposes -log10 transformations are used.
-
-David: please check if we have missed something
+LOAEL values were converted to mmol/kg_bw/day units and rounded to five
+significant digits. For prediction, validation and visualisation purposes
+-log10 transformations are used.
 
 ### Derived datasets
 
 Two derived datasets were obtained from the original datasets: 
 
 The *test* dataset contains data of compounds that occur in both datasets.
-Exact duplications of LOAEL values were removed, because it is very likely that
-they originate from the same study.  The test dataset has `r length(t$SMILES)`
-LOAEL values for `r length(unique(t$SMILES))` unique chemical structures.
-
-The *combined* dataset is the union of the Mazzatorta and the Swiss Federal
-Office dataset and it is used to build predictive models. Exact LOAEL
-duplications were removed, as for the test dataset.  The combined dataset has
-`r length(c$SMILES)` LOAEL values for `r length(unique(c$SMILES))` unique
+LOAEL values equal at five significant digits were considered as duplicates
+originating from the same study/publication and only one instance was kept in
+the test dataset.  Exact duplications of LOAEL values were removed, because it
+is very likely that they originate from the same study.  The test dataset has
+`r length(t$SMILES)` LOAEL values for `r length(unique(t$SMILES))` unique
 chemical structures.
 
+The *training* dataset is the union of the Mazzatorta and the Swiss Federal
+Office dataset and it is used to build predictive models. LOAEL duplicates were
+removed, as for the test dataset.  The training dataset has `r
+length(c$SMILES)` LOAEL values for `r length(unique(c$SMILES))` unique chemical
+structures.
+
 Algorithms
 ----------
 
@@ -203,7 +207,7 @@ Prediction intervals were obtained from the `predict` function.
 
 For the comparison of experimental variability with predictive accuracies we
 are using a test set of compounds that occur in both datasets. The
-*Mazzatorta*, *Swiss Federal Office* and *combined* datasets are used as
+*Mazzatorta*, *Swiss Federal Office* and *training* datasets are used as
 training data for read across predictions. In order to obtain unbiased
 predictions *all* information from the test compound is removed from the
 training set prior to predictions. This procedure is hardcoded into the
@@ -231,11 +235,17 @@ baseline for evaluating prediction performance.
 ##### Ches-Mapper analysis
 
 We applied the visualization tool CheS-Mapper (Chemical Space Mapping and Visualization in 3D,
-http://ches-mapper.org, @Gütlein2012) to compare both datasets. CheS-Mapper can be used to analyze the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. It embeds a dataset into 3D space, such that compounds with similar feature values are close to each other. CheS-Mapper is generic and can be employed with different kinds of features. [@fig:ches-mapper-pc] shows an embedding that is based on physico-chemical (PC) descriptors: we determined that both datasets have very similar PC feature values.
+http://ches-mapper.org, @Gütlein2012) to compare both datasets. CheS-Mapper can be used to analyze the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. It embeds a dataset into 3D space, such that compounds with similar feature values are close to each other. CheS-Mapper is generic and can be employed with different kinds of features. [@fig:ches-mapper-pc] shows an embedding that is based on physico-chemical (PC) descriptors.
+
+![Compounds from the Mazzatorta and the Swiss Federal Office dataset are highlighted in red and green. Compounds that occur in both datasets are highlighted in magenta. ](figure/pc-small-compounds-highlighted.png){#fig:ches-mapper-pc}
+
+Martin: explain  light colors at bottom of histograms
 
+In this example, CheS-Mapper applied a principal components analysis to map compounds according to their physico-chemical (PC) feature values into 3D space. Both datasets have in general very similar PC feature values. As an exception, the Mazzatorta dataset includes most of the tiny compound structures: we have selected the 78 smallest compounds (with 10 atoms and less, marked with a blue box in the screen-shot) and found that 61 of these compounds occur in the Mazzatorta dataset, whereas only 19 are contained in the Swiss dataset (p-value 3.7E-7).
 
-We extended CheS-Mapper with a functionality to mine the same MolPrint2D features that are utilized for model building in this work. Applying a minimum frequency of 3 yields 760 distinguished MolPrint2D fragments for the composed dataset of 671 unique compounds. Again, a visual inspection confirmed that both datasets are structurally very similar. However, CheS-Mapper allows the detection of features that help to distinguish groups of selected compounds from the entire dataset. Hence, we found discriminating features for compounds that occur in only one of both datasets, and for the most active or in-active compounds (see [@tbl:molprint]). As an example, [@fig:ches-mapper-alert] shows 9 compounds that match a specific fragment (all other compounds in the dataset do not match this fragment) and have very low mean LOAEL values.
+This result was confirmed for structural features (fingerprints) including MolPrint2D features that are utilized for model building in this work.
 
+In general we concluded that both datasets are very similar, in terms of chemical structures and physico-chemical properties. 
 
 ##### Distribution of functional groups
 
@@ -311,9 +321,9 @@ The Mazzatorta, the Swiss Federal Office dataset and a combined dataset were use
 ![Comparison of experimental with predicted LOAEL values, each vertical line represents a compound, dots are individual measurements (red) or predictions (green).](figure/test-prediction.pdf){#fig:comp}
 
 ```{r echo=F}
-combined = read.csv("data/combined-test-predictions.csv",header=T)
-combined.r_square = round(rsquare(-log(combined$LOAEL_measured_median),-log(combined$LOAEL_predicted)),2)
-combined.rmse = round(rmse(-log(combined$LOAEL_measured_median),-log(combined$LOAEL_predicted)),2)
+training = read.csv("data/training-test-predictions.csv",header=T)
+training.r_square = round(rsquare(-log(training$LOAEL_measured_median),-log(training$LOAEL_predicted)),2)
+training.rmse = round(rmse(-log(training$LOAEL_measured_median),-log(training$LOAEL_predicted)),2)
 ```
 
 TODO: nr unpredicted, nr predictions outside of experimental values
@@ -325,16 +335,22 @@ These results are presented in [@fig:corr] and [@tbl:cv]. Please bear in mind th
 Training data | $r^2$                     | RMSE                    
 --------------|---------------------------|-------------------------
 Experimental | `r median.r.square`      | `r median.rmse`           
-Combined             | `r combined.r_square` | `r combined.rmse` 
+Combined             | `r training.r_square` | `r training.rmse` 
 
 : Comparison of model predictions with experimental variability. {#tbl:common-pred}
 
 ![Correlation of experimental with predicted LOAEL values (test set)](figure/test-correlation.pdf){#fig:corr}
 
 ```{r echo=F}
-combined = read.csv("data/combined-cv.csv",header=T)
-cv.combined.r_square = round(rsquare(-log(combined$LOAEL_measured_median),-log(combined$LOAEL_predicted)),2)
-cv.combined.rmse = round(rmse(-log(combined$LOAEL_measured_median),-log(combined$LOAEL_predicted)),2)
+t0 = read.csv("data/training-cv-0.csv",header=T)
+cv.t0.r_square = round(rsquare(-log(t0$LOAEL_measured_median),-log(t0$LOAEL_predicted)),2)
+cv.t0.rmse = round(rmse(-log(t0$LOAEL_measured_median),-log(t0$LOAEL_predicted)),2)
+t1 = read.csv("data/training-cv-1.csv",header=T)
+cv.t1.r_square = round(rsquare(-log(t1$LOAEL_measured_median),-log(t1$LOAEL_predicted)),2)
+cv.t1.rmse = round(rmse(-log(t1$LOAEL_measured_median),-log(t1$LOAEL_predicted)),2)
+t2 = read.csv("data/training-cv-2.csv",header=T)
+cv.t2.r_square = round(rsquare(-log(t2$LOAEL_measured_median),-log(t2$LOAEL_predicted)),2)
+cv.t2.rmse = round(rmse(-log(t2$LOAEL_measured_median),-log(t2$LOAEL_predicted)),2)
 ```
 TODO: repeated CV
 
@@ -343,7 +359,9 @@ All correlations are statistically highly significant with a p-value < 2.2e-16.
 
 Training dataset | $r^2$ | RMSE 
 -----------------|-------|------
-Combined | `r round(cv.combined.r_square,2)`  | `r round(cv.combined.rmse,2)` 
+Combined | `r round(cv.t0.r_square,2)`  | `r round(cv.t0.rmse,2)` 
+Combined | `r round(cv.t1.r_square,2)`  | `r round(cv.t1.rmse,2)` 
+Combined | `r round(cv.t2.r_square,2)`  | `r round(cv.t2.rmse,2)` 
 
 : 10-fold crossvalidation results {#tbl:cv}
author	Christoph Helma <helma@in-silico.ch>	2016-03-02 11:20:26 +0100
committer	Christoph Helma <helma@in-silico.ch>	2016-03-02 11:20:26 +0100
commit	d3071896a7116670756199f0df7c2a618de2aea3 (patch)
tree	2cf71d47232c08da2973452950e1f969c733478a /loael.Rmd
parent	7424234dbf1d7ebdb7a15adaec71c8b6fb53890f (diff)