1 files changed, 68 insertions, 14 deletions
diff --git a/paper/loael.Rmd b/paper/loael.Rmd
index 65f9b34..a94e88a 100644
--- a/paper/loael.Rmd
+++ b/paper/loael.Rmd
@@ -11,6 +11,9 @@ documentclass: achemso
 bibliography: references.bib
 bibliographystyle: achemso
 biblio-style: achemso
+output:
+  pdf_document:
+    fig_caption: yes
 ...
 
 Introduction
@@ -105,10 +108,11 @@ similarities.
 
 The chemical similarity between two compounds is expressed as the
 proportion between atom environments common in both structures and the
-total number of atom environments (Jaccard/Tanimoto index (1)).
+total number of atom environments (Jaccard/Tanimoto index, [@eq:jaccard]).
 
-(1) $sim = \frac{|A \cap B|}{|A \cup B|}$, $A$ atom environments of
-    compound A, $B$ atom environments of compound B.
+$$ sim = \frac{|A \cap B|}{|A \cup B|} $$ {#eq:jaccard}
+
+$A$ atom environments of compound A, $B$ atom environments of compound B.
 
 ### Local (Q)SAR models
 
@@ -156,14 +160,18 @@ FP3, OpenBabel FP4 and OpenBabel MACCS).
 
 Christoph
 
-Figure 1 shows the frequency of selected functional groups in both
+[@fig:fg] shows the frequency of selected functional groups in both
 datasets. A complete table for 138 functional groups from OpenBabel FP4
 fingerprints can be found in the appendix.
 
-![Frequency of functional groups.](functional-groups.pdf)
+![Frequency of functional groups.](functional-groups.pdf){#fig:fg}
 
 ### Experimental variability versus prediction uncertainty 
 
+```{r echo=FALSE}
+source('rmse.R')
+```
+
 Christoph
 
 Duplicated LOAEL values can be found in both datasets and there is a
@@ -173,40 +181,86 @@ substantial overlap of compounds, with LOAEL values in both datasets.
 
 The Mazzatorta dataset has 562 LOAEL values with 439 unique structures,
 the Swiss Federal Office dataset has 493 rat LOAEL values with 381
-unique structures. Figure 2 shows the intra-dataset variability, where
+unique structures. [@fig:intra] shows the intra-dataset variability, where
 each vertical line represents a single compound and each dot represents
 an individual LOAEL value. The experimental variance of LOAEL values is
 similar in both datasets (p-value: 0.48).
 
 [//]: # p-value: 0.4750771581019402
 
-![Intra dataset variability: Each vertical line represents a compound, dots are individual LOAEL values.](loael-dataset-comparison-all-compounds.pdf)
+![Intra dataset variability: Each vertical line represents a compound, dots are individual LOAEL values.](loael-dataset-comparison-all-compounds.pdf){#fig:intra}
 
 ##### Inter dataset variability
 
-Figure 3 shows the same situation for the combination of the Mazzatorta
+[@fig:inter] shows the same situation for the combination of the Mazzatorta
 and Swiss Federal Office datasets. Obviously the experimental
 variability is larger than for individual datasets.
 
-![Inter dataset variability](loael-dataset-comparison-common-compounds.pdf)
-
+![Inter dataset variability](loael-dataset-comparison-common-compounds.pdf){#fig:inter}
 
 ##### LOAEL correlation between datasets
 
-Figure 4 depicts the correlation between LOAEL data from both datasets
-(using means for multiple measurements). Correlation analysis shows a
-significant correlation with r\^2: 0.61, RMSE: 1.22, MAE: 0.80
+[@fig:corr-1] depicts the correlation between LOAEL data from both datasets
+(using means for multiple measurements).
+Identical values were removed from analysis.
 
 [//]: #   MAE: 0.801626064534318
 [//]: # with identical values
 
-![LOAEL correlation](loael-dataset-correlation.pdf)
+```{r fig.cap="Correlation of dataset medians (-log10(LOAEL [mmol/kg_bw])", fig.lp="fig:", echo=F}
+library(ggplot2)
+data <- read.csv("data/common-median.csv",header=T)
+print(qplot(-log10(mazzatorta),-log10(swiss),data=data,xlab="Mazzatorta",ylab="Swiss Federal Office") + geom_point() + geom_abline(intercept=0.0) )
+cor <- cor.test(-log(data$mazzatorta),-log(data$swiss))
+median.r.square <- cor(-log(data$mazzatorta),-log(data$swiss),use='complete')**2
+median.rmse <- sqrt(mean((-log(data$mazzatorta)+log(data$swiss))^2))
+``` 
 
+Correlation analysis shows a
+significant correlation (p-value < 2.2e-16) with r\^2: `r round(median.r.square,2)`, RMSE: `r round(median.rmse,2)`
 
 ### Local (Q)SAR models
 
 Christoph
 
+In order to compare the perfomance of in silico models with experimental variability we are using compounds that occur in both datasets as a test set (`r  system("./nr_compounds.sh data/common-test.csv",TRUE)` compounds, `r  system("./nr_measurements.sh data/common-test.csv",TRUE)` measurements).
+
+The Mazzatorta, the Swiss Federal Office dataset and a combined dataset were used as training data. Predictions for the test set compounds were made after eliminating all information from the test compound from the corresponding training dataset. [@tbl:common-pred] summarizes the results:
+
+```{r echo=F}
+validation <- read.csv("test-set-validation.csv",header=T)
+```
+Training data | Model prediction | Experimental variability
+--------------|------------------|-------------------------
+Mazzatorta | `r round(validation$rmse[1],2)`  | `r round(mazzatorta.rmse,2)`
+Swiss Federal Office |`r round(validation$rmse[2],2)`  | `r round(swiss.rmse,2)`
+Commmon | `r round(validation$rmse[3],2)`| `r common.rmse`
+Combined | | `r combined.rmse`
+
+: Comparison of model predictions with experimental variability. {#tbl:common-pred}
+
+```{r echo=F}
+source("crossvalidations.R")
+```
+Traditional 10-fold cross-validation results are summarised in [@tbl:cv]:
+
+Training dataset | $r^2$ | RMSE | MAE
+-----------------|-------|------|----
+Mazzatorta | `r round(cv.mazzatorta.r.squared,2)`  | `r round(cv.mazzatorta.rmse,2)`| `r round(cv.mazzatorta.mae,2)`
+Swiss Federal Office | `r round(cv.swiss.r.squared,2)`  | `r round(cv.swiss.rmse,2)`| `r round(cv.swiss.mae,2)`
+Combined | `r round(cv.combined.r.squared,2)`  | `r round(cv.combined.rmse,2)`| `r round(cv.combined.mae,2)`
+
+: 10-fold crossvalidation results {#tbl:cv}
+
+[//]: # ```{r fig.cap="Comparison of predictions with measured values (-log10(LOAEL [mmol/kg_bw])", fig.lp="fig:", echo=F}
+
+```{r predictions, fig.cap='Comparison of predictions with measured values (-log10(LOAEL [mmol/kg_bw])', echo=F}
+library(ggplot2)
+data <- read.csv("data/common-test.csv",header=T)
+sorted = data[ order(-log10(data$LOAEL)), ]
+ggplot(sorted, aes(SMILES,-log10(LOAEL),ymin = min(-log10(LOAEL)), ymax=max(-log10(LOAEL)),color=Dataset)) + geom_point() + ylab('-log(LOAEL mg/kg_bw/day)') + xlab('Compound') + theme(axis.text.x = element_blank())
+``` 
+
 Discussion
 ==========