first revision from elena

author: Christoph Helma <helma@in-silico.ch> 2016-02-12 13:22:27 +0100
committer: Christoph Helma <helma@in-silico.ch> 2016-02-12 13:22:27 +0100
commit: 015a7532988e3f76b9835ee8e8df8e89e9ef4c8c (patch)
tree: 145e92b645abad4b05b0e46596e35f1297d2fc0c
parent: 87294b79e16e8f21446b8a232f06c956e6b7e81e (diff)
4 files changed, 225 insertions, 51 deletions
diff --git a/paper/Makefile b/paper/Makefile
new file mode 100644
index 0000000..5fa2f47
--- /dev/null
+++ b/paper/Makefile
@@ -0,0 +1,12 @@
+loael.pdf: loael.md functional-groups.pdf loael-dataset-correlation.pdf
+	pandoc --filter pandoc-citeproc loael.md -s -o loael.pdf
+
+loael.docx: loael.md functional-groups.pdf loael-dataset-correlation.pdf
+	pandoc --filter pandoc-citeproc loael.md -s -o loael.docx
+
+
+functional-groups.pdf: functional-groups-reduced4R.csv functional-groups.R
+	R CMD BATCH functional-groups.R
+
+loael-dataset-correlation.pdf: loael-dataset-comparison.rb
+	ruby loael-dataset-comparison.rb
diff --git a/paper/loael.md b/paper/loael.md
index 2bb2add..ac7a50d 100644
--- a/paper/loael.md
+++ b/paper/loael.md
@@ -1,8 +1,11 @@
 ---
-title: "lazar read across models for lowest adverse effect levels: A comparison of experimental variability with read across predictions"
-author: Christoph Helma^1^, David Vorgrimmler^1^, Denis Gebele^1^, Elena Lo Piparo^2^
-E-mail: helma@in-silico.ch
-include-before: ^1^ in silico toxicology gmbh,  Basel, Switzerland\newline^2^ Chemical Food Safety Group, Nestlé Research Center, Lausanne, Switzerland
+author: |
+    Christoph Helma^1^, David Vorgrimmler^1^, Denis Gebele^1^, Martin Gütlein^2^, Benoit
+    Schilter^3^, Elena Lo Piparo^3^
+title: |
+    Modeling Chronic Toxicity: A comparison of experimental variability with
+    read across predictions
+include-before: ^1^ in silico toxicology gmbh,  Basel, Switzerland\newline^2^ Inst. f. Computer Science, Johannes Gutenberg Universität Mainz, Germany\newline^3^ Chemical Food Safety Group, Nestlé Research Center, Lausanne, Switzerland
 keywords: (Q)SAR, read-across, LOAEL
 date: \today
 abstract: " "
@@ -10,89 +13,172 @@ documentclass: achemso
 bibliography: references.bib
 bibliographystyle: achemso
 biblio-style: achemso
----
+...
+
+Introduction
+============
 
-# Introduction
+Christoph + Elena + Benoit
 
 The main objectives of this study are
 
- - to investigate the experimental variability of LOAEL data
- - develop predictive model for lowest observed effect levels
- - compare the performance of model predictions with experimental variability
+-   to investigate the experimental variability of LOAEL data
+
+-   to develop predictive models for lowest observed adverse effect levels
 
-# Methods
+-   to compare the performance of model predictions with experimental
+    variability
 
-## Data
+Materials and Methods
+=====================
+
+Datasets
+--------
 
 ### Mazzatorta dataset
 
+Refer to the 2008 paper for a description of this dataset.
+
 ### Swiss Federal Office dataset
 
-  Only rat LOAEL values were used for the current investigation, because they correspond directly to the Mazzatorta dataset.
+Elena + Swiss Federal Office contribution (input)
+
+Only rat LOAEL values were used for the current investigation, because
+they correspond directly to the Mazzatorta dataset.
 
 ### Preprocessing
 
-  Chemical structures in both datasets are represented as SMILES strings [@doi:10.1021/ci00057a005]. Syntactically incorrect and missing SMILES were generated from other identifiers (e.g names, CAS numbers) when possible.
-  Studies with undefined (“0”) or empty LOAEL entries were removed for this study. 
+Christoph
 
-## Algorithms
+Chemical structures in both datasets are represented as SMILES strings
+(Weininger 1988). Syntactically incorrect and missing SMILES were
+generated from other identifiers (e.g. names, CAS numbers) when possible.
+Studies with undefined (“0”) or empty LOAEL entries were removed for
+this study.
 
-  For this study we are using the modular lazar (*la*zy *s*tructure *a*ctivity *r*elationships) framework [@Maunz2013] for model development and validation. 
+Algorithms
+----------
 
-  lazar follows the following basic workflow: For a given chemical structure it searches in a database for similar structures (neighbors) with experimental data, builds a local (Q)SAR model with these neighbors and uses this model to predict the unknown activity of the query compound. This procedure resembles an automated version of *read across* predictions in toxicology, in machine learning terms it would be classified as a *k-nearest-neighbor* algorithm. 
+Christoph
 
-  Apart from this basic workflow lazar is completely modular and allows the researcher to use any algorithm for neighbor identification and local (Q)SAR modelling. Within this study we are using the following algorithms:
+For this study we are using the modular lazar (*la*zy *s*tructure
+*a*ctivity *r*elationships) framework (Maunz et al. 2013) for model
+development and validation.
+
+lazar uses the following basic workflow: for a given chemical
+structure it searches in a database for similar structures (neighbors)
+with experimental data, builds a local (Q)SAR model with these neighbors
+and uses this model to predict the unknown activity of the query
+compound. This procedure resembles an automated version of *read across*
+predictions in toxicology; in machine learning terms it would be
+classified as a *k-nearest-neighbor* algorithm.
+
+Apart from this basic workflow lazar is completely modular and allows
+the researcher to use any algorithm for neighbor identification and
+local (Q)SAR modelling. Within this study we are using the following
+algorithms:
 
 ### Neighbor identification
 
-  Similarity calculations are based on MolPrint2D fingerprints [@doi:10.1021/ci034207y] from the OpenBabel chemoinformatics library [@OBoyle2011]. 
+Christoph
 
-  The MolPrint2D fingerprint uses atom environments as molecular representation, which resemble basically the chemical concept of functional groups. For each atom in a molecule it represents the chemical environment with the atom types of connected atoms.
+Similarity calculations are based on MolPrint2D fingerprints (Bender et
+al. 2004) from the OpenBabel chemoinformatics library (O'Boyle et al.
+2011).
 
-  The main advantage of MolPrint2D fingerprints over fingerprints with predefined substructures (such as OpenBabel FP3, FP4 or MACCs fingerprints) is that it may capture substructures of toxicological relevance that are not included in predefined substructure lists.
-  Preliminary experiments have shown that predictions with MolPrint2D fingerprints are indeed more accurate than fingerprints with predefined substructures.
+The MolPrint2D fingerprint uses atom environments as molecular
+representation, which basically resemble the chemical concept of
+functional groups. For each atom in a molecule it represents the
+chemical environment with the atom types of connected atoms.
 
-  From MolPrint2D fingerprints we can construct a feature vector with all atom environments of a compound, which can be used to calculate chemical similarities.
+The main advantage of MolPrint2D fingerprints over fingerprints with
+predefined substructures (such as OpenBabel FP3, FP4 or MACCS
+fingerprints) is that they may capture substructures of toxicological
+relevance that are not included in predefined substructure lists.
+Preliminary experiments have shown that predictions with MolPrint2D
+fingerprints are indeed more accurate than predictions with fingerprints
+based on predefined substructures.
 
-[//]: # https://openbabel.org/docs/dev/FileFormats/MolPrint2D_format.html#molprint2d-format
+From MolPrint2D fingerprints we can construct a feature vector with all
+atom environments of a compound, which can be used to calculate chemical
+similarities.
 
-  The chemical similarity between two compounds is expressed as the proportion between atom environments common in both structures and the total number of atom environments (Jaccard/Tanimoto index (@sim)).
+[//]: # https://openbabel.org/docs/dev/FileFormats/MolPrint2D_format.html#molprint2d-format
 
-  (@sim) $sim = \frac{|A \cap B|}{|A \cup B|}$, $A$ atom environments of compound A, $B$ atom environments of compound B.
+The chemical similarity between two compounds is expressed as the
+ratio of the atom environments common to both structures to the
+total number of atom environments (Jaccard/Tanimoto index (1)).
 
+(1) $sim = \frac{|A \cap B|}{|A \cup B|}$, $A$ atom environments of
+    compound A, $B$ atom environments of compound B.
 
 ### Local (Q)SAR models
 
-As soon as neighbors for a query compound have been identified, we can use their experimental LOAEL values to predict the activity of the untested compound. In this case we are using the weighted mean of the neighbors LOAEL values, where the contribution of each neighbor is weighted by its similarity to the query compound.
+Christoph
+
+As soon as neighbors for a query compound have been identified, we can
+use their experimental LOAEL values to predict the activity of the
+untested compound. In this case we are using the weighted mean of the
+neighbors' LOAEL values, where the contribution of each neighbor is
+weighted by its similarity to the query compound.
 
 ### Validation
 
-# Results
+Christoph
+
+Results
+=======
 
 ### Dataset comparison
 
-  The main objective of this section is to compare the content of both databases in terms of structural composition and LOAEL values, to estimate the experimental variability of LOAEL values and to establish a baseline for evaluating prediction performance.
+Christoph + Elena
 
-#### Structural composition
+The main objective of this section is to compare the content of both
+databases in terms of structural composition and LOAEL values, to
+estimate the experimental variability of LOAEL values and to establish a
+baseline for evaluating prediction performance.
+
+#### Applicability domain
 
 ##### Ches-Mapper analysis
 
-  CheS-Mapper (Chemical Space Mapping and Visualization in 3D, http://ches-mapper.org/, [@Gütlein2012]) can be used to analyze the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. CheS-Mapper embeds a dataset into 3D space, such that compounds with similar feature values are close to each other. 
-  The following two screenshots visualise the comparison. The datasets are embeded into 3D Space based on structural fragments from three Smart list (OpenBabel FP3, OpenBabel FP4 and OpenBabel MACCS). 
+Christoph
+
+CheS-Mapper (Chemical Space Mapping and Visualization in 3D,
+http://ches-mapper.org/; Gütlein, Karwath, and Kramer 2012) can be
+used to analyze the relationship between the structure of chemical
+compounds, their physico-chemical properties, and biological or toxic
+effects. CheS-Mapper embeds a dataset into 3D space, such that compounds
+with similar feature values are close to each other. The following two
+screenshots visualise the comparison. The datasets are embedded into 3D
+space based on structural fragments from three SMARTS lists (OpenBabel
+FP3, OpenBabel FP4 and OpenBabel MACCS).
 
 ##### Distribution of functional groups
 
-  Figure 1 shows the frequency of selected functional groups in both datasets. A complete table for 138 functional groups from OpenBabel FP4 fingerprints can be found in the appendix.
+Christoph
+
+Figure 1 shows the frequency of selected functional groups in both
+datasets. A complete table for 138 functional groups from OpenBabel FP4
+fingerprints can be found in the appendix.
 
 ![Frequency of functional groups.](functional-groups.pdf)
 
-#### LOAEL values
+### Experimental variability versus prediction uncertainty 
 
-  Duplicated LOAEL values can be found in both datasets and there is a substantial overlap of compounds, with LOAEL values in both datasets.
+Christoph
+
+Both datasets contain compounds with more than one LOAEL value, and a
+substantial number of compounds have LOAEL values in both datasets.
 
 ##### Intra dataset variability
 
-  The Mazzatorta dataset has 562 LOAEL values with 439 unique structures, the Swiss Federal Office dataset has 493 rat LOAEL values with 381 unique structures. Figure 2 shows the intra-dataset variability, where each vertical line represents a single compound and each dot represents an individual LOAEL value. The experimental variance of LOAEL values is similar in both datasets (p-value: 0.48).
+The Mazzatorta dataset has 562 LOAEL values with 439 unique structures,
+the Swiss Federal Office dataset has 493 rat LOAEL values with 381
+unique structures. Figure 2 shows the intra-dataset variability, where
+each vertical line represents a single compound and each dot represents
+an individual LOAEL value. The experimental variance of LOAEL values is
+similar in both datasets (p-value: 0.48).
 
 [//]: # p-value: 0.4750771581019402
 
@@ -100,15 +186,18 @@ As soon as neighbors for a query compound have been identified, we can use their
 
 ##### Inter dataset variability
 
-  Figure 3 shows the same situation for the combination of the Mazzatorta and Swiss Federal Office datasets. Obviously the experimental variability is larger than for individual datasets.
+Figure 3 shows the same situation for the combination of the Mazzatorta
+and Swiss Federal Office datasets. The experimental variability of the
+combined data is larger than for the individual datasets.
 
 ![Inter dataset variability](loael-dataset-comparison-common-compounds.pdf)
 
-##### LOAEL correlation between datasets
-
-  Figure 4 depicts the correlation between LOAEL data from both datasets (using means for multiple measurements). Correlation analysis shows a significant correlation with r^2: 0.61, RMSE: 1.22, MAE: 0.80
 
+##### LOAEL correlation between datasets
 
+Figure 4 depicts the correlation between LOAEL data from both datasets
+(using means for multiple measurements). Correlation analysis shows a
+significant correlation with $r^2$: 0.61, RMSE: 1.22, MAE: 0.80.
 
 [//]: #   MAE: 0.801626064534318
 [//]: # with identical values
@@ -116,20 +205,47 @@ As soon as neighbors for a query compound have been identified, we can use their
 ![LOAEL correlation](loael-dataset-correlation.pdf)
 
 
-### Read across predictions
-
-# Discussion
+### Local (Q)SAR models
 
-### Chemical similarity 
+Christoph
 
-### LOAEL variability
+Discussion
+==========
 
-### Predictive performance
+### Elena + Benoit
 
 ### 
 
-# Summary
-
-  $var1$
-
-# References
+Summary
+=======
+
+References
+==========
+
+Bender, Andreas, Hamse Y. Mussa, Robert C. Glen, and Stephan Reiling. 2004.
+“Molecular Similarity Searching Using Atom Environments,
+Information-Based Feature Selection, and a Naïve Bayesian Classifier.”
+*Journal of Chemical Information and Computer Sciences* 44 (1): 170–78.
+doi:[10.1021/ci034207y](https://doi.org/10.1021/ci034207y).
+
+Gütlein, Martin, Andreas Karwath, and Stefan Kramer. 2012. “CheS-Mapper -
+Chemical Space Mapping and Visualization in 3D.” *Journal of
+Cheminformatics* 4 (1): 7.
+doi:[10.1186/1758-2946-4-7](https://doi.org/10.1186/1758-2946-4-7).
+
+Maunz, Andreas, Martin Gütlein, Micha Rautenberg, David Vorgrimmler,
+Denis Gebele, and Christoph Helma. 2013. “Lazar: A Modular Predictive
+Toxicology Framework.” *Frontiers in Pharmacology* 4. Frontiers Media
+SA.
+doi:[10.3389/fphar.2013.00038](https://doi.org/10.3389/fphar.2013.00038).
+
+O'Boyle, Noel M, Michael Banck, Craig A James, Chris Morley, Tim
+Vandermeersch, and Geoffrey R Hutchison. 2011. “Open Babel: An Open
+Chemical Toolbox.” *Journal of Cheminformatics* 3 (1). Springer Science
+and Business Media: 33.
+doi:[10.1186/1758-2946-3-33](https://doi.org/10.1186/1758-2946-3-33).
+
+Weininger, David. 1988. “SMILES, a Chemical Language and Information
+System. 1. Introduction to Methodology and Encoding Rules.” *Journal of
+Chemical Information and Computer Sciences* 28 (1): 31–36.
+doi:[10.1021/ci00057a005](https://doi.org/10.1021/ci00057a005).
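
Note on the manuscript text in this patch: the Jaccard/Tanimoto index (1) and the similarity-weighted mean described under "Local (Q)SAR models" can be illustrated with a minimal Ruby sketch. This is not lazar's API. The method names `tanimoto` and `weighted_mean`, the placeholder strings standing in for atom environments, and the activity values are all assumptions made up for illustration.

```ruby
require 'set'

# Jaccard/Tanimoto index |A ∩ B| / |A ∪ B| for two collections of
# atom environments (any hashable objects work, here plain strings).
def tanimoto(a, b)
  a = a.to_set
  b = b.to_set
  union = (a | b).size
  return 0.0 if union.zero? # two empty sets: define similarity as 0
  (a & b).size.to_f / union
end

# Similarity-weighted mean of neighbor activities.
# neighbors: array of [similarity, value] pairs
def weighted_mean(neighbors)
  weight_sum = neighbors.sum { |sim, _| sim }
  return nil if weight_sum.zero? # no usable neighbors, no prediction
  neighbors.sum { |sim, value| sim * value } / weight_sum
end

# Hypothetical placeholders, not real MolPrint2D output
query      = %w[env1 env2 env3 env4]
neighbor_a = %w[env1 env2 env3]
neighbor_b = %w[env1 env5]

sim_a = tanimoto(query, neighbor_a)
sim_b = tanimoto(query, neighbor_b)

# Made-up activity values for the two neighbors
prediction = weighted_mean([[sim_a, 2.0], [sim_b, 4.0]])
puts "similarities: #{sim_a}, #{sim_b}"
puts "prediction: #{prediction}"
```

The more similar neighbor dominates the prediction, which is the behaviour the text describes; how neighbors are selected (for example a similarity threshold) is not specified in this revision and is left out of the sketch.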
diff --git a/paper/loael.pdf b/paper/loael.pdf
index d77517d..93749fc 100644
--- a/paper/loael.pdf
+++ b/paper/loael.pdf
diff --git a/paper/rmse.rb b/paper/rmse.rb
new file mode 100644
index 0000000..0d5ac11
--- /dev/null
+++ b/paper/rmse.rb
@@ -0,0 +1,46 @@
+require_relative '../../lazar/lib/lazar'
+include OpenTox
+
+old = Dataset.from_csv_file File.join(File.dirname(__FILE__),"..","regression","LOAEL_mg_corrected_smiles_mmol.csv")
+new = Dataset.from_csv_file File.join(File.dirname(__FILE__),"..","regression","swissRat_chron_LOAEL_mmol.csv")
+
+[old,new].each do |dataset|
+  rmse = 0
+  nr = 0
+  dataset.compound_ids.each do |cid|
+    c = Compound.find cid
+    values = dataset.values(c,dataset.features.first)
+    if values.size > 1
+      center = -Math.log(values.mean) # -ln of the arithmetic mean (not a median)
+      values.each do |v|
+        rmse += (-Math.log(v) - center)**2
+        nr += 1
+      end
+    end
+  end
+  p nr
+  rmse = Math.sqrt(rmse/nr)
+  p "#{dataset.name}: #{rmse}"
+end
+
+
+rmse = 0
+nr = 0
+(old.compound_ids & new.compound_ids).each do |cid|
+  c = Compound.find cid
+  values = old.values(c,old.features.first) + new.values(c,new.features.first)
+  p values.size
+  if values.size > 1
+    center = -Math.log(values.mean) # -ln of the arithmetic mean (not a median)
+    values.each do |v|
+      rmse += (-Math.log(v) - center)**2
+      nr += 1
+    end
+  end
+end
+p nr
+rmse = Math.sqrt(rmse/nr)
+p "combined: #{rmse}"
+
+#combined_rmse = Math.sqrt(combined_rmse/combined_nr)
+#p "combined: #{combined_rmse}"