ch additions to manuscript

author: Christoph Helma <helma@in-silico.ch> 2020-10-17 21:16:24 +0200
committer: Christoph Helma <helma@in-silico.ch> 2020-10-17 21:16:24 +0200
commit: a1ebe0133a978e99ebfd1146efbd791824c56205 (patch)
tree: 12b6142449739d38356a2617017f050df7fd02a1
parent: c8ea095f5036f2fe6031cfa31ed6c00ca602fcee (diff)
5 files changed, 169 insertions, 144 deletions
diff --git a/Makefile b/Makefile
index 911aa52..856d87c 100644
--- a/Makefile
+++ b/Makefile
@@ -62,13 +62,13 @@ tables/pa-tab.tex: scripts/pa-table.rb
 	scripts/pa-table.rb > $@
 
 tables/lazar-summary.csv: $(CV_SUMMARY)
-	scripts/summaries2table.rb lazar > $@
+	scripts/summary2table.rb lazar > $@
 
 tables/r-summary.csv: $(CV_SUMMARY)
-	scripts/summaries2table.rb R > $@
+	scripts/summary2table.rb R > $@
 
 tables/tensorflow-summary.csv: $(CV_SUMMARY)
-	scripts/summaries2table.rb tensorflow > $@
+	scripts/summary2table.rb tensorflow > $@
 
 # crossvalidation summary
 
diff --git a/bibliography.bib b/bibliography.bib
index a8f0d5e..4e39f74 100644
--- a/bibliography.bib
+++ b/bibliography.bib
@@ -1,3 +1,12 @@
+@article{Maaten2008,
+  author = {van der Maaten, L.J.P. and Hinton, G.E.},
+  title = {Visualizing Data Using t-SNE},
+  journal = {Journal of Machine Learning Research},
+  year = 2008,
+  volume = 9,
+  pages = "2579–2605"
+}
+
 @article{Helma2018,
   author = { Christoph Helma and David Vorgrimmler and Denis Gebele and Martin Gütlein and Barbara Engeli and Jürg Zarn and Benoit Schilter and Elena Lo Piparo},
   title = "Modeling Chronic Toxicity: A comparison of experimental variability with {(Q)SAR}/read-across predictions",
@@ -31,8 +40,6 @@ eprint = {
 
 }
 
-
-
 @Article{Kazius2005,
   author = "Kazius, J. and McGuire, R. and Bursi, R.",
   year = 2005,
diff --git a/mutagenicity.md b/mutagenicity.md
index ea377bb..a102afa 100644
--- a/mutagenicity.md
+++ b/mutagenicity.md
@@ -1,8 +1,6 @@
 ---
 title: A comparison of nine machine learning models based on an expanded mutagenicity dataset and their application for predicting pyrrolizidine alkaloid mutagenicity
 
-#title: A comparison of random forest, support vector machine, linear regression, deep learning and lazar algorithms for predicting the mutagenic potential of different pyrrolizidine alkaloids 
-#subtitle: Performance comparison with a new expanded dataset
 author:
   - Christoph Helma:
       institute: ist
@@ -14,6 +12,7 @@ author:
       institute: sysbio
   - Jürgen Drewe:
       institute: zeller
+
 institute:
   - ist:
       name: in silico toxicology gmbh
@@ -24,6 +23,7 @@ institute:
   - sysbio:
       name: Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association
       address: "Robert-Rössle-Strasse 10, Berlin, 13125, Germany"
+
 bibliography: bibliography.bib
 keywords: mutagenicity, QSAR, lazar, random forest, support vector machine, linear regression, neural nets, deep learning
 
@@ -31,28 +31,26 @@ documentclass: scrartcl
 tblPrefix: Table
 figPrefix: Figure
 header-includes:
-    - \usepackage{setspace}
+    - \usepackage{lineno, setspace, color, colortbl, longtable}
     - \doublespacing
-    - \usepackage{lineno}
-    - \usepackage{color, colortbl, longtable}
     - \linenumbers
 ...
 
 Abstract
 ========
 
-<!---
-Random forest, support vector machine, linear regression, deep learning and k-nearest neighbor
+Random forest, support vector machine, logistic regression, neural networks and k-nearest neighbor
 (`lazar`) algorithms, were applied to new *Salmonella* mutagenicity dataset
 with 8309 unique chemical structures. The best prediction accuracies in
-10-fold-crossvalidation were obtained with `lazar` models, that gave accuracies
+10-fold-crossvalidation were obtained with `lazar` models and MolPrint2D descriptors, which gave accuracies
 similar to the interlaboratory variability of the Ames test.
---->
+
+**TODO**: PA results
 
 Introduction
 ============
 
-TODO
+**TODO**: rationale for investigation
 
 <!---
 Pyrrolizidine alkaloids (PAs) are secondary plant ingredients found in
@@ -121,9 +119,9 @@ pyrrolizidine
 
 The main objectives of this study were
 
-  - to generate a new training dataset, by combining the most comprehensive public mutagenicity datasets
-  - to compare the performance of global models (RF, SVM, LR, NN) with local models (`lazar`)
-  - to compare the performance of MolPrint2D fingerprints with PaDEL descriptors
+  - to generate a new mutagenicity training dataset by combining the most comprehensive public datasets
+  - to compare the performance of MolPrint2D (*MP2D*) fingerprints with PaDEL descriptors
+  - to compare the performance of global QSAR models (random forests (*RF*), support vector machines (*SVM*), logistic regression (*LR*), neural nets (*NN*)) with local models (`lazar`)
   - to apply these models for the prediction of pyrrolizidine alkaloid mutagenicity
 
 Materials and Methods
@@ -163,11 +161,12 @@ under a GPL3 License. The new combined dataset can be found at
 
 ### Pyrrolizidine alkaloid (PA) dataset
 
-The testing dataset consisted of 602 different PAs. The compilation of
-the PA dataset is described in detail in [Schöning et al.
-(2017)](#_ENREF_119).
+The testing dataset consisted of 602 different PAs.
+
+**TODO**: **Verena** Could you briefly summarise the sources and selection criteria?
 
-TODO: **Verena** Quellen und Auswahlkriterien
+The compilation of the PA dataset is described in detail in [Schöning et al.
+(2017)](#_ENREF_119).
 
 <!---
 The PAs were assigned to groups according to
@@ -198,11 +197,10 @@ For the necic acid, following groups were assigned:
 
 -   Macrocyclic diester-type
 --->
-
 Descriptors
 -----------
 
-### MolPrint2D fingerprints (*MP2D*)
+### MolPrint2D (*MP2D*) fingerprints
 
 MolPrint2D fingerprints (@OBoyle2011a) use atom environments as molecular
 representation.  They determine for each atom in a molecule, the atom types of
@@ -225,7 +223,7 @@ library (@OBoyle2011a).
 
 #### PaDEL descriptors
 
-For R and Tensorflow models, molecular 1D and 2D descriptors were calculated
+Molecular 1D and 2D descriptors were calculated
 with the PaDEL-Descriptors program (<http://www.yapcwsoft.com> version 2.21, @Yap2011). 
 
 As the training dataset contained over 8309 instances, it was decided to
@@ -234,25 +232,25 @@ Furthermore, substances with equivocal outcome were removed. The final
 training dataset contained 8080 instances with known mutagenic
 potential.
 
-During feature
-selection, descriptor with near zero variance were removed using
-'*NearZeroVar*'-function (package 'caret'). If the percentage of the
-most common value was more than 90% or when the frequency ratio of the
-most common value to the second most common value was greater than 95:5
-(e.g. 95 instances of the most common value and only 5 or less instances
-of the second most common value), a descriptor was classified as having
-a near zero variance. After that, highly correlated descriptors were
-removed using the '*findCorrelation*'-function (package 'caret') with a
-cut-off of 0.9. This resulted in a training dataset with 516
-descriptors. These descriptors were scaled to be in the range between 0
-and 1 using the '*preProcess*'-function (package 'caret'). The scaling
-routine was saved in order to apply the same scaling on the testing
-dataset. As these three steps did not consider the outcome, it was
-decided that they do not need to be included in the cross-validation of
-the model. To further reduce the number of features, a LASSO (*least
-absolute shrinkage and selection operator*) regression was performed
-using the '*glmnet*'-function (package '*glmnet*'). The reduced dataset
-was used for the generation of the pre-trained models.
+During feature selection, descriptors with near zero variance were removed
+using the '*nearZeroVar*'-function (package 'caret'). If the percentage of the most
+common value was more than 90% or when the frequency ratio of the most common
+value to the second most common value was greater than 95:5 (e.g. 95 instances
+of the most common value and only 5 or fewer instances of the second most common
+value), a descriptor was classified as having a near zero variance. After that,
+highly correlated descriptors were removed using the
+'*findCorrelation*'-function (package 'caret') with a cut-off of 0.9. This
+resulted in a training dataset with 516 descriptors. These descriptors were
+scaled to be in the range between 0 and 1 using the '*preProcess*'-function
+(package 'caret'). The scaling routine was saved in order to apply the same
+scaling on the testing dataset. As these three steps did not consider the
+dependent variable (experimental mutagenicity), it was decided that they do not
+need to be included in the cross-validation of the model. To further reduce the
+number of features, a LASSO (*least absolute shrinkage and selection operator*)
+regression was performed using the '*glmnet*'-function (package '*glmnet*').
+The reduced dataset was used for the generation of the pre-trained models.
+
+PaDEL descriptors were used in global (RF, SVM, LR, NN) and local (`lazar`) models.
 
 Algorithms
 ----------
@@ -276,7 +274,7 @@ in toxicology, in machine learning terms it would be classified as a
 k-nearest-neighbour algorithm.
 
 Apart from this basic workflow, `lazar` is completely modular and allows
-the researcher to use any algorithm for similarity searches and local
+the researcher to use arbitrary algorithms for similarity searches and local
 QSAR (*Quantitative structure--activity relationship*) modelling.
 Algorithms used within this study are described in the following
 sections.
@@ -327,10 +325,9 @@ in the presence of duplicates.
 Only similar compounds (neighbours) above the threshold are used for
 local QSAR models. In this investigation, we are using a weighted
 majority vote from the neighbour's experimental data for mutagenicity
-classifications. Probabilities for both classes
-(mutagenic/non-mutagenic) are calculated according to the following
-formula and the class with the higher probability is used as prediction
-outcome.
+classifications. Probabilities for both classes (mutagenic/non-mutagenic) are
+calculated according to the following formula and the class with the higher
+probability is used as prediction outcome.
 
 $$p_{c} = \ \frac{\sum_{}^{}\text{sim}_{n,c}}{\sum_{}^{}\text{sim}_{n}}$$
 
@@ -343,13 +340,13 @@ $\sum_{}^{}\text{sim}_{n}$ Sum of all neighbours
 
 The applicability domain (AD) of `lazar` models is determined by the
 structural diversity of the training data. If no similar compounds are
-found in the training data no predictions will be generated. Warnings
-are issued if the similarity threshold had to be lowered from 0.5 to 0.2
-in order to enable predictions. Predictions without warnings can be
-considered as close to the applicability domain (*high confidence*) and predictions with
-warnings as more distant from the applicability domain (*low confidence*). Quantitative
-applicability domain information can be obtained from the similarities
-of individual neighbours.
+found in the training data, no predictions will be generated. Warnings are
+issued if the similarity threshold had to be lowered from 0.5 to 0.2 in order
+to enable predictions. Predictions without warnings can be considered as close
+to the applicability domain (*high confidence*) and predictions with warnings
+as more distant from the applicability domain (*low confidence*). Quantitative
+applicability domain information can be obtained from the similarities of
+individual neighbours.
 
 #### Availability
 
@@ -375,17 +372,19 @@ software (R-project for Statistical Computing,
 <https://www.r-project.org/>*;* version 3.3.1), specific R packages used
 are identified for each step in the description below. 
 
-#### Random Forest
+#### Random Forest (*RF*)
 
 For the RF model, the '*randomForest*'-function (package
 '*randomForest*') was used. A forest with 1000 trees with maximal
 terminal nodes of 200 was grown for the prediction.
 
-#### Support Vector Machines
+#### Support Vector Machines (*SVM*)
 
 The '*svm*'-function (package 'e1071') with a *radial basis function
 kernel* was used for the SVM model.
 
+**TODO**: **Verena, Philipp** Should we refer to the DL models as neural nets (NN), as we do for the Tensorflow models?
+
 #### Deep Learning
 
 The DL model was generated using the '*h2o.deeplearning*'-function
@@ -397,8 +396,6 @@ Weights and biases were in a first step determined with an unsupervised
 DL model. These values were then used for the actual, supervised DL
 model.
 
-TODO: **Verena** kannst Du bitte ueberpruefen, ob das noch stimmt und ggf die Figure 1 anpassen
-
 To validate these models, an internal cross-validation approach was
 chosen. The training dataset was randomly split in training data, which
 contained 95% of the data, and validation data, which contain 5% of the
@@ -408,15 +405,17 @@ repeated five times. Based on each of the five different training data,
 the predictive models were trained and the performance tested with the
 validation data. This step was repeated 10 times. 
 
+**TODO**: **Verena** Could you please check whether this is still correct and adjust Figure 1 if necessary?
+
 ![Flowchart of the generation and validation of the models generated in R-project](figures/image1.png){#fig:valid}
 
 <!--
-TODO: **Verena** Ich hab die *Applicability domain* section weggelassen, da sie ansc
+**TODO**: **Verena** I have left out the *Applicability domain* section, because it ansc
 -->
 
 #### Applicability domain
 
-TODO: **Verena**: Mit welchen Deskriptoren hast Du den Jaccard index berechnet?  Fuer den Jaccard index braucht man binaere Deskriptoren (zB MP2D), mit PaDEL Deskriptoren koennte man zB eine euklidische oder cosinus Distanz berechnen.
+**TODO**: **Verena**: Which descriptors did you use to calculate the Jaccard index? The Jaccard index requires binary descriptors (e.g. MP2D); with PaDEL descriptors one could calculate, for example, a Euclidean or cosine distance.
 
 The AD of the training dataset and the PA dataset was evaluated using
 the Jaccard distance. A Jaccard distance of '0' indicates that the
@@ -432,22 +431,6 @@ R scripts for these experiments can be found in https://git.in-silico.ch/mutagen
 
 ### Tensorflow models
 
-TODO: **Philipp** bitte ergaenzen
-
-#### Random forests
-
-#### Logistic regression (SGD)
-
-#### Logistic regression (scikit)
-
-#### Neural Nets
-
-Alternatively, a DL model was established with Python-based Tensorflow
-program (<https://www.tensorflow.org/>) using the high-level API Keras
-(<https://www.tensorflow.org/guide/keras>) to build the models. 
-
-Tensorflow models used the same PaDEL descriptors as the R models.
-
 Data pre-processing was done by rank transformation using the
 '*QuantileTransformer*' procedure. A sequential model has been used.
 Four layers have been used: input layer, two hidden layers (with 12, 8
@@ -460,8 +443,25 @@ loss using the default parameters of Keras. Training was performed for
 100 epochs with a batch size of 64. The model was implemented with
 Python 3.6 and Keras. 
 
-TODO: **Philipp** kannst Du bitte ueberpruefen ob die Beschreibung noch stimmt
-und ob der Ablauf von Verena (Figure 1) auch fuer Deine Modelle gilt
+**TODO**: **Philipp** I have left out the old results with feature selection, is that ok? In that case this paragraph would have to be deleted as well, right?
+
+**TODO**: **Philipp** Could you please complete the following paragraphs?
+
+#### Random forests (*RF*)
+
+#### Logistic regression (SGD) (*LR-sgd*)
+
+#### Logistic regression (scikit) (*LR-scikit*)
+
+**TODO**: **Philipp, Verena** DL or NN?
+
+#### Neural Nets (*NN*)
+
+Alternatively, a DL model was established with Python-based Tensorflow
+program (<https://www.tensorflow.org/>) using the high-level API Keras
+(<https://www.tensorflow.org/guide/keras>) to build the models. 
+
+Tensorflow models used the same PaDEL descriptors as the R models.
 
 Validation
 ----------
@@ -497,23 +497,28 @@ Crossvalidation results are summarized in the following tables: @tbl:lazar shows
 Confusion matrices for all models are available from the git repository http://git.in-silico.ch/mutagenicity-paper/10-fold-crossvalidations/confusion-matrices/, individual predictions can be found in 
 http://git.in-silico.ch/mutagenicity-paper/10-fold-crossvalidations/predictions/.
 
-The most accurate crossvalidation predictions have been obtained with `lazar` models with MolPrint2D descriptors ({{lazar-high-confidence.acc}} for predictions with high confidence, {{lazar-all.acc}} for all predictions). Models utilizing PaDEL descriptors have generally lower accuracies ranging from TODO to TODO. Sensitivity and specificity is generally well balanced with the exception of `lazar`-PaDEL (low sensitivity) and R deep learning (low specificity) models.
-
+The most accurate crossvalidation predictions have been obtained with standard `lazar` models using MolPrint2D descriptors ({{lazar-high-confidence.acc}} for predictions with high confidence, {{lazar-all.acc}} for all predictions). Models utilizing PaDEL descriptors generally have lower accuracies, ranging from {{R-DL}} (R deep learning) to {{R-RF}} (R/Tensorflow random forests). Sensitivity and specificity are generally well balanced, with the exception of `lazar`-PaDEL (low sensitivity) and R deep learning (low specificity) models.
 
 Pyrrolizidine alkaloid mutagenicity predictions 
 -----------------------------------------------
 
-Pyrrolizidine alkaloid mutagenicity predictions are summarized in @tab:pa. 
-
-@fig:tsne-mp2d shows the position of pyrrolizidine alkaloids (PA) in the mutagenicity training dataset in MP2D space
+Mutagenicity predictions from all investigated models for 602 pyrrolizidine alkaloids are summarized in Table 4. 
 
 \input{tables/pa-tab.tex}
 
-![t-sne visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA)](figures/tsne-mp2d.png){#fig:tsne-mp2d}
+Training data and 
+pyrrolizidine alkaloids were visualised with t-distributed stochastic neighbor embedding (t-SNE, @Maaten2008)
+for MolPrint2D and PaDEL descriptors.  t-SNE maps each high-dimensional object
+(chemical) to a two-dimensional point. Similar objects are represented by
+nearby points and dissimilar objects are represented by distant points.
+
+@fig:tsne-mp2d shows the t-SNE of pyrrolizidine alkaloids (PA) and the mutagenicity training data in MP2D space (Tanimoto/Jaccard similarity).
+
+![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA)](figures/tsne-mp2d.png){#fig:tsne-mp2d}
 
-@fig:tsne-padel shows the position of pyrrolizidine alkaloids (PA) in the mutagenicity training dataset in PADEL space
+@fig:tsne-padel shows the t-SNE of pyrrolizidine alkaloids (PA) and the mutagenicity training data in PaDEL space (Euclidean distance).
 
-![t-sne visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA)](figures/tsne-padel.png){#fig:tsne-padel}
+![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA)](figures/tsne-padel.png){#fig:tsne-padel}
 
 Discussion
 ==========
@@ -554,52 +559,48 @@ predictions comes from Tensorflow models ({{tensorflow-all.n}}). Standard
 and Tensorflow models. This is not necessarily a disadvantage, because `lazar`
 abstains from predictions, if the query compound is very dissimilar from the
 compounds in the training set and thus avoids to make predictions for compounds
-that do not fall into its applicability domain. 
-
-There are two major differences between `lazar` and R/Tensorflow models, which
-might explain the different prediction accuracies:
-
-- `lazar` uses MolPrint2D fingerprints, while all other models use PaDEL descriptors
-- `lazar` creates local models for each query compound and the other models use a single global model for all predictions
-
-We will discuss both options in the following sections.
+outside its applicability domain. 
 
 Descriptors
 -----------
 
-This study uses two types of descriptors to characterize chemical structures.
+This study uses two types of descriptors for the characterisation of chemical
+structures:
 
-MolPrint2D fingerprints (MP2D, @Bender2004) use atom environments (i.e.
-connected atoms for all atoms in a molecule) as molecular representation, which
-resembles basically the chemical concept of functional groups. MP2D descriptors
-are used to determine chemical similarities in lazar, and previous experiments
-have shown, that they give more accurate results than predefined descriptors
-(e.g.  MACCS, FP2-4) for all investigated endpoints.
+*MolPrint2D* fingerprints (MP2D, @Bender2004) use atom environments (i.e.
+connected atom types for all atoms in a molecule) as molecular representation,
+which basically resembles the chemical concept of functional groups. MP2D
+descriptors are used to determine chemical similarities in the default `lazar`
+settings, and previous experiments have shown that they give more accurate
+results than predefined fragments (e.g. MACCS, FP2-4).
+
+In order to investigate whether MP2D fingerprints are also suitable for global
+models, we have tried to build R and Tensorflow models, both with and without
+unsupervised feature selection. Unfortunately, none of the algorithms was
+capable of dealing with the large and sparsely populated descriptor matrix. Based
+on this result we can conclude that MolPrint2D descriptors are at the moment
+unsuitable for standard global machine learning algorithms.
+
+`lazar` does not suffer from the size and sparseness problem, because (a) it
+internally uses a much more efficient occurrence-based representation and
+(b) it uses fingerprints only for similarity calculations and not as model
+parameters.
 
 PaDEL calculates topological and physical-chemical descriptors.
 
-TODO: **Verena** kannst Du bitte die Deskriptoren nochmals kurz beschreiben
-
-PaDEL descriptors were used for the R and Tensorflow models. In addition we
-have used PaDEL descriptors to calculate cosine similarities for the `lazar`
-algorithm and compared the results with standard MP2D similarities, which led
-to a significant decrease of `lazar` prediction accuracies. Based on this
-result we can conclude, that PaDEL descriptors are less suited for similarity
-calculations than MP2D descriptors.
-
-In order to investigate, if MP2D fingerprints are also a better option for
-global models we have tried to build R and Tensorflow models both with and
-without unsupervised feature selection. Unfortunately none of the algorithms
-was capable to deal with the large and sparsely populated descriptor matrix.
-Based on this result we can conclude, that MP2D descriptors are at the moment
-unsuitable for standard global machine learning algorithms. Please note that
-`lazar` does not suffer from the sparseness problem, because (a) it utilizes
-internally a much more efficient occurrence based representation and (b) it
-uses fingerprints only for similarity calculations and mot as model parameters.
-
-Based on these results we can conclude, that PaDEL descriptors are less suited
-for similarity calculations than MP2D fingerprints and that current standard
-machine learning algorithms are not capable to utilize chemical fingerprints.
+**TODO**: **Verena** Could you please briefly describe the descriptors again?
+
+*PaDEL* descriptors were used for `lazar`, R and Tensorflow models.  All models
+based on PaDEL descriptors had similar crossvalidation accuracies that were
+significantly lower than `lazar` MolPrint2D results.  Direct comparisons are
+available only for the `lazar` algorithm, and also in this case PaDEL
+accuracies were lower than MolPrint2D accuracies.
+
+Based on `lazar` results we can conclude that PaDEL descriptors are less
+suited for chemical similarity calculations than MP2D descriptors. It is also
+likely that PaDEL descriptors lead to less accurate predictions for global
+models, but we cannot draw any definitive conclusion in the absence of MP2D
+models.
 
 Algorithms
 ----------
@@ -612,16 +613,29 @@ query compound. R and Tensorflow models are in contrast *global models*, i.e. a
 single model is used to make predictions for all compounds. It has been
 postulated in the past, that local models are more accurate, because they can
 account better for mechanisms, that affect only a subset of the training data.
-Our results seem to support this assumption, because `lazar` models perform
-better than global models. Both types of models use however different
-descriptors, and for this reason we cannot draw a definitive conclusion if the
-model algorithm or the descriptor type are the reason for the observed
-differences. In order to answer this question, we would have to use global
-modelling algorithms that are capable to handle large, sparse binary matrices.
+Our results seem to support this assumption, because standard `lazar` models
+with MolPrint2D descriptors perform better than global models. The accuracy of
+`lazar` models with PaDEL descriptors is however substantially lower and
+comparable to global models with the same descriptors.
+
+This observation may lead to the conclusion that the choice of suitable
+descriptors is more important for predictive accuracy than the modelling
+algorithm, but we were unable to obtain global MP2D models for direct
+comparisons.  The selection of an appropriate modelling algorithm is still
+crucial, because it needs the capability to handle the descriptor space.
+Neighbour (and thus similarity) based algorithms like `lazar` have a clear
+advantage in this respect over global machine learning algorithms (e.g. RF, SVM,
+LR, NN), because Tanimoto/Jaccard similarities can be calculated efficiently
+with simple set operations. 
+
+Pyrrolizidine alkaloid mutagenicity predictions
+-----------------------------------------------
+
+**TODO**: **Philipp** For my part (general overview, applicability domain) I would still need the Tensorflow results.
 
-Mutagenicity of PAs
--------------------
+**TODO**: **Verena** I would leave most of the discussion here to you. If you want to discuss specific lazar results, I can compile detailed predictions (with similar compounds and their activities) for individual examples.
 
+<!---
 Due to the low to moderate predictivity of all models, quantitative
 statement on the genotoxicity of single PAs cannot be made with
 sufficient confidence.
@@ -730,19 +744,23 @@ issues:
     metabolic activation of PAs by microsomal enzymes was the
     sensitivity-limiting step. This could very well mean that this is
     also reflected in the QSAR models.
-
+--->
 
 Conclusions
 ===========
 
 A new public *Salmonella* mutagenicity training dataset with 8309 compounds was
-created and used it to train `lazar`, R and Tensorflow models. The best
-performance was obtained with `lazar` models using MolPrint2D descriptors, with
-prediction accuracies comparable to the interlaboratory variability of the Ames
-test. Differences between algorithms (local vs. global models) and/or
-descriptors (MolPrint2D vs PaDEL) may be responsible for the different
-prediction accuracies. 
+created and used to train `lazar`, R and Tensorflow models with MolPrint2D
+and PaDEL descriptors. The best performance was obtained with `lazar` models
+using MolPrint2D descriptors, with prediction accuracies
+({{lazar-high-confidence.acc}}) comparable to the interlaboratory variability
+of the Ames test (80-85%). Models based on PaDEL descriptors had lower
+accuracies than MolPrint2D models, but only the `lazar` algorithm could use
+MolPrint2D descriptors.
+
+**TODO**: PA predictions
 
+<!---
 In this study, an attempt was made to predict the genotoxic potential of
 PAs using five different machine learning techniques (LAZAR, RF, SVM, DL
 (R-project and Tensorflow). The results of all models fitted only partly
@@ -761,7 +779,7 @@ possible mechanisms of toxicity.
 In further studies, additional machine learning techniques or a modified
 (extended) training dataset should be used for an additional attempt to
 predict the genotoxic potential of PAs.
-
+--->
 
 References
 ==========
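
The similarity-weighted majority vote and the set-based Tanimoto/Jaccard similarity described in the mutagenicity.md hunks above can be sketched in Ruby, the language of the repository's helper scripts. This is a minimal illustration under stated assumptions, not the `lazar` implementation: the method names `tanimoto`, `weighted_vote` and `predict`, the representation of fingerprints as `Set`s of feature identifiers, and the inclusive `>=` threshold comparison are all hypothetical. Only the formula (class similarity sum divided by total similarity sum) and the 0.5 and 0.2 thresholds are taken from the manuscript text.

```ruby
require "set"

# Tanimoto/Jaccard similarity of two fingerprints, each given as a Set of
# feature identifiers: |A & B| / |A | B|.
def tanimoto(a, b)
  union = (a | b).size
  return 0.0 if union.zero?
  (a & b).size.to_f / union
end

# Similarity-weighted vote: p_c = sum(sim of neighbours in class c) / sum(sim of all neighbours).
# training is an array of [fingerprint_set, class_label] pairs.
# Returns nil if no training compound reaches the threshold.
def weighted_vote(query, training, threshold: 0.5)
  neighbours = training.map { |fp, label| [tanimoto(query, fp), label] }
                       .select { |sim, _| sim >= threshold } # inclusive comparison is an assumption
  return nil if neighbours.empty?
  total = neighbours.sum { |sim, _| sim }
  probabilities = Hash.new(0.0)
  neighbours.each { |sim, label| probabilities[label] += sim / total }
  probabilities
end

# Try the default threshold first, then the lowered one; a prediction that
# needed the lowered threshold is flagged with a warning (low confidence).
def predict(query, training)
  [0.5, 0.2].each do |threshold|
    probabilities = weighted_vote(query, training, threshold: threshold)
    next if probabilities.nil?
    label = probabilities.max_by { |_, p| p }.first
    return { prediction: label, probabilities: probabilities, warning: threshold < 0.5 }
  end
  nil # no similar compounds: abstain from prediction
end
```

Because both similarity and vote only need set intersections and unions, no dense descriptor matrix is ever built, which is the efficiency argument made in the Discussion hunk.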
diff --git a/scripts/summary2roc.rb b/scripts/summary2roc.rb
index dbac2f4..e50d97a 100755
--- a/scripts/summary2roc.rb
+++ b/scripts/summary2roc.rb
@@ -4,6 +4,6 @@ require "yaml"
 data = YAML.load(File.read ARGV[0])
 puts "tpr,fpr"
 data.each do |algo,values|
-  algo = algo.sub("tensorflow","Tensorflow").sub("selected","FS").sub(".v3","").sub("-all"," (all)").sub("-high-confidence"," (high confidence)").sub("padel","PaDEL").sub("lazar ","lazar-MP2D ").sub("lr2","LR (scikit)").sub("lr","LR (SGD)").sub("nn","NN").sub("-rf","-RF")
+  algo = algo.sub("tensorflow","Tensorflow").sub("selected","FS").sub(".v3","").sub("-all"," (all)").sub("-high-confidence"," (high confidence)").sub("padel","PaDEL").sub("lazar ","lazar-MP2D ").sub("lr2","LR-scikit").sub("lr","LR-sgd").sub("nn","NN").sub("-rf","-RF")
   puts [algo,values[:tpr],values[:fpr]].join(",")
 end
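
The chained `sub` calls in `summary2roc.rb` depend on their order: `lr2` has to be rewritten before `lr`, otherwise a key containing `lr2` would end up as `LR-sgd2`. The sketch below expresses the same renaming as an ordered table, which may make that dependency easier to see. The substitution pairs are copied from the new line of the hunk above; the constant `SUBSTITUTIONS`, the helper `display_name` and the example keys in the note are hypothetical and not part of the repository.

```ruby
# Ordered substitutions, copied from the updated line in summary2roc.rb.
# Order matters: "lr2" must be handled before "lr".
SUBSTITUTIONS = [
  ["tensorflow", "Tensorflow"],
  ["selected", "FS"],
  [".v3", ""],
  ["-all", " (all)"],
  ["-high-confidence", " (high confidence)"],
  ["padel", "PaDEL"],
  ["lazar ", "lazar-MP2D "],
  ["lr2", "LR-scikit"],
  ["lr", "LR-sgd"],
  ["nn", "NN"],
  ["-rf", "-RF"]
].freeze

# Apply each substitution once, in table order (String#sub with a String
# pattern replaces the first literal occurrence only).
def display_name(key)
  SUBSTITUTIONS.reduce(key) { |name, (from, to)| name.sub(from, to) }
end
```

Assuming summary keys of the form `tensorflow-lr2.v3` or `lazar-high-confidence`, `display_name` would yield `Tensorflow-LR-scikit` and `lazar-MP2D (high confidence)` respectively.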
diff --git a/scripts/summaries2table.rb b/scripts/summary2table.rb
index 99b84a1..555097c 100755
--- a/scripts/summaries2table.rb
+++ b/scripts/summary2table.rb
@@ -9,7 +9,7 @@ when "R"
   header = ["RF","SVM","DL"]
   keys = header.collect{|h| "R-"+h}
 when "tensorflow"
-  header = ["RF","LR (SGD)","LR (SCIKIT)","NN"]
+  header = ["RF","LR-sgd","LR-scikit","NN"]
   keys = ["rf","lr","lr2","nn"].collect{|n| "tensorflow-"+n+".v3"}
 when "lazar"
   header = ["MP2D", "PaDEL"]