From 7bbe4c444523f281d07f79aa8d0a4719668c3c80 Mon Sep 17 00:00:00 2001
From: Christoph Helma <helma@in-silico.ch>
Date: Sat, 20 Mar 2021 00:14:10 +0100
Subject: manuscript update

---
 mutagenicity.md | 609 +++++++++++++++++++-------------------------------------
 1 file changed, 203 insertions(+), 406 deletions(-)

(limited to 'mutagenicity.md')

diff --git a/mutagenicity.md b/mutagenicity.md
index 3939d31..5a01ee9 100644
--- a/mutagenicity.md
+++ b/mutagenicity.md
@@ -1,5 +1,5 @@
 ---
-title: A comparison of nine machine learning models based on an expanded mutagenicity dataset and their application for predicting pyrrolizidine alkaloid mutagenicity
+title: A comparison of nine machine learning mutagenicity models and their application for predicting pyrrolizidine alkaloids
 
 author:
   - Christoph Helma:
@@ -8,23 +8,26 @@ author:
       correspondence: "yes"
   - Verena Schöning:
       institute: insel
+  - Jürgen Drewe:
+      institute: zeller, unibas
   - Philipp Boss:
       institute: sysbio
-  - Jürgen Drewe:
-      institute: zeller
 
 institute:
   - ist:
       name: in silico toxicology gmbh
       address: "Rastatterstrasse 41, 4057 Basel, Switzerland"
   - zeller: 
-      name: Zeller AG
+      name: Max Zeller Söhne AG
       address: "Seeblickstrasse 4, 8590 Romanshorn, Switzerland"
   - sysbio:
       name: Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association
       address: "Robert-Rössle-Strasse 10, Berlin, 13125, Germany"
+  - unibas:
+      name: Clinical Pharmacology, Department of Pharmaceutical Sciences, University Hospital Basel, University of Basel
+      address: "Petersgraben 4, 4031 Basel, Switzerland"
   - insel:
-      name: Clinical Pharmacology and Toxicology, Department of General Internal Medicine, Bern University Hospital, University of Bern
+      name: Clinical Pharmacology and Toxicology, Department of General Internal Medicine, University Hospital Bern, University of Bern
       address: "Inselspital, 3010 Bern, Switzerland"
 
 bibliography: bibliography.bib
@@ -44,16 +47,13 @@ Abstract
 
 Random forest, support vector machine, logistic regression, neural networks and
 k-nearest neighbor (`lazar`) algorithms, were applied to new *Salmonella*
-mutagenicity dataset with {{cv.n_uniq}} unique chemical structures.
-<!--
-The best prediction
-accuracies in 10-fold-crossvalidation were obtained with `lazar` models and
-MolPrint2D descriptors, that gave accuracies
-({{cv.lazar-high-confidence.acc_perc}}%)
-similar to the interlaboratory variability of the Ames test.
--->
-
-**TODO**: PA results
+mutagenicity dataset with {{cv.n_uniq}} unique chemical structures utilizing
+MolPrint2D and Chemistry Development Kit (CDK) descriptors.  Crossvalidation
+accuracies of all investigated models ranged from 80-85%, which is comparable
+with the interlaboratory variability of the *Salmonella* mutagenicity assay.
+Pyrrolizidine alkaloid predictions showed a clear distinction between chemical
+groups, where Otonecines had the highest proportion of positive mutagenicity
+predictions and Monoesters the lowest.
 
 Introduction
 ============
@@ -154,13 +154,13 @@ without further processing. To achieve consistency with these
 datasets, EFSA compounds were classified as mutagenic, if at least one
 positive result was found for TA98 or T100 Salmonella strains.
 
-Dataset merges were based on unique SMILES (*Simplified Molecular Input
-Line Entry Specification*) strings of the compound structures.
-Duplicated experimental data with the same outcome was merged into a
-single value, because it is likely that it originated from the same
-experiment. Contradictory results were kept as multiple measurements in
-the database. The combined training dataset contains {{cv.n_uniq}} unique
-structures and {{cv.n}} individual measurements.
+Dataset merges were based on unique SMILES (*Simplified Molecular Input Line
+Entry Specification*, @Weininger1989) strings of the compound structures.
+Duplicated experimental data with the same outcome was merged into a single
+value, because it is likely that it originated from the same experiment.
+Contradictory results were kept as multiple measurements in the database. The
+combined training dataset contains {{cv.n_uniq}} unique structures and {{cv.n}}
+individual measurements.
 
 Source code for all data download, extraction and merge operations is publicly
 available from the git repository <https://git.in-silico.ch/mutagenicity-paper>
@@ -215,10 +215,10 @@ basically the chemical concept of functional groups.
 
 In contrast to predefined lists of fragments (e.g. FP3, FP4 or MACCs
 fingerprints) or descriptors (e.g CDK) they are generated dynamically from
-chemical structures. This has the advantage that they can capture unknown substructures
-of toxicological relevance that are not included in other descriptors. In addition they
-allow the efficient calculation of 
-chemical similarities (e.g. Tanimoto indices) with simple set operations.
+chemical structures. This has the advantage that they can capture unknown
+substructures of toxicological relevance that are not included in other
+descriptors. In addition they allow the efficient calculation of chemical
+similarities (e.g. Tanimoto indices) with simple set operations.
 
 MolPrint2D fingerprints were calculated with the OpenBabel cheminformatics
 library (@OBoyle2011a). They can be obtained from the following locations:
@@ -233,13 +233,6 @@ library (@OBoyle2011a). They can be obtained from the following locations:
   - sparse representation (<https://git.in-silico.ch/mutagenicity-paper/tree/pyrrolizidine-alkaloids/mp2d/fingerprints.mp2d>)
   - descriptor matrix (<https://git.in-silico.ch/mutagenicity-paper/tree/pyrrolizidine-alkaloids/mp2d/pa-fingerprints.csv.gz>)
 
-<!--
-Using them as descriptors for global
-models leads however to huge, sparsely populated matrices that cannot be
-handled with traditional machine learning algorithms. In our experiments none
-of the R and Tensorflow algorithms was capable to use them as descriptors.
--->
-
 #### Chemistry Development Kit (*CDK*) descriptors
 
 Molecular 1D and 2D descriptors were calculated with the PaDEL-Descriptors
@@ -259,28 +252,6 @@ The same procedure was applied for the pyrrolizidine dataset yielding
  {{pa.cdk.n_descriptors}} descriptors for {{pa.cdk.n_compounds}}
 compounds. CDK features for pyrrolizidine alkaloids are available at  <https://git.in-silico.ch/mutagenicity-paper/tree/pyrrolizidine-alkaloids/cdk/PA-Padel-2D_m2.csv>.
 
-<!--
-During feature selection, descriptors with near zero variance were removed
-using '*NearZeroVar*'-function (package 'caret'). If the percentage of the most
-common value was more than 90% or when the frequency ratio of the most common
-value to the second most common value was greater than 95:5 (e.g. 95 instances
-of the most common value and only 5 or less instances of the second most common
-value), a descriptor was classified as having a near zero variance. After that,
-highly correlated descriptors were removed using the
-'*findCorrelation*'-function (package 'caret') with a cut-off of 0.9. This
-resulted in a training dataset with 516 descriptors. These descriptors were
-scaled to be in the range between 0 and 1 using the '*preProcess*'-function
-(package 'caret'). The scaling routine was saved in order to apply the same
-scaling on the testing dataset. As these three steps did not consider the
-dependent variable (experimental mutagenicity), it was decided that they do not
-need to be included in the cross-validation of the model. To further reduce the
-number of features, a LASSO (*least absolute shrinkage and selection operator*)
-regression was performed using the '*glmnet*'-function (package '*glmnet*').
-The reduced dataset was used for the generation of the pre-trained models.
-
-CDK descriptors were used in global (RF, SVM, LR, NN) and local (`lazar`) models.
--->
-
 Algorithms
 ----------
 
@@ -308,6 +279,14 @@ QSAR (*Quantitative structure--activity relationship*) modelling.
 Algorithms used within this study are described in the following
 sections.
 
+#### Feature preprocessing
+
+MolPrint2D features were used without preprocessing. Near zero variance and
+strongly correlated CDK descriptors were removed and the remaining descriptor
+values were centered and scaled. Preprocessing was performed with the R `caret`
+`preProcess` function using the methods "nzv", "corr", "center" and "scale" with
+default settings.
+
 #### Neighbour identification
 
 Utilizing this modularity, similarity calculations were based both on
@@ -332,17 +311,18 @@ threshold) and the number of predictable compounds (low threshold). As
 it is in many practical cases desirable to make predictions even in the
 absence of closely related neighbours, we follow a tiered approach:
 
--   First a similarity threshold of 0.5 is used to collect neighbours,
-    to create a local QSAR model and to make a prediction for the query
-    compound. This are predictions with *high confidence*.
+-   First a similarity threshold of 0.5 (MP2D/Tanimoto) or 0.9 (CDK/Cosine) is
+    used to collect neighbours, to create a local QSAR model and to make a
+    prediction for the query compound. These are predictions with *high
+    confidence*.
 
--   If any of these steps fails, the procedure is repeated with a
-    similarity threshold of 0.2 and the prediction is flagged with a
-    warning that it might be out of the applicability domain of the
-    training data (*low confidence*).
+-   If any of these steps fails, the procedure is repeated with a similarity
+    threshold of 0.2 (MP2D/Tanimoto) or 0.7 (CDK/Cosine) and the prediction is
+    flagged with a warning that it might be out of the applicability domain of
+    the training data (*low confidence*).
 
--   Similarity thresholds of 0.5 and 0.2 are the default values chosen
-    by the software developers and remained unchanged during the
+-   These similarity thresholds are the default values chosen
+    by the software developers and remained unchanged during the
     course of these experiments.
 
 Compounds with the same structure as the query structure are
@@ -377,6 +357,17 @@ as more distant from the applicability domain (*low confidence*). Quantitative
 applicability domain information can be obtained from the similarities of
 individual neighbours.
 
+#### Validation
+
+10-fold cross validation was performed for model evaluation.
+
+#### Pyrrolizidine alkaloid predictions
+
+For the prediction of pyrrolizidine alkaloids, models were generated with the
+MP2D and CDK training datasets. The complete feature set was used for MP2D
+predictions; for CDK predictions the intersection between training and
+pyrrolizidine alkaloid features was used.
+
 #### Availability
 
   - Source code for this manuscript (GPL3):
@@ -391,107 +382,56 @@ individual neighbours.
   - Public web interface:
     <https://lazar.in-silico.ch>
 
-<!--
-### R Random Forest, Support Vector Machines, and Deep Learning
-
-The RF, SVM, and DL models were generated using the R
-software (R-project for Statistical Computing,
-<https://www.r-project.org/>*;* version 3.3.1), specific R packages used
-are identified for each step in the description below. 
-
-#### Random Forest (*RF*)
-
-For the RF model, the '*randomForest*'-function (package
-'*randomForest*') was used. A forest with 1000 trees with maximal
-terminal nodes of 200 was grown for the prediction.
-
-#### Support Vector Machines (*SVM*)
-
-The '*svm*'-function (package 'e1071') with a *radial basis function
-kernel* was used for the SVM model.
-
-**TODO**: **Verena, Phillip** Sollen wir die DL Modelle ebenso wie die Tensorflow als Neural Nets (NN) bezeichnen?
-
-#### Deep Learning
-
-The DL model was generated using the '*h2o.deeplearning*'-function
-(package '*h2o*'). The DL contained four hidden layer with 70, 50, 50,
-and 10 neurons, respectively. Other hyperparameter were set as follows:
-l1=1.0E-7, l2=1.0E-11, epsilon = 1.0E-10, rho = 0.8, and quantile\_alpha
-= 0.5. For all other hyperparameter, the default values were used.
-Weights and biases were in a first step determined with an unsupervised
-DL model. These values were then used for the actual, supervised DL
-model.
-
-To validate these models, an internal cross-validation approach was
-chosen. The training dataset was randomly split in training data, which
-contained 95% of the data, and validation data, which contain 5% of the
-data. A feature selection with LASSO on the training data was performed,
-reducing the number of descriptors to approximately 100. This step was
-repeated five times. Based on each of the five different training data,
-the predictive models were trained and the performance tested with the
-validation data. This step was repeated 10 times. 
-
-![Flowchart of the generation and validation of the models generated in R-project](figures/image1.png){#fig:valid}
-
-#### Applicability domain
-
-**TODO**: **Verena**: Mit welchen Deskriptoren hast Du den Jaccard index berechnet?  Fuer den Jaccard index braucht man binaere Deskriptoren (zB MP2D), mit PaDEL Deskriptoren koennte man zB eine euklidische oder cosinus Distanz berechnen.
-
-The AD of the training dataset and the PA dataset was evaluated using
-the Jaccard distance. A Jaccard distance of '0' indicates that the
-substances are similar, whereas a value of '1' shows that the substances
-are different. The Jaccard distance was below 0.2 for all PAs relative
-to the training dataset. Therefore, PA dataset is within the AD of the
-training dataset and the models can be used to predict the genotoxic
-potential of the PA dataset.
-
-#### Availability
-
-R scripts for these experiments can be found in https://git.in-silico.ch/mutagenicity-paper/tree/scripts/R.
--->
-
 ### Tensorflow models
 
-**TODO**: **Philipp** Kannst Du bitte die folgenden Absaetze ergaenzen und die Vorgangsweise fuer MP2D/CDK bzw CV/PA Vorhersagen beschreiben.
+#### Feature preprocessing
 
-<!--
-Data pre-processing was done by rank transformation using the
-'*QuantileTransformer*' procedure. A sequential model has been used.
-Four layers have been used: input layer, two hidden layers (with 12, 8
-and 8 nodes, respectively) and one output layer. For the output layer, a
-sigmoidal activation function and for all other layers the ReLU
-('*Rectified Linear Unit*') activation function was used. Additionally,
-a L^2^-penalty of 0.001 was used for the input layer. For training of
-the model, the ADAM algorithm was used to minimise the cross-entropy
-loss using the default parameters of Keras. Training was performed for
-100 epochs with a batch size of 64. The model was implemented with
-Python 3.6 and Keras. 
-
-**TODO**: **Philipp** Ich hab die alten Ergebnisse mit feature selection weggelassen, ist das ok? Dann muesste auch dieser Absatz gestrichen werden, oder?
-
-Alternatively, a DL model was established with Python-based Tensorflow
-program (<https://www.tensorflow.org/>) using the high-level API Keras
-(<https://www.tensorflow.org/guide/keras>) to build the models. 
-
-Tensorflow models used the same CDK descriptors as the R models.
--->
+For preprocessing of the CDK features we used a quantile transformation 
+to a uniform distribution. MP2D features were not preprocessed.
 
 #### Random forests (*RF*)
 
+For the random forest classifier we used the parameters
+`n_estimators=1000` and `max_leaf_nodes=200`. For the other parameters we
+used the scikit-learn default values.
+
 #### Logistic regression (SGD) (*LR-sgd*)
 
+For the logistic regression we used an ensemble of five trained models.
+For each model we used a batch size of 64 and trained for 50 epochs. ADAM
+was chosen as the optimizer. For the other parameters we used the
+tensorflow default values.
+
 #### Logistic regression (scikit) (*LR-scikit*)
 
+For the logistic regression we used the scikit-learn default parameter
+values.
+
 #### Neural Nets (*NN*)
 
+For the neural network we used an ensemble of five trained models. For
+each model we used a batch size of 64 and trained for 50 epochs. ADAM was
+chosen as the optimizer. The neural network had 4 hidden layers with
+64 nodes each and a ReLU activation function. For the other parameters
+we used the tensorflow default values.
+
 #### Support vector machines (*SVM*)
 
-Validation
-----------
+We used the SVM implementation in scikit-learn with the parameters
+`kernel='rbf'` and `gamma='scale'`. For the other parameters we used the
+scikit-learn default values.
+
+#### Validation
 
 10-fold cross-validation was used for all Tensorflow models.
 
+#### Pyrrolizidine alkaloid predictions
+
+For the prediction of pyrrolizidine alkaloids we trained the models described
+above on the training data. For training and prediction we used only the
+features in the intersection of the features from the training data and the
+pyrrolizidine alkaloids.
+
 #### Availability
 
 Jupyter notebooks for these experiments can be found at the following locations
@@ -548,32 +488,22 @@ https://git.in-silico.ch/mutagenicity-paper/tree/crossvalidations/confusion-matr
 individual predictions can be found in
 https://git.in-silico.ch/mutagenicity-paper/tree/crossvalidations/predictions/.
 
-With exception of lazar/CDK all investigated algorithm/descriptor combinations
+All investigated algorithm/descriptor combinations
 give accuracies between (80 and 85%) which is equivalent to the experimental
 variability of the *Salmonella typhimurium* mutagenicity bioassay (80-85%,
 @Benigni1988). Sensitivities and specificities are balanced in all of
 these models.
 
-<!--
-The most accurate crossvalidation predictions have been obtained with standard
-`lazar` models using MolPrint2D descriptors ({{cv.lazar-high-confidence.acc}}
-for predictions with high confidence, {{cv.lazar-all.acc}} for all
-predictions). Models utilizing CDK descriptors have generally lower
-accuracies ranging from {{cv.R-DL.acc}} (R deep learning) to {{cv.R-RF.acc}}
-(R/Tensorflow random forests). Sensitivity and specificity is generally well
-balanced with the exception of `lazar`-CDK (low sensitivity) and R deep
-learning (low specificity) models.
--->
-
 Pyrrolizidine alkaloid mutagenicity predictions 
 -----------------------------------------------
 
-Mutagenicity predictions from all investigated models for {{pa.n}}
-pyrrolizidine alkaloids (PAs) can be downloaded from
+Mutagenicity predictions of {{pa.n}} pyrrolizidine alkaloids (PAs) from all
+investigated models can be downloaded from
 <https://git.in-silico.ch/mutagenicity-paper/tree/pyrrolizidine-alkaloids/pa-predictions.csv>.
 A visual representation of all PA predictions can be found at
 <https://git.in-silico.ch/mutagenicity-paper/tree/pyrrolizidine-alkaloids/pa-predictions.pdf>.
 
+<!--
 @tbl:pa-mp2d and @tbl:pa-cdk summarise the outcome of pyrrolizidine alkaloid predictions from all models with MolPrint2D and CDK descriptors.
 
 | Model  | mutagenic | non-mutagenic | Nr. predictions |
@@ -599,11 +529,19 @@ A visual representation of all PA predictions can be found at
 | SVM | {{pa.cdk_svm.mut_perc}}% ({{pa.cdk_svm.mut}}) | {{pa.cdk_svm.non_mut_perc}}% ({{pa.cdk_svm.non_mut}}) | {{pa.cdk_svm.n_perc}}% ({{pa.cdk_svm.n}}) |
 
 : Summary of CDK pyrrolizidine alkaloid predictions {#tbl:pa-cdk}
+-->
 
-@fig:dhp - @fig:tert display the proportion of positive mutagenicity predictions from all models for the different pyrrolizidine alkaloid groups.
+@fig:pa-groups displays the proportion of positive mutagenicity predictions
+from all models for the different pyrrolizidine alkaloid groups. Tensorflow
+models predicted all {{pa.n}} pyrrolizidine alkaloids, `lazar` MP2D models
+predicted {{pa.mp2d_lazar_all.n}} compounds
+({{pa.mp2d_lazar_high_confidence.n}} with high confidence) and `lazar` CDK
+models {{pa.cdk_lazar_all.n}} compounds ({{pa.cdk_lazar_high_confidence.n}}
+with high confidence).
 
-![Summary of Dehydropyrrolizidine predictions](figures/Dehydropyrrolizidine.png){#fig:dhp}
+![Summary of pyrrolizidine alkaloid predictions](figures/pa-groups.png){#fig:pa-groups}
 
+<!--
 ![Summary of Diester predictions](figures/Diester.png){#fig:die}
 
 ![Summary of Macrocyclic-diester predictions](figures/Macrocyclic.diester.png){#fig:mcdie}
@@ -619,23 +557,65 @@ A visual representation of all PA predictions can be found at
 ![Summary of Retronecine predictions](figures/Retronecine.png){#fig:ret}
 
 ![Summary of Tertiary PA predictions](figures/Tertiary.PA.png){#fig:tert}
-
+-->
 
 For the visualisation of the position of pyrrolizidine alkaloids in respect to
 the training data set we have applied t-distributed stochastic neighbor
-embedding (t-SNE, @Maaten2008) for MolPrint2D and CDK descriptors.  t-SNE
-maps each high-dimensional object (chemical) to a two-dimensional point,
-maintaining the high-dimensional distances of the objects. Similar objects are
-represented by nearby points and dissimilar objects are represented by distant
-points.
+embedding (t-SNE, @Maaten2008) for MolPrint2D and CDK descriptors.  t-SNE maps
+each high-dimensional object (chemical) to a two-dimensional point, maintaining
+the high-dimensional distances of the objects. Similar objects are represented
+by nearby points and dissimilar objects are represented by distant points.
+t-SNE coordinates were calculated with the R `Rtsne` package using the default
+settings (perplexity = 30, theta = 0.5, max_iter = 1000).
+
+@fig:tsne-mp2d shows the t-SNE of pyrrolizidine alkaloids (PA) and the
+mutagenicity training data in MP2D space (Tanimoto/Jaccard similarity), which
+essentially reflects the structural diversity of the investigated compounds.
+
+![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA) in MP2D space](figures/tsne-mp2d-mutagenicity.png){#fig:tsne-mp2d}
+
+@fig:tsne-cdk shows the t-SNE of pyrrolizidine alkaloids (PA) and the
+mutagenicity training data in CDK space (Euclidean distance), which essentially
+reflects the physical-chemical properties of the investigated compounds.
+
+![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA) in CDK space](figures/tsne-cdk-mutagenicity.png){#fig:tsne-cdk}
 
-@fig:tsne-mp2d shows the t-SNE of pyrrolizidine alkaloids (PA) and the mutagenicity training data in MP2D space (Tanimoto/Jaccard similarity).
+@fig:tsne-mp2d-rf and @fig:tsne-cdk-lazar-all depict two example pyrrolizidine alkaloid
+mutagenicity predictions in the context of training data. t-SNE visualisations of all investigated models can be downloaded from <https://git.in-silico.ch/mutagenicity-paper/figures>.
 
-![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA)](figures/tsne-mp2d-mutagenicity.png){#fig:tsne-mp2d}
+<!--
+![t-SNE visualisation of all MP2D lazar predictions](figures/tsne-mp2d-lazar-all-classifications.png){#fig:tsne-mp2d-lazar-all}
+
+![t-SNE visualisation of MP2D lazar high-confidence predictions](figures/tsne-mp2d-lazar-high-confidence-classifications.png){#fig:tsne-mp2d-lazar-high-confidence}
+
+![t-SNE visualisation of MP2D logistic regression (sgd) predictions](figures/tsne-mp2d-lr-classifications.png){#fig:tsne-mp2d-lr}
+
+![t-SNE visualisation of MP2D logistic regression (scikit) predictions](figures/tsne-mp2d-lr2-classifications.png){#fig:tsne-mp2d-lr2}
+
+![t-SNE visualisation of MP2D neural network predictions](figures/tsne-mp2d-nn-classifications.png){#fig:tsne-mp2d-nn}
+-->
 
-@fig:tsne-cdk shows the t-SNE of pyrrolizidine alkaloids (PA) and the mutagenicity training data in CDK space (Euclidean similarity).
+![t-SNE visualisation of MP2D random forest predictions](figures/tsne-mp2d-rf-classifications.png){#fig:tsne-mp2d-rf}
+
+<!--
+![t-SNE visualisation of MP2D support vector machine predictions](figures/tsne-mp2d-svm-classifications.png){#fig:tsne-mp2d-svm}
+-->
+
+![t-SNE visualisation of all CDK lazar predictions](figures/tsne-cdk-lazar-all-classifications.png){#fig:tsne-cdk-lazar-all}
+
+<!--
+![t-SNE visualisation of CDK lazar high-confidence predictions](figures/tsne-cdk-lazar-high-confidence-classifications.png){#fig:tsne-cdk-lazar-high-confidence}
 
-![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA)](figures/tsne-cdk-mutagenicity.png){#fig:tsne-cdk}
+![t-SNE visualisation of CDK logistic regression (sgd) predictions](figures/tsne-cdk-lr-classifications.png){#fig:tsne-cdk-lr}
+
+![t-SNE visualisation of CDK logistic regression (scikit) predictions](figures/tsne-cdk-lr2-classifications.png){#fig:tsne-cdk-lr2}
+
+![t-SNE visualisation of CDK neural network predictions](figures/tsne-cdk-nn-classifications.png){#fig:tsne-cdk-nn}
+
+![t-SNE visualisation of CDK random forest predictions](figures/tsne-cdk-rf-classifications.png){#fig:tsne-cdk-rf}
+
+![t-SNE visualisation of CDK support vector machine predictions](figures/tsne-cdk-svm-classifications.png){#fig:tsne-cdk-svm}
+-->
 
 Discussion
 ==========
@@ -657,63 +637,39 @@ Algorithms
 structures for a given compound and calculates the prediction based on the
 experimental data for these structures. The QSAR literature calls such models
 frequently *local models*, because models are generated specifically for each
-query compound. The investigated tensorflow models are in contrast *global models*, i.e. a
-single model is used to make predictions for all compounds. It has been
-postulated in the past, that local models are more accurate, because they can
-account better for mechanisms, that affect only a subset of the training data.
-
-@tbl:cv-mp2d, @tbl:cv-cdk and @fig:roc show that all models with the exception
-of lazar-CDK have similar crossvalidation accuracies that are comparable to the
-experimental variability of the *Salmonella typhimurium* mutagenicity bioassay
-(80-85% according to @Benigni1988). All of these models have balanced
-sensitivity (true position rate) and specificity (true negative rate) and
-provide highly significant concordance with experimental data (as determined by
-McNemar's Test). This is a clear indication that *in-silico* predictions can be
-as reliable as the bioassays. Given that the variability of experimental data
-is similar to model variability it is impossible to decide which model gives
-the most accurate predictions, as models with higher accuracies (e.g. NN-CDK)
-might just approximate experimental errors better than more robust models.
-
-`lazar` predictions with CDK descriptors are a notable exception, as it has a
-much lower overall accuracy ({{lazar_all_cdk.acc}}) than all other models.
-`lazar` uses basically a k-nearest-neighbor (with variable k) and it seems that
-CDK descriptors are not very well suited for chemical similarity calculations.
-We have confirmed this independently by validating k-nn models from the `R
-caret` package, which give also sub-par accuracies (data not shown).
-
-@fig:tsne-cdk is another indication that similarity calculations with CDK
-descriptors are not as useful as fingerprint based similarities, because it
-shows a less clearer separation between chemical classes and
-mutagens/non-mutagens than @fig:tsne-mp2d.  It seems that more complex models
-than simple k-nn are required to utilize CDK descriptors efficiently.
+query compound. The investigated tensorflow models are in contrast *global
+models*, i.e. a single model is used to make predictions for all compounds. It
+has been postulated in the past that local models are more accurate, because
+they can account better for mechanisms that affect only a subset of the
+training data.
+
+@tbl:cv-mp2d, @tbl:cv-cdk and @fig:roc show that the crossvalidation accuracies
+of all models are comparable to the experimental variability of the *Salmonella
+typhimurium* mutagenicity bioassay (80-85% according to @Benigni1988). All of
+these models have balanced sensitivity (true positive rate) and specificity
+(true negative rate) and provide highly significant concordance with
+experimental data (as determined by McNemar's test). This is a clear indication
+that *in-silico* predictions can be as reliable as the bioassays. Given that
+the variability of experimental data is similar to model variability, it is
+impossible to decide which model gives the most accurate predictions, as models
+with higher accuracies might just approximate experimental errors better than
+more robust models.
 
 Our results do not support the assumption that local models are superior to
 global models for classification purposes. For regression models (lowest
 observed effect level) we have found however that local models may outperform
 global models (@Helma2018) with accuracies similar to experimental variability.
 
-<!--
-@tbl:lazar, @tbl:R, @tbl:tensorflow and @fig:roc show that the standard `lazar` algorithm (with MP2D
-fingerprints) give the most accurate crossvalidation results. R Random Forests,
-Support Vector Machines and Tensorflow models have similar accuracies with
-balanced sensitivity (true position rate) and specificity (true negative rate).
-`lazar` models with CDK descriptors have low sensitivity and R Deep Learning
-models have low specificity.
-
-The accuracy of `lazar` *in-silico* predictions are comparable to the
-interlaboratory variability of the Ames test (80-85% according to
-@Benigni1988), especially for predictions with high confidence
-({{cv.lazar-high-confidence.acc_perc}}%).
-
-The lowest number of predictions ({{cv.lazar-padel-high-confidence.n}}) has been
-obtained from `lazar`-CDK high confidence predictions, the largest number of
-predictions comes from Tensorflow models ({{cv.tensorflow-rf.v3.n}}). Standard
-`lazar` give a slightly lower number of predictions ({{cv.lazar-all.n}}) than R
-and Tensorflow models. This is not necessarily a disadvantage, because `lazar`
-abstains from predictions, if the query compound is very dissimilar from the
-compounds in the training set and thus avoids to make predictions for compounds
-out of the applicability domain. 
--->
+As all investigated algorithms give similar accuracies, the selection will
+depend more on practical considerations than on intrinsic properties. Nearest
+neighbor algorithms like `lazar` have the practical advantage that the
+rationales for individual predictions can be presented in a straightforward
+manner that is understandable without a background in statistics or machine
+learning (@fig:lazar). This allows a critical examination of individual
+predictions and prevents blind trust in models that are opaque to users
+with a toxicological background.
+
+![Lazar screenshot of 12,21-Dihydroxy-4-methyl-4,8-secosenecinonan-8,11,16-trione mutagenicity prediction](figures/lazar-screenshot.png){#fig:lazar}
 
 Descriptors
 -----------
@@ -728,31 +684,15 @@ descriptors are used to determine chemical similarities in the default `lazar`
 settings, and previous experiments have shown, that they give more accurate
 results than predefined fingerprints (e.g.  MACCS, FP2-4).
 
-<!--
-In order to investigate, if MP2D fingerprints are also suitable for global
-models we have tried to build R and Tensorflow models, both with and without
-unsupervised feature selection. Unfortunately none of the algorithms was
-capable to deal with the large and sparsely populated descriptor matrix.  Based
-on this result we can conclude, that MolPrint2D descriptors are at the moment
-unsuitable for standard global machine learning algorithms.
-
-`lazar` does not suffer from the size and sparseness problem, because (a) it
-utilizes internally a much more efficient occurrence based representation and
-(b) it uses fingerprints only for similarity calculations and not as model
-parameters.
--->
-
 *Chemistry Development Kit* (CDK, @Willighagen2017) descriptors 
 were calculated with the PaDEL graphical interface (@Yap2011). They include 
 1D and 2D topological descriptors as well as physical-chemical properties.
 
-With exception of `lazar` all investigated algorithms obtained models within
-the experimental variability for both types of descriptors. As discussed before
-CDK descriptors seem to be less suitable for chemical similarity calculations
-than MolPrint2D descriptors.
+All investigated algorithms obtained models within the experimental variability
+for both types of descriptors (@tbl:cv-mp2d, @tbl:cv-cdk, @fig:roc).
 
 Given that similar predictive accuracies are obtainable from both types of
-descriptors the choice depends more on practical considerations:
+descriptors the choice depends once more on practical considerations:
 
 MolPrint2D fragments can be calculated very efficiently for every well defined
 chemical structure with OpenBabel (@OBoyle2011a). CDK descriptor calculations
@@ -771,43 +711,12 @@ efficient. Due to the large number of substructures present in training
 compounds, they lead however to large and sparsely populated datasets, if they
 have to be expanded to a binary matrix (e.g. as input for tensorflow models).
 CDK descriptors contain in contrast in every case matrices with
-{{cv.cdk.n_descriptors}} columns.
-
-<!--
-
-**TODO**: **Verena** could you please briefly describe the descriptors once more
-
-*CDK* descriptors were used for `lazar`, R and Tensorflow models.  All models
-based on CDK descriptors had similar crossvalidation accuracies that were
-significantly lower than `lazar` MolPrint2D results.  Direct comparisons are
-available only for the `lazar` algorithm, and also in this case CDK
-accuracies were lower than MolPrint2D accuracies.
-
-Based on `lazar` results we can conclude, that CDK descriptors are less
-suited for chemical similarity calculations than MP2D descriptors. It is also
-likely that CDK descriptors lead to less accurate predictions for global
-models, but we cannot draw any definitive conclusion in the absence of MP2D
-models.
-Our results seem to support this assumption, because standard `lazar` models
-with MolPrint2D descriptors perform better than global models. The accuracy of
-`lazar` models with CDK descriptors is however substantially lower and
-comparable to global models with the same descriptors.
-
-This observation may lead to the conclusion that the choice of suitable
-descriptors is more important for predictive accuracy than the modelling
-algorithm, but we were unable to obtain global MP2D models for direct
-comparisons.  The selection of an appropriate modelling algorithm is still
-crucial, because it needs the capability to handle the descriptor space.
-Neighbour (and thus similarity) based algorithms like `lazar` have a clear
-advantage in this respect over global machine learning algorithms (e.g. RF, SVM,
-LR, NN), because Tanimoto/Jaccard similarities can be calculated efficiently
-with simple set operations. 
--->
+{{cv.cdk.n_descriptors}} columns, which can cause substantial computational overhead.
 
 Pyrrolizidine alkaloid mutagenicity predictions
 -----------------------------------------------
 
-@fig:dhp - @fig:tert show a clear differentiation between the different
+@fig:pa-groups shows a clear differentiation between the different
 pyrrolizidine alkaloid groups. The largest proportion of mutagenic predictions
 was observed for Otonecines {{pa.groups.Otonecine.mut_perc}}%
 ({{pa.groups.Otonecine.mut}}/{{pa.groups.Otonecine.n_pred}}), the lowest for
@@ -821,24 +730,26 @@ specificities in crossvalidation experiments some of the models (MPD-RF, CDK-RF
 and CDK-SVM) predict a lower number of mutagens
 ({{pa.cdk_rf.mut_perc}}-{{pa.mp2d_rf.mut_perc}}%) than the majority of the
 models ({{pa.mp2d_svm.mut_perc}}-{{pa.mp2d_lazar_high_confidence.mut_perc}}%
-@tbl:pa-mp2d, @tbl:pa-cdk, @fig:dhp - @fig:tert).
+@fig:pa-groups). lazar-CDK, on the other hand,
+predicts the largest number of mutagens for all groups with the exception of
+Otonecines.
 
-From a practical point we still have to face the question, how to choose model predictions, if no experimental data is available (we found two PAs in the training data, but this number is too low, to draw any general conclusions). 
+These differences between predictions from different algorithms and descriptors
+were not expected based on crossvalidation results.
 
-<!--
-`lazar` models with MolPrint2D descriptors predicted {{pa.lazar.mp2d.all.n_perc}}%
-of the pyrrolizidine alkaloids (PAs) ({{pa.lazar.mp2d.high_confidence.n_perc}}%
-with high confidence), the remaining compounds are not within its applicability
-domain. All other models predicted 100% of the 602 compounds, indicating that
-all compounds are within their applicability domain.
-
-Mutagenicity predictions from different models show little agreement in general
-(table 4). 42 from 602 PAs have non-conflicting predictions (all of them
-non-mutagenic).  Most models predict predominantly a non-mutagenic outcome for
-PAs, with exception of the R deep learning (DL) and the Tensorflow Scikit
-logistic regression models ({{pa.tf.dl.mut_perc}} and
-{{pa.tf.lr_scikit.mut_perc}}% positive predictions). 
+In order to investigate whether any of the investigated models show systematic
+errors in the vicinity of pyrrolizidine alkaloids, we have performed a
+detailed t-SNE analysis of all models (see @fig:tsne-mp2d-rf and
+@fig:tsne-cdk-lazar-all for two examples; all visualisations can be found at
+<https://git.in-silico.ch/mutagenicity-paper/figures>).
+
+Nevertheless, none of the models showed obvious deviations from their expected
+behaviour, so the reason for the disagreement between some of the models
+remains unclear at the moment. It is, however, perfectly possible that some
+systematic errors are covered up by converting high-dimensional spaces to two
+coordinates and are thus invisible in t-SNE visualisations.
 
+<!--
 non-conflicting CIDs
 43040
 186980
@@ -896,122 +807,8 @@ This is probably the reason why CDK models predicted all instances and the MP2D
 Differences between MP2D and CDK descriptors can be explained by their specific properties: CDK calculates a fixed set of descriptors for all structures, while MolPrint2D descriptors resemble substructures that are present in a compound. For this reason there is no fixed number of MP2D descriptors, the descriptor space are all unique substructures of the training set. If a query compound contains new substructures, this is immediately reflected in a lower similarity to training compounds, which makes applicability domain estimations very straightforward. With CDK (or any other predefined descriptors), the same set of descriptors is calculated for every compound, even if a compound comes from an completely new chemical class. 
 
 From a practical point we still have to face the question, how to choose model predictions, if no experimental data is available (we found two PAs in the training data, but this number is too low, to draw any general conclusions). Based on crossvalidation results and the arguments in favor of MolPrint2D descriptors we would put the highest trust in `lazar` MolPrint2D predictions, especially in high-confidence predictions. `lazar` predictions have a accuracy comparable to experimental variability (@Helma2018) for compounds within the applicability domain. But they should not be trusted blindly. For practical purposes it is important to study the rationales (i.e. neighbors and their experimental activities) for each prediction of relevance. A freely accessible GUI for this purpose has been implemented at https://lazar.in-silico.ch.
-
-
-**TODO**: **Verena** If you want to discuss specific lazar results, I can compile detailed predictions (with similar compounds and their activities) for individual examples
-
-Due to the low to moderate predictivity of all models, quantitative
-statement on the genotoxicity of single PAs cannot be made with
-sufficient confidence.
-
-The predictions of the SVM model did not fit with the other models or
-literature, and are therefore not further considered in the discussion.
 -->
 
-**TODO**: **Verena** Here is an old text of yours for recycling:
-
-Necic acid
-
-The rank order of the necic acid is comparable in the four models
-considered (LAZAR, RF and DL (R-project and Tensorflow). PAs from the
-monoester type had the lowest genotoxic potential, followed by PAs from
-the open-ring diester type. PAs with macrocyclic diesters had the
-highest genotoxic potential. The result fit well with current state of
-knowledge: in general, PAs, which have a macrocyclic diesters as necic
-acid, are considered more toxic than those with an open-ring diester or
-monoester [EFSA 2011](#_ENREF_36)[Fu et al. 2004](#_ENREF_45)[Ruan et
-al. 2014b](#_ENREF_115)(; ; ).
-
-Necine base
-
-The rank order of necine base is comparable in LAZAR, RF, and DL
-(R-project) models: with platynecine being less or as genotoxic as
-retronecine, and otonecine being the most genotoxic. In the
-Tensorflow-generate DL model, platynecine also has the lowest genotoxic
-probability, but are then followed by the otonecines and last by
-retronecine. These results partly correspond to earlier published
-studies. Saturated PAs of the platynecine-type are generally accepted to
-be less or non-toxic and have been shown in *in vitro* experiments to
-form no DNA-adducts [Xia et al. 2013](#_ENREF_139)(). Therefore, it is
-striking, that 1,2-unsaturated PAs of the retronecine-type should have
-an almost comparable genotoxic potential in the LAZAR and DL (R-project)
-model. In literature, otonecine-type PAs were shown to be more toxic
-than those of the retronecine-type [Li et al. 2013](#_ENREF_80)().
-
-Modifications of necine base
-
-The group-specific results of the Tensorflow-generated DL model appear
-to reflect the expected relationship between the groups: the low
-genotoxic potential of *N*-oxides and the highest potential of
-dehydropyrrolizidines [Chen et al. 2010](#_ENREF_26)().
-
-In the LAZAR model, the genotoxic potential of dehydropyrrolizidines
-(DHP) (using the extended AD) is comparable to that of tertiary PAs.
-Since, DHP is regarded as the toxic principle in the metabolism of PAs,
-and known to produce protein- and DNA-adducts [Chen et al.
-2010](#_ENREF_26)(), the LAZAR model did not meet this expectation it
-predicted the majority of DHP as being not genotoxic. However, the
-following issues need to be considered. On the one hand, all DHP were
-outside of the stricter AD of 0.5. This indicates that in general, there
-might be a problem with the AD. In addition, DHP has two unsaturated
-double bounds in its necine base, making it highly reactive. DHP and
-other comparable molecules have a very short lifespan, and usually
-cannot be used in *in vitro* experiments. This might explain the absence
-of suitable neighbours in LAZAR.
-
-Furthermore, the probabilities for this substance groups needs to be
-considered, and not only the consolidated prediction. In the LAZAR
-model, all DHPs had probabilities for both outcomes (genotoxic and not
-genotoxic) mainly below 30%. Additionally, the probabilities for both
-outcomes were close together, often within 10% of each other. The fact
-that for both outcomes, the probabilities were low and close together,
-indicates a lower confidence in the prediction of the model for DHPs.
-
-In the DL (R-project) and RF model, *N*-oxides have a by far more
-genotoxic potential that tertiary PAs or dehydropyrrolizidines. As PA
-*N*-oxides are easily conjugated for extraction, they are generally
-considered as detoxification products, which are *in vivo* quickly
-renally eliminated [Chen et al. 2010](#_ENREF_26)(). On the other hand,
-*N*-oxides can be also back-transformed to the corresponding tertiary PA
-[Wang et al. 2005](#_ENREF_134)(). Therefore, it may be questioned,
-whether *N*-oxides themselves are generally less genotoxic than the
-corresponding tertiary PAs. However, in the groups of modification of
-the necine base, dehydropyrrolizidine, the toxic principle of PAs,
-should have had the highest genotoxic potential. Taken together, the
-predictions of the modifications of the necine base from the LAZAR, RF
-and R-generated DL model cannot - in contrast to the Tensorflow DL
-model - be considered as reliable.
-
-Overall, when comparing the prediction results of the PAs to current
-published knowledge, it can be concluded that the performance of most
-models was low to moderate. This might be contributed to the following
-issues:
-
-1.  In the LAZAR model, only 26.6% PAs were within the stricter AD. With
-    the extended AD, 92.3% of the PAs could be included in the
-    prediction. Even though the Jaccard distance between the training
-    dataset and the PA dataset for the RF, SVM, and DL (R-project and
-    Tensorflow) models was small, suggesting a high similarity, the
-    LAZAR indicated that PAs have only few local neighbours, which might
-    adversely affect the prediction of the mutagenic potential of PAs.
-
-2.  All above-mentioned models were used to predict the mutagenicity of
-    PAs. PAs are generally considered to be genotoxic, and the mode of
-    action is also known. Therefore, the fact that some models predict
-    the majority of PAs as not genotoxic seems contradictory. To
-    understand this result, the basis, the training dataset, has to be
-    considered. The mutagenicity of in the training dataset are based on
-    data of mutagenicity in bacteria. There are some studies, which show
-    mutagenicity of PAs in the AMES test [Chen et al.
-    2010](#_ENREF_26)(). Also, [Rubiolo et al. (1992)](#_ENREF_116)
-    examined several different PAs and several different extracts of
-    PA-containing plants in the AMES test. They found that the AMES test
-    was indeed able to detect mutagenicity of PAs, but in general,
-    appeared to have a low sensitivity. The pre-incubation phase for
-    metabolic activation of PAs by microsomal enzymes was the
-    sensitivity-limiting step. This could very well mean that this is
-    also reflected in the QSAR models.
-
 Conclusions
 ===========
 
-- 
cgit v1.2.3