---
title: A comparison of twelve machine learning models based on an expanded mutagenicity dataset and their application for predicting pyrrolizidine alkaloid mutagenicity
# TODO check # algorithms

#title: A comparison of random forest, support vector machine, linear regression, deep learning and lazar algorithms for predicting the mutagenic potential of different pyrrolizidine alkaloids 
#subtitle: Performance comparison with a new expanded dataset
author:
  - Christoph Helma:
      institute: ist
      email: helma@in-silico.ch
      correspondence: "yes"
  - Verena Schöning:
      institute: zeller
  - Philipp Boss:
      institute: sysbio
  - Jürgen Drewe:
      institute: zeller
institute:
  - ist:
      name: in silico toxicology gmbh
      address: "Rastatterstrasse 41, 4057 Basel, Switzerland"
  - zeller: 
      name: Zeller AG
      address: "Seeblickstrasse 4, 8590 Romanshorn, Switzerland"
  - sysbio:
      name: Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association
      address: "Robert-Rössle-Strasse 10, Berlin, 13125, Germany"
bibliography: bibliography.bib
keywords: mutagenicity, QSAR, lazar, random forest, support vector machine, deep learning

documentclass: scrartcl
tblPrefix: Table
figPrefix: Figure
header-includes:
    - \usepackage{setspace}
    - \doublespacing
    - \usepackage{lineno}
    - \linenumbers
...

Abstract
========

<!---
Random forest, support vector machine, linear regression, deep learning and k-nearest neighbor
(`lazar`) algorithms were applied to a new *Salmonella* mutagenicity dataset
with 8309 unique chemical structures. The best prediction accuracies in
10-fold crossvalidation were obtained with `lazar` models, which gave accuracies
similar to the interlaboratory variability of the Ames test.
--->

Introduction
============

TODO

<!---
Pyrrolizidine alkaloids (PAs) are secondary plant ingredients found in
many plant species as protection against predators (Hartmann & Witte
1995; Langel et al. 2011). PAs are ester alkaloids, which are composed
of a necine base (two fused five-membered rings joined by a nitrogen
atom) and one or two necic acids (carboxylic ester arms). The necine
base can have different structures and thereby divides PAs into several
structural groups, e.g. otonecine, platynecine, and retronecine. The
structural groups of the necic acid are macrocyclic diester, open-ring
diester and monoester (Langel et al. 2011).

PAs are mainly metabolised in the liver, which is at the same time the
main target organ of toxicity (Bull & Dick 1959; Bull et al. 1958;
Butler et al. 1970; DeLeve et al. 1996; Jago 1971; Li et al. 2011;
Neumann et al. 2015). There are three principal metabolic pathways for
1,2-unsaturated PAs (Chen et al. 2010): (i) Detoxification by
hydrolysis: the ester bonds at positions C7 and C9 are hydrolysed by
non-specific esterases to release necine base and necic acid, which are
then subjected to further phase II conjugation and excretion. (ii)
Detoxification by *N*-oxidation of the necine base (only possible for
retronecine-type PAs): the nitrogen is oxidised to form PA *N*-oxides,
which can be conjugated by phase II enzymes, e.g. with glutathione, and
then excreted. PA *N*-oxides can be converted back into the
corresponding parent PA (Wang et al. 2005). (iii) Metabolic activation
or toxification: PAs are metabolically activated/toxified by oxidation
(for retronecine-type PAs) or oxidative *N*-demethylation (for
otonecine-type PAs, Lin 1998). This pathway is mainly catalysed by
cytochrome P450 isoforms CYP2B and 3A (Ruan et al. 2014b), and results
in the formation of dehydropyrrolizidines (DHP, also known as pyrrolic
esters or reactive pyrroles). DHPs are highly reactive and cause damage
in the cells where they are formed, usually hepatocytes. However, they
can also pass from the hepatocytes into the adjacent sinusoids and
damage the endothelial lining cells (Gao et al. 2015), predominantly by
reaction with proteins, lipids and DNA. There is even evidence that
conjugation of DHP to glutathione, which would generally be considered
a detoxification step, could result in reactive metabolites, which
might also lead to DNA adduct formation (Xia et al. 2015). Due to their
ability to form DNA adducts, DNA crosslinks and DNA breaks,
1,2-unsaturated PAs are generally considered genotoxic and carcinogenic
(Chen et al. 2010; EFSA 2011; Fu et al. 2004; Li et al. 2011; Takanashi
et al. 1980; Yan et al. 2008; Zhao et al. 2012). Still, there is no
evidence yet that PAs are carcinogenic in humans (ANZFA 2001; EMA
2016). One general limitation of studies with PAs is the number of
different PAs investigated. Around 30 PAs are currently commercially
available, therefore all studies focus on these PAs. This is also true
for *in vitro* and *in vivo* tests on mutagenicity and genotoxicity. To
gain a wider perspective, in this study over 600 different PAs were
assessed for their mutagenic potential using four different machine
learning techniques.
--->

<!---

Mutagenicity datasets
Algorithms
descriptors
define abbreviations
pyrrolizidine 
--->

The main objectives of this study were

  - to generate a new training dataset, by combining the most comprehensive public mutagenicity datasets
  - to compare the performance of global models (RF, SVM, LR, NN) with local models (`lazar`)
  - to compare the performance of MolPrint2D fingerprints with PaDEL descriptors
  - to apply these models for the prediction of pyrrolizidine alkaloid mutagenicity

Materials and Methods
=====================

Data
----

### Mutagenicity training data

An identical training dataset was used for all models. The
training dataset was compiled from the following sources:

-   Kazius/Bursi Dataset (4337 compounds, @Kazius2005): <http://cheminformatics.org/datasets/bursi/cas_4337.zip>

-   Hansen Dataset (6513 compounds, @Hansen2009): <http://doc.ml.tu-berlin.de/toxbenchmark/Mutagenicity_N6512.csv>

-   EFSA Dataset (695 compounds, @EFSA2016): <https://data.europa.eu/euodp/data/storage/f/2017-0719T142131/GENOTOX%20data%20and%20dictionary.xls>

Mutagenicity classifications from the Kazius and Hansen datasets were used
without further processing. To achieve consistency with these
datasets, EFSA compounds were classified as mutagenic if at least one
positive result was found for the TA98 or TA100 *Salmonella* strains.
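The EFSA classification rule can be illustrated with a minimal Python sketch. The record layout (a list of `(strain, result)` pairs per compound), the function name `classify_efsa` and the handling of compounds without TA98/TA100 results are assumptions made for illustration; they are not taken from the published extraction scripts.

```python
# Hypothetical record layout: one list of (strain, result) pairs per compound.
RELEVANT_STRAINS = {"TA98", "TA100"}

def classify_efsa(results):
    """Classify one compound from its Ames test results.

    Returns "mutagenic" if at least one TA98/TA100 result is positive,
    "non-mutagenic" if TA98/TA100 results exist but none is positive,
    and None if no TA98/TA100 result is available (assumed behaviour).
    """
    relevant = [result for strain, result in results if strain in RELEVANT_STRAINS]
    if not relevant:
        return None
    return "mutagenic" if "positive" in relevant else "non-mutagenic"

# Invented example records, not real EFSA data.
example = [("TA98", "negative"), ("TA100", "positive"), ("TA1535", "negative")]
label = classify_efsa(example)
```

Results for other strains are ignored in this sketch, so a compound that is positive only in another strain would not be labelled mutagenic.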

Dataset merges were based on unique SMILES (*Simplified Molecular Input
Line Entry Specification*) strings of the compound structures.
Duplicated experimental data with the same outcome was merged into a
single value, because it is likely that it originated from the same
experiment. Contradictory results were kept as multiple measurements in
the database. The combined training dataset contains 8309 unique
structures.
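The merge rule described above can be sketched in a few lines of Python. This is an illustration only: it assumes that SMILES strings have already been canonicalised, that outcomes are coded as 1 (mutagenic) and 0 (non-mutagenic), and the function name `merge_datasets` is hypothetical. The actual merge scripts are in the git repository.

```python
from collections import defaultdict

def merge_datasets(*datasets):
    """Merge (smiles, outcome) records from several sources.

    Identical outcomes for the same SMILES collapse into a single value,
    contradictory outcomes are kept as multiple measurements.
    """
    merged = defaultdict(list)
    for records in datasets:
        for smiles, outcome in records:
            if outcome not in merged[smiles]:
                merged[smiles].append(outcome)
    return dict(merged)

# Toy records with invented outcomes (1 = mutagenic, 0 = non-mutagenic).
source_a = [("c1ccccc1N", 1), ("CCO", 0)]
source_b = [("c1ccccc1N", 1), ("CCO", 1)]
merged = merge_datasets(source_a, source_b)
```

In this toy example the first structure ends up with a single measurement, while the second keeps both contradictory measurements.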

Source code for all data download, extraction and merge operations is publicly
available from the git repository <https://git.in-silico.ch/mutagenicity-paper>
under a GPL3 License. The new combined dataset can be found at
<https://git.in-silico.ch/mutagenicity-paper/data/mutagenicity.csv>.

### Pyrrolizidine alkaloid (PA) dataset

The testing dataset consisted of 602 different PAs. The compilation of
the PA dataset is described in detail in [Schöning et al.
(2017)](#_ENREF_119).

TODO: **Verena** sources and selection criteria

<!---
The PAs were assigned to groups according to
structural features of the necine base and necic acid.

For the necine base, the following groups were assigned:

-   Retronecine-type (1,2-unsaturated necine base)

-   Otonecine-type (1,2-unsaturated necine base)

-   Platynecine-type (1,2-saturated necine base)

For the modification of the necine base, the following groups were assigned:

-   *N*-oxide-type

-   Tertiary-type (PAs which were neither of the *N*-oxide- nor
    of the DHP-type)

-   DHP-type (dehydropyrrolizidine, pyrrolic ester)

For the necic acid, following groups were assigned:

-   Monoester-type

-   Open-ring diester-type

-   Macrocyclic diester-type
--->

Descriptors
-----------

### MolPrint2D fingerprints (*MP2D*)

MolPrint2D fingerprints (@Bender2004) use atom environments as molecular
representation. For each atom in a molecule they record the atom types of
its connected atoms to represent its chemical environment. This basically
resembles the chemical concept of functional groups.
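The idea of atom environments can be illustrated with a simplified Python sketch. It is loosely modelled on the MolPrint2D concept and is not the OpenBabel implementation: element symbols stand in for the atom types, the molecule is given as a hand-written bond list, and the string format of the environments, the `max_depth` parameter and the function name are assumptions.

```python
from collections import Counter, deque

def atom_environments(atom_types, bonds, max_depth=2):
    """Return one environment string per atom.

    Each string lists the central atom type followed by the counts of atom
    types found at topological distance 1..max_depth (depth-count-type).
    """
    adjacency = {i: [] for i in range(len(atom_types))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    environments = set()
    for start in adjacency:
        # Breadth-first search gives the topological distance to each atom.
        depth = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            if depth[atom] == max_depth:
                continue
            for neighbour in adjacency[atom]:
                if neighbour not in depth:
                    depth[neighbour] = depth[atom] + 1
                    queue.append(neighbour)
        layers = Counter((d, atom_types[i]) for i, d in depth.items() if d > 0)
        description = ";".join(f"{d}-{n}-{t}" for (d, t), n in sorted(layers.items()))
        environments.add(f"{atom_types[start]};{description}")
    return environments

# Ethanol heavy atoms: C-C-O
ethanol = atom_environments(["C", "C", "O"], [(0, 1), (1, 2)])
```

Because the environments are derived from the structure itself, any substructure present in the training data yields a feature, which is the property contrasted with predefined fragment lists in the text.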

In contrast to predefined lists of fragments (e.g. FP3, FP4 or MACCS
fingerprints) or descriptors (e.g. PaDEL) they are generated dynamically from
chemical structures. This has the advantage that they can capture substructures
of toxicological relevance that are not included in other descriptors. 

Chemical similarities (e.g. Tanimoto indices) can be calculated very
efficiently with MolPrint2D fingerprints. Using them as descriptors for global
models leads, however, to huge, sparsely populated matrices that cannot be
handled with traditional machine learning algorithms. In our experiments none
of the R and Tensorflow algorithms was capable of using them as descriptors.

MolPrint2D fingerprints were calculated with the OpenBabel cheminformatics
library (@OBoyle2011a).

### PaDEL descriptors

For R and Tensorflow models, molecular 1D and 2D descriptors were calculated
with the PaDEL-Descriptor program (<http://www.yapcwsoft.com> version 2.21, @Yap2011). 

As the training dataset contained 8309 instances, it was decided to
delete instances with missing values during data pre-processing.
Furthermore, substances with equivocal outcome were removed. The final
training dataset contained 8080 instances with known mutagenic
potential.

During feature
selection, descriptors with near-zero variance were removed using the
'*nearZeroVar*'-function (package 'caret'). A descriptor was classified
as having near-zero variance if the percentage of the most common value
was more than 90% or if the frequency ratio of the most common value to
the second most common value was greater than 95:5 (e.g. 95 instances of
the most common value and only 5 or fewer instances of the second most
common value). After that, highly correlated descriptors were
removed using the '*findCorrelation*'-function (package 'caret') with a
cut-off of 0.9. This resulted in a training dataset with 516
descriptors. These descriptors were scaled to the range between 0
and 1 using the '*preProcess*'-function (package 'caret'). The scaling
routine was saved in order to apply the same scaling to the testing
dataset. As these three steps did not consider the outcome, it was
decided that they do not need to be included in the cross-validation of
the model. To further reduce the number of features, a LASSO (*least
absolute shrinkage and selection operator*) regression was performed
using the '*glmnet*'-function (package '*glmnet*'). The reduced dataset
was used for the generation of the pre-trained models.
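The three unsupervised pre-processing steps can be sketched in plain Python as follows. This is not a port of the caret functions: the correlation filter is a simple greedy variant (caret's '*findCorrelation*' uses a different selection rule), the near-zero variance test follows the two criteria as worded above, and all function names and data values are invented for illustration. The LASSO step is not shown.

```python
from collections import Counter
from math import sqrt

def near_zero_variance(column, max_share=0.90, freq_cut=95 / 5):
    """Flag a descriptor column if either criterion from the text is met."""
    counts = Counter(column).most_common(2)
    top = counts[0][1]
    if top / len(column) > max_share:
        return True
    return len(counts) > 1 and top / counts[1][1] > freq_cut

def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if one column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def drop_correlated(columns, cutoff=0.9):
    """Greedy filter: keep a column only if it is not highly correlated
    with any column kept so far (simplification, see lead-in)."""
    kept = []
    for column in columns:
        if all(abs(pearson(column, k)) <= cutoff for k in kept):
            kept.append(column)
    return kept

def fit_min_max(columns):
    """Store the scaling parameters so they can be reused on new data."""
    return [(min(c), max(c)) for c in columns]

def apply_min_max(columns, params):
    scaled = []
    for column, (low, high) in zip(columns, params):
        span = high - low
        scaled.append([(v - low) / span if span else 0.0 for v in column])
    return scaled

# Toy descriptor columns (invented values), one list per descriptor.
train = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.1],  # almost perfectly correlated with the first column
    [4.0, 1.0, 3.0, 2.0],
    [0.0, 0.0, 0.0, 0.0],  # constant, hence near-zero variance
]
filtered = [c for c in train if not near_zero_variance(c)]
selected = drop_correlated(filtered)
params = fit_min_max(selected)
scaled_train = apply_min_max(selected, params)
# The stored parameters are reused for (hypothetical) test data.
scaled_test = apply_min_max([[2.5], [5.5]], params)
```

Values in the test data can fall outside the range 0 to 1 after scaling, because the minimum and maximum are taken from the training data only.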

Algorithms
----------

### `lazar`

`lazar` (*lazy structure activity relationships*) is a modular framework
for read-across model development and validation. It uses the
following basic workflow: For a given chemical structure `lazar`:

-   searches in a database for similar structures (neighbours) with
    experimental data,

-   builds a local QSAR model with these neighbours and

-   uses this model to predict the unknown activity of the query
    compound.

This procedure resembles an automated version of read-across predictions
in toxicology. In machine learning terms it would be classified as a
k-nearest-neighbour algorithm.

Apart from this basic workflow, `lazar` is completely modular and allows
the researcher to use any algorithm for similarity searches and local
QSAR (*Quantitative structure--activity relationship*) modelling.
Algorithms used within this study are described in the following
sections.

#### Neighbour identification

Utilizing this modularity, similarity calculations were based both on
MolPrint2D fingerprints and on PaDEL descriptors.

For MolPrint2D fingerprints chemical similarity between two compounds $a$ and
$b$ is expressed as the proportion between atom environments common in both
structures $A \cap B$ and the total number of atom environments $A \cup B$
(Jaccard/Tanimoto index).

$$sim = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$$

For PaDEL descriptors chemical similarity between two compounds $a$ and $b$ is
expressed as the cosine similarity between the descriptor vectors $A$ for $a$
and $B$ for $b$.

$$sim = \frac{A \cdot B}{\lvert A \rvert \lvert B \rvert}$$
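Both similarity measures are easy to express in code. The following Python sketch implements the two formulas above for fingerprints given as sets of features and for descriptor vectors given as lists of numbers. The function names and the convention of returning 0.0 for empty inputs are assumptions; the feature strings are placeholders.

```python
from math import sqrt

def tanimoto(a, b):
    """Jaccard/Tanimoto index of two fingerprints given as sets of features."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(a, b):
    """Cosine similarity of two descriptor vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Placeholder features: two of four distinct features are shared.
sim_fingerprint = tanimoto({"f1", "f2", "f3"}, {"f1", "f2", "f4"})
# Orthogonal descriptor vectors have a cosine similarity of zero.
sim_descriptor = cosine([1.0, 0.0], [0.0, 1.0])
```

Note that the Tanimoto index only needs the features that are present in either compound, whereas the cosine similarity needs the complete descriptor vectors of both compounds.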


Threshold selection is a trade-off between prediction accuracy (high
threshold) and the number of predictable compounds (low threshold). As
it is in many practical cases desirable to make predictions even in the
absence of closely related neighbours, we follow a tiered approach:

-   First a similarity threshold of 0.5 is used to collect neighbours,
    to create a local QSAR model and to make a prediction for the query
    compound. These are predictions with *high confidence*.

-   If any of these steps fails, the procedure is repeated with a
    similarity threshold of 0.2 and the prediction is flagged with a
    warning that it might be out of the applicability domain of the
    training data (*low confidence*).

-   Similarity thresholds of 0.5 and 0.2 are the default values chosen
    by the software developers and remained unchanged during the
    course of these experiments.

Compounds with the same structure as the query structure are
automatically eliminated from neighbours to obtain unbiased predictions
in the presence of duplicates.
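The tiered neighbour search can be sketched as follows. This Python example is an illustration under several assumptions, not the `lazar` source code: fingerprints are plain sets, identical structures are detected by identical fingerprints, "failure" is reduced to "no neighbours found", the threshold comparison is inclusive, and the function names are hypothetical.

```python
def tanimoto(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def find_neighbours(query, training, thresholds=(0.5, 0.2)):
    """Return (neighbours, confidence) for a query fingerprint.

    training is a list of (fingerprint, activity) pairs. The first
    threshold yields high confidence neighbours, the second one low
    confidence neighbours. (None confidence means no prediction.)
    """
    for threshold, confidence in zip(thresholds, ("high", "low")):
        neighbours = []
        for fingerprint, activity in training:
            if fingerprint == query:
                continue  # identical structures are excluded
            similarity = tanimoto(query, fingerprint)
            if similarity >= threshold:  # inclusive comparison is an assumption
                neighbours.append((similarity, activity))
        if neighbours:
            return neighbours, confidence
    return [], None

# Toy training data with placeholder features and invented activities.
training = [
    ({"a", "b", "c"}, 1),
    ({"a", "b", "c", "d"}, 1),
    ({"a", "x", "y", "z"}, 0),
]
high = find_neighbours({"a", "b", "c"}, training)
low = find_neighbours({"a", "q"}, training)
none = find_neighbours({"q"}, training)
```

In the first query the identical training compound is skipped and one neighbour above 0.5 remains. The second query only finds neighbours after the threshold is lowered to 0.2, and the third query yields no prediction at all.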

#### Local QSAR models and predictions

Only similar compounds (neighbours) above the threshold are used for
local QSAR models. In this investigation, we are using a weighted
majority vote from the neighbours' experimental data for mutagenicity
classifications. Probabilities for both classes
(mutagenic/non-mutagenic) are calculated according to the following
formula and the class with the higher probability is used as prediction
outcome.

$$p_{c} = \frac{\sum \text{sim}_{n,c}}{\sum \text{sim}_{n}}$$

$p_{c}$ Probability of class c (e.g. mutagenic or non-mutagenic)\
$\sum \text{sim}_{n,c}$ Sum of similarities of neighbours with
class c\
$\sum \text{sim}_{n}$ Sum of similarities of all neighbours
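A minimal Python sketch of the similarity-weighted majority vote is given below. The function names, the class labels and the neighbour similarities are invented for illustration, and the handling of ties (the class encountered first wins) is an assumption.

```python
def class_probabilities(neighbours):
    """neighbours: list of (similarity, class_label) pairs."""
    total = sum(similarity for similarity, _ in neighbours)
    if not total:
        return {}
    sums = {}
    for similarity, label in neighbours:
        sums[label] = sums.get(label, 0.0) + similarity
    return {label: value / total for label, value in sums.items()}

def predict(neighbours):
    """Return the class with the highest probability and all probabilities."""
    probabilities = class_probabilities(neighbours)
    if not probabilities:
        return None, probabilities
    return max(probabilities, key=probabilities.get), probabilities

# Invented neighbours: two mutagenic and one non-mutagenic compound.
neighbours = [(0.8, "mutagenic"), (0.6, "non-mutagenic"), (0.6, "mutagenic")]
prediction, probabilities = predict(neighbours)
```

Here the mutagenic neighbours contribute 1.4 of the total similarity of 2.0, so the query compound is predicted as mutagenic with a probability of about 0.7.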

#### Applicability domain

The applicability domain (AD) of `lazar` models is determined by the
structural diversity of the training data. If no similar compounds are
found in the training data no predictions will be generated. Warnings
are issued if the similarity threshold had to be lowered from 0.5 to 0.2
in order to enable predictions. Predictions without warnings can be
considered as close to the applicability domain (*high confidence*) and predictions with
warnings as more distant from the applicability domain (*low confidence*). Quantitative
applicability domain information can be obtained from the similarities
of individual neighbours.

#### Availability

-   `lazar` experiments for this manuscript:
    <https://git.in-silico.ch/mutagenicity-paper>
    (source code, GPL3)

-   `lazar` framework:
    <https://git.in-silico.ch/lazar>
    (source code, GPL3)

-   `lazar` GUI:
    <https://git.in-silico.ch/lazar-gui>
    (source code, GPL3)

-   Public web interface:
    <https://lazar.in-silico.ch>

### R Random Forest, Support Vector Machines, and Deep Learning

The RF, SVM, and DL models were generated using the R
software (R-project for Statistical Computing,
<https://www.r-project.org/>; version 3.3.1). The specific R packages used
are identified for each step in the description below. 

#### Random Forest

For the RF model, the '*randomForest*'-function (package
'*randomForest*') was used. A forest with 1000 trees and a maximum of
200 terminal nodes per tree was grown for the prediction.

#### Support Vector Machines

The '*svm*'-function (package 'e1071') with a *radial basis function
kernel* was used for the SVM model.

#### Deep Learning

The DL model was generated using the '*h2o.deeplearning*'-function
(package '*h2o*'). The DL model contained four hidden layers with 70, 50, 50,
and 10 neurons, respectively. Other hyperparameters were set as follows:
l1=1.0E-7, l2=1.0E-11, epsilon = 1.0E-10, rho = 0.8, and quantile\_alpha
= 0.5. For all other hyperparameters, the default values were used.
Weights and biases were determined in a first step with an unsupervised
DL model. These values were then used for the actual, supervised DL
model.

TODO: **Verena** can you please check whether this is still correct and adapt Figure 1 if necessary

To validate these models, an internal cross-validation approach was
chosen. The training dataset was randomly split into training data, which
contained 95% of the data, and validation data, which contained 5% of the
data. A feature selection with LASSO on the training data was performed,
reducing the number of descriptors to approximately 100. This step was
repeated five times. Based on each of the five different training datasets,
the predictive models were trained and the performance tested with the
validation data. This step was repeated 10 times. 

![Flowchart of the generation and validation of the models generated in R-project](figures/image1.png){#fig:valid}

#### Applicability domain

TODO: **Verena**: Which descriptors did you use to calculate the Jaccard index? The Jaccard index requires binary descriptors (e.g. MP2D); with PaDEL descriptors one could calculate e.g. a Euclidean or cosine distance.

The AD of the training dataset and the PA dataset was evaluated using
the Jaccard distance. A Jaccard distance of '0' indicates that the
substances are similar, whereas a value of '1' shows that the substances
are different. The Jaccard distance was below 0.2 for all PAs relative
to the training dataset. Therefore, the PA dataset is within the AD of the
training dataset and the models can be used to predict the genotoxic
potential of the PA dataset.

#### Availability

R scripts for these experiments can be found in https://git.in-silico.ch/mutagenicity-paper/scripts/R.

### Tensorflow models

TODO: **Philipp** please complete

#### Logistic regression (SGD)

#### Logistic regression (scikit)

#### Random forests

#### Deep Learning

Alternatively, a DL model was established with the Python-based Tensorflow
framework (<https://www.tensorflow.org/>) using the high-level API Keras
(<https://www.tensorflow.org/guide/keras>) to build the models. 

Tensorflow models used the same PaDEL descriptors as the R models.

Data pre-processing was done by rank transformation using the
'*QuantileTransformer*' procedure. A sequential model with four layers
was used: an input layer and two hidden layers (with 12, 8
and 8 nodes, respectively) and one output layer. For the output layer, a
sigmoidal activation function and for all other layers the ReLU
('*Rectified Linear Unit*') activation function was used. Additionally,
an L^2^-penalty of 0.001 was used for the input layer. For training of
the model, the ADAM algorithm was used to minimise the cross-entropy
loss using the default parameters of Keras. Training was performed for
100 epochs with a batch size of 64. The model was implemented with
Python 3.6 and Keras. 

TODO: **Philipp** can you please check whether the description is still correct
and whether Verena's workflow (Figure 1) also applies to your models

#### Validation

10-fold cross-validation was used for all Tensorflow models.

#### Availability

Jupyter notebooks for these experiments can be found in https://git.in-silico.ch/mutagenicity-paper/scripts/tensorflow.

Results
=======

10-fold crossvalidations
------------------------

Crossvalidation results are summarized in the following tables: @tbl:lazar shows `lazar` results with MolPrint2D and PaDEL descriptors, @tbl:R R results and @tbl:tensorflow Tensorflow results.


```{#tbl:lazar .table file="tables/lazar-summary.csv" caption="Summary of lazar crossvalidation results (all predictions/high confidence predictions)"}
```

```{#tbl:R .table file="tables/r-summary.csv" caption="Summary of R crossvalidation results"}
```

```{#tbl:tensorflow .table file="tables/tensorflow-summary.csv" caption="Summary of tensorflow crossvalidation results"}
```

@fig:roc depicts the position of all crossvalidation results in receiver operating characteristic (ROC) space.

![ROC plot of crossvalidation results. *R-RF*: R Random Forests, *R-SVM*: R Support Vector Machines, *R-DL*: R Deep Learning, *TF*: Tensorflow without feature selection, *TF-FS*: Tensorflow with feature selection, *L*: lazar, *L-HC*: lazar high confidence predictions, *L-P*: lazar with PaDEL descriptors, *L-P-HC*: lazar PaDEL high confidence predictions (overlaps with L-P)](figures/roc.png){#fig:roc}

Confusion matrices for all models are available from the git repository http://git.in-silico.ch/mutagenicity-paper/10-fold-crossvalidations/confusion-matrices/; individual predictions can be found in 
http://git.in-silico.ch/mutagenicity-paper/10-fold-crossvalidations/predictions/.

The most accurate crossvalidation predictions have been obtained with `lazar` models with MolPrint2D descriptors ({{lazar-high-confidence.acc}} for predictions with high confidence, {{lazar-all.acc}} for all predictions). Models utilizing PaDEL descriptors generally have lower accuracies, ranging from TODO to TODO. Sensitivity and specificity are generally well balanced, with the exception of `lazar`-PaDEL (low sensitivity) and R deep learning (low specificity) models.
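For reference, the performance measures used in this section can be derived from a confusion matrix as in the following Python sketch. The function name and the counts are invented for illustration and do not correspond to any of the reported models. The sketch assumes that every denominator is larger than zero.

```python
def summarise(tp, fp, tn, fn):
    """Performance measures from confusion matrix counts.

    tp/fn: mutagens predicted as mutagenic/non-mutagenic,
    tn/fp: non-mutagens predicted as non-mutagenic/mutagenic.
    """
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "n": tp + fp + tn + fn,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),     # positive predictive value (precision)
        "npv": tn / (tn + fn),     # negative predictive value
        # Position in ROC space: (false positive rate, true positive rate)
        "roc_point": (1 - specificity, sensitivity),
    }

# Invented counts for illustration only.
summary = summarise(tp=40, fp=10, tn=40, fn=10)
```

A model on the diagonal of the ROC plot (false positive rate equal to true positive rate) performs no better than random guessing, and points closer to the upper left corner indicate better classifiers.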

<!--
| |R-RF | R-SVM | R-DL | TF | TF-FS | L | L-HC | L-P | L-P-HC|
|-|-----|-------|------|----|-------|---|------|------|--------|
|Accuracy|{{R-RF.acc}}|{{R-SVM.acc}}|{{R-DL.acc}}|{{tensorflow-all.acc}}|{{tensorflow-selected.acc}}|{{lazar-all.acc}}|{{lazar-high-confidence.acc}}|{{lazar-padel-all.acc}}|{{lazar-padel-high-confidence.acc}}|
|Sensitivity|{{R-RF.tpr}}|{{R-SVM.tpr}}|{{R-DL.tpr}}|{{tensorflow-all.tpr}}|{{tensorflow-selected.tpr}}|{{lazar-all.tpr}}|{{lazar-high-confidence.tpr}}|{{lazar-padel-all.tpr}}|{{lazar-padel-high-confidence.tpr}}|
|Specificity|{{R-RF.tnr}}|{{R-SVM.tnr}}|{{R-DL.tnr}}|{{tensorflow-all.tnr}}|{{tensorflow-selected.tnr}}|{{lazar-all.tnr}}|{{lazar-high-confidence.tnr}}|{{lazar-padel-all.tnr}}|{{lazar-padel-high-confidence.tnr}}|
|PPV|{{R-RF.ppv}}|{{R-SVM.ppv}}|{{R-DL.ppv}}|{{tensorflow-all.ppv}}|{{tensorflow-selected.ppv}}|{{lazar-all.ppv}}|{{lazar-high-confidence.ppv}}|{{lazar-padel-all.ppv}}|{{lazar-padel-high-confidence.ppv}}|
|NPV|{{R-RF.npv}}|{{R-SVM.npv}}|{{R-DL.npv}}|{{tensorflow-all.npv}}|{{tensorflow-selected.npv}}|{{lazar-all.npv}}|{{lazar-high-confidence.npv}}|{{lazar-padel-all.npv}}|{{lazar-padel-high-confidence.npv}}|
|Nr. predictions|{{R-RF.n}}|{{R-SVM.n}}|{{R-DL.n}}|{{tensorflow-all.n}}|{{tensorflow-selected.n}}|{{lazar-all.n}}|{{lazar-high-confidence.n}}|{{lazar-padel-all.n}}|{{lazar-padel-high-confidence.n}}|

: Summary of crossvalidation results. *R-RF*: R Random Forests, *R-SVM*: R Support Vector Machines, *R-DL*: R Deep Learning, *TF*: Tensorflow without feature selection, *TF-FS*: Tensorflow with feature selection, *L*: lazar, *L-HC*: lazar high confidence predictions, *L-P*: lazar with PaDEL descriptors, *L-P-HC*: lazar PaDEL high confidence predictions, *PPV*: Positive predictive value (Precision), *NPV*: Negative predictive value {#tbl:summary}

R Models
--------

### Random Forest

10-fold crossvalidation of the R-RF model gave an accuracy of
{{R-RF.acc_perc}}%, a sensitivity of {{R-RF.tpr_perc}}% and a specificity of
{{R-RF.tnr_perc}}%.  The confusion matrix for {{R-RF.n}}
predictions is provided in @tbl:R-RF.

```{#tbl:R-RF .table file="tables/R-RF.csv" caption="Confusion matrix for R Random Forest predictions"}
```

### Support Vector Machines

10-fold crossvalidation of the R-SVM model gave an accuracy of
{{R-SVM.acc_perc}}%, a sensitivity of {{R-SVM.tpr_perc}}% and a specificity of
{{R-SVM.tnr_perc}}%.  The confusion matrix for {{R-SVM.n}}
predictions is provided in @tbl:R-SVM.

```{#tbl:R-SVM .table file="tables/R-SVM.csv" caption="Confusion matrix for R Support Vector Machine predictions"}
```

### Deep Learning

10-fold crossvalidation of the R-DL model gave an accuracy of
{{R-DL.acc_perc}}%, a sensitivity of {{R-DL.tpr_perc}}% and a specificity of
{{R-DL.tnr_perc}}%.  The confusion matrix for {{R-DL.n}}
predictions is provided in @tbl:R-DL.

```{#tbl:R-DL .table file="tables/R-DL.csv" caption="Confusion matrix for R Deep Learning predictions"}
```

Tensorflow Models
-----------------

### Without feature selection

10-fold crossvalidation of the Tensorflow DL model gave an accuracy of
{{tensorflow-all.acc_perc}}%, a sensitivity of {{tensorflow-all.tpr_perc}}% and a specificity of
{{tensorflow-all.tnr_perc}}%.  The confusion matrix for {{tensorflow-all.n}}
predictions is provided in @tbl:tensorflow-all.

```{#tbl:tensorflow-all .table file="tables/tensorflow-all.csv" caption="Confusion matrix for Tensorflow predictions without feature selection"}
```

### With feature selection

10-fold crossvalidation of the Tensorflow model with feature selection gave an accuracy of
{{tensorflow-selected.acc_perc}}%, a sensitivity of {{tensorflow-selected.tpr_perc}}% and a specificity of
{{tensorflow-selected.tnr_perc}}%.  The confusion matrix for {{tensorflow-selected.n}}
predictions is provided in @tbl:tensorflow-selected.

```{#tbl:tensorflow-selected .table file="tables/tensorflow-selected.csv" caption="Confusion matrix for Tensorflow predictions with feature selection"}
```

`lazar` Models
--------------

### MolPrint2D Descriptors

10-fold crossvalidation of the lazar model with MolPrint2D descriptors gave an accuracy of
{{lazar-all.acc_perc}}%, a sensitivity of {{lazar-all.tpr_perc}}% and a specificity of
{{lazar-all.tnr_perc}}%. 
The confusion matrix for {{lazar-all.n}}
predictions is provided in @tbl:lazar-all.

```{#tbl:lazar-all .table file="tables/lazar-all.csv" caption="Confusion matrix for lazar predictions with MolPrint2D descriptors"}
```

Predictions with high confidence had an accuracy of
{{lazar-high-confidence.acc_perc}}%, a sensitivity of {{lazar-high-confidence.tpr_perc}}% and a specificity of
{{lazar-high-confidence.tnr_perc}}%. 
The confusion matrix for {{lazar-high-confidence.n}}
predictions is provided in @tbl:lazar-high-confidence.


```{#tbl:lazar-high-confidence .table file="tables/lazar-high-confidence.csv" caption="Confusion matrix for high confidence lazar predictions with MolPrint2D descriptors"}
```

### PaDEL Descriptors

10-fold crossvalidation of the lazar model with PaDEL descriptors gave an accuracy of
{{lazar-padel-all.acc_perc}}%, a sensitivity of {{lazar-padel-all.tpr_perc}}% and a specificity of
{{lazar-padel-all.tnr_perc}}%. 
The confusion matrix for {{lazar-padel-all.n}}
predictions is provided in @tbl:lazar-padel-all.

```{#tbl:lazar-padel-all .table file="tables/lazar-padel-all.csv" caption="Confusion matrix for lazar predictions with PaDEL descriptors" }
```

Predictions with high confidence had an accuracy of
{{lazar-padel-high-confidence.acc_perc}}%, a sensitivity of {{lazar-padel-high-confidence.tpr_perc}}% and a specificity of
{{lazar-padel-high-confidence.tnr_perc}}%. 
The confusion matrix for {{lazar-padel-high-confidence.n}}
predictions is provided in @tbl:lazar-padel-high-confidence.

```{#tbl:lazar-padel-high-confidence .table file="tables/lazar-padel-high-confidence.csv" caption="Confusion matrix for high confidence lazar predictions with PaDEL descriptors"}
```
-->

Pyrrolizidine alkaloid mutagenicity predictions 
-----------------------------------------------

Pyrrolizidine alkaloid mutagenicity predictions are summarized in @tbl:pa. 

@fig:tsne-mp2d shows the position of pyrrolizidine alkaloids (PA) in the mutagenicity training dataset in MP2D space.

![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA) in MP2D space](figures/tsne-mp2d.png){#fig:tsne-mp2d}

@fig:tsne-padel shows the position of pyrrolizidine alkaloids (PA) in the mutagenicity training dataset in PaDEL space.

![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA) in PaDEL space](figures/tsne-padel.png){#fig:tsne-padel}

Discussion
==========

Data
----

A new training dataset for *Salmonella* mutagenicity was created from three
different sources (@Kazius2005, @Hansen2009, @EFSA2016). It contains 8309
unique chemical structures, which is, to our knowledge, the largest
public mutagenicity dataset presently available. The new training data can be
downloaded from
<https://git.in-silico.ch/mutagenicity-paper/data/mutagenicity.csv>.

Model performance
-----------------

@tbl:summary and @fig:roc show that the standard `lazar` algorithm (with MP2D
fingerprints) gives the most accurate crossvalidation results. R Random Forests,
Support Vector Machines and Tensorflow models have similar accuracies with
balanced sensitivity (true positive rate) and specificity (true negative rate).
`lazar` models with PaDEL descriptors have low sensitivity and R Deep Learning
models have low specificity.

The accuracy of `lazar` *in-silico* predictions is comparable to the
interlaboratory variability of the Ames test (80-85% according to
@Benigni1988), especially for predictions with high confidence
({{lazar-high-confidence.acc_perc}}%). This is a clear indication that
*in-silico* predictions can be as reliable as the bioassays if the compounds
are close to the applicability domain. This conclusion is also supported by our
analysis of `lazar` lowest observed effect level predictions, which are also
similar to the experimental variability (@Helma2018).

The lowest number of predictions ({{lazar-padel-high-confidence.n}}) has been
obtained from `lazar`/PaDEL high confidence predictions, the largest number of
predictions comes from Tensorflow models ({{tensorflow-all.n}}). Standard
`lazar` gives a slightly lower number of predictions ({{lazar-all.n}}) than R
and Tensorflow models. This is not necessarily a disadvantage, because `lazar`
abstains from predictions if the query compound is very dissimilar from the
compounds in the training set and thus avoids making predictions for compounds
that do not fall into its applicability domain. 

There are two major differences between `lazar` and R/Tensorflow models, which
might explain the different prediction accuracies:

- `lazar` uses MolPrint2D fingerprints, while all other models use PaDEL descriptors
- `lazar` creates local models for each query compound and the other models use a single global model for all predictions

We will discuss both options in the following sections.

Descriptors
-----------

This study uses two types of descriptors to characterize chemical structures.

MolPrint2D fingerprints (MP2D, @Bender2004) use atom environments (i.e.
connected atoms for all atoms in a molecule) as molecular representation, which
basically resembles the chemical concept of functional groups. MP2D descriptors
are used to determine chemical similarities in `lazar`, and previous experiments
have shown that they give more accurate results than predefined descriptors
(e.g. MACCS, FP2-4) for all investigated endpoints.

PaDEL calculates topological and physical-chemical descriptors.

TODO: **Verena** can you please briefly describe the descriptors again

PaDEL descriptors were used for the R and Tensorflow models. In addition, we
have used PaDEL descriptors to calculate cosine similarities for the `lazar`
algorithm and compared the results with standard MP2D similarities, which led
to a significant decrease of `lazar` prediction accuracies. Based on this
result we can conclude that PaDEL descriptors are less suited for similarity
calculations than MP2D descriptors.

In order to investigate whether MP2D fingerprints are also a better option for
global models, we tried to build R and Tensorflow models both with and
without unsupervised feature selection. Unfortunately, none of the algorithms
was capable of dealing with the large and sparsely populated descriptor matrix.
Based on this result we can conclude that MP2D descriptors are at the moment
unsuitable for standard global machine learning algorithms. Please note that
`lazar` does not suffer from the sparseness problem, because (a) it internally
uses a much more efficient occurrence-based representation and (b) it
uses fingerprints only for similarity calculations and not as model parameters.
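
The sparseness problem can be made concrete with a small Python sketch. It is
an illustration under simplifying assumptions, not `lazar` code: fingerprints
are stored as sets of the features that occur in each compound (one possible
occurrence-based representation), and the feature labels are hypothetical.

```python
def to_dense_matrix(fingerprints):
    """Expand occurrence-based fingerprints into a binary compound x feature matrix."""
    features = sorted(set().union(*fingerprints))
    column = {feature: i for i, feature in enumerate(features)}
    matrix = []
    for fp in fingerprints:
        row = [0] * len(features)
        for feature in fp:
            row[column[feature]] = 1
        matrix.append(row)
    return features, matrix


# Hypothetical fingerprints: each compound lists only the features it contains
fingerprints = [
    {"env1", "env2"},
    {"env3", "env4", "env5"},
    {"env1", "env6"},
    {"env7"},
]

features, matrix = to_dense_matrix(fingerprints)

stored_occurrences = sum(len(fp) for fp in fingerprints)  # entries kept in the sets
dense_cells = len(matrix) * len(features)                 # cells in the dense matrix
fill_ratio = stored_occurrences / dense_cells             # fraction of ones
```

Every compound that introduces a new feature adds a column that is zero for
almost all other compounds, so the dense matrix grows much faster than the
number of stored occurrences.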

Based on these results we can conclude that PaDEL descriptors are less suited
for similarity calculations than MP2D fingerprints and that the standard
machine learning algorithms used in this study are currently not capable of
utilizing MP2D fingerprints.

Algorithms
----------

`lazar` is formally a *k-nearest-neighbor* algorithm that searches for similar
structures for a given compound and calculates the prediction based on the
experimental data for these structures. The QSAR literature frequently calls
such models *local models*, because models are generated specifically for each
query compound. R and Tensorflow models are in contrast *global models*, i.e. a
single model is used to make predictions for all compounds. It has been
postulated in the past that local models are more accurate, because they can
account better for mechanisms that affect only a subset of the training data.
Our results seem to support this assumption, because `lazar` models perform
better than global models. However, both types of models use different
descriptors, and for this reason we cannot draw a definitive conclusion whether
the model algorithm or the descriptor type is the reason for the observed
differences. In order to answer this question, we would have to use global
modelling algorithms that are capable of handling large, sparse binary matrices.
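
The idea of a local model can be sketched in a few lines of Python. This is a
simplified illustration and not the `lazar` algorithm itself: it assumes
Tanimoto similarities on set-based fingerprints, a similarity-weighted vote
among the neighbours, and a hypothetical similarity threshold `min_sim` below
which training compounds are ignored. The training data are invented.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) index of two fingerprints given as sets of features."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0


def predict_local(query_fp, training_set, min_sim=0.5):
    """Similarity-weighted nearest-neighbour vote for a single query compound.

    training_set: list of (fingerprint set, mutagenic True/False) pairs.
    Returns None (no prediction) if no training compound reaches min_sim.
    """
    neighbours = []
    for fp, mutagenic in training_set:
        sim = tanimoto(query_fp, fp)
        if sim >= min_sim:
            neighbours.append((sim, mutagenic))
    if not neighbours:
        return None  # query is treated as outside the applicability domain
    total = sum(sim for sim, _ in neighbours)
    p_mutagenic = sum(sim for sim, mutagenic in neighbours if mutagenic) / total
    return {
        "mutagenic": p_mutagenic >= 0.5,
        "p_mutagenic": p_mutagenic,
        "n_neighbours": len(neighbours),
    }


# Invented training data with hypothetical feature labels
training_set = [
    ({"env1", "env2", "env3"}, True),
    ({"env1", "env2", "env4"}, True),
    ({"env1", "env2", "env3", "env5"}, False),
    ({"env8", "env9"}, False),
]

similar_query = predict_local({"env1", "env2", "env3"}, training_set)
dissimilar_query = predict_local({"env10"}, training_set)
```

In this sketch the model is rebuilt from the neighbours of each query compound,
and a query without sufficiently similar neighbours receives no prediction. A
global model, in contrast, would be fitted once to the complete training set
and applied unchanged to every query compound.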

Mutagenicity of PAs
-------------------

Due to the low to moderate predictivity of all models, quantitative
statements on the genotoxicity of single PAs cannot be made with
sufficient confidence.

The predictions of the SVM model agreed neither with the other models nor
with the literature, and are therefore not considered further in the discussion.

Necic acid

The rank order of the necic acids is comparable in the four models
considered (LAZAR, RF and DL (R-project and Tensorflow)). PAs of the
monoester type had the lowest genotoxic potential, followed by PAs of
the open-ring diester type. PAs with macrocyclic diesters had the
highest genotoxic potential. These results fit well with the current state of
knowledge: in general, PAs that have a macrocyclic diester as necic
acid are considered more toxic than those with an open-ring diester or
monoester ([EFSA 2011](#_ENREF_36); [Fu et al. 2004](#_ENREF_45); [Ruan et
al. 2014b](#_ENREF_115)).

Necine base

The rank order of the necine bases is comparable in the LAZAR, RF, and DL
(R-project) models, with platynecine being less genotoxic than or as genotoxic
as retronecine, and otonecine being the most genotoxic. In the
Tensorflow-generated DL model, platynecine also has the lowest genotoxic
probability, but is followed by otonecine and finally by
retronecine. These results partly correspond to earlier published
studies. Saturated PAs of the platynecine type are generally accepted to
be less toxic or non-toxic and have been shown in *in vitro* experiments to
form no DNA adducts ([Xia et al. 2013](#_ENREF_139)). It is therefore
striking that 1,2-unsaturated PAs of the retronecine type should have
an almost comparable genotoxic potential in the LAZAR and DL (R-project)
models. In the literature, otonecine-type PAs were shown to be more toxic
than those of the retronecine type ([Li et al. 2013](#_ENREF_80)).

Modifications of necine base

The group-specific results of the Tensorflow-generated DL model appear
to reflect the expected relationship between the groups: a low
genotoxic potential of *N*-oxides and the highest potential of
dehydropyrrolizidines ([Chen et al. 2010](#_ENREF_26)).

In the LAZAR model, the genotoxic potential of dehydropyrrolizidines
(DHP) (using the extended AD) is comparable to that of tertiary PAs.
DHP is regarded as the toxic principle in the metabolism of PAs
and is known to produce protein and DNA adducts ([Chen et al.
2010](#_ENREF_26)). The LAZAR model did not meet this expectation, as it
predicted the majority of DHPs to be not genotoxic. However, the
following issues need to be considered. First, all DHPs were
outside of the stricter AD of 0.5. This indicates that there
might be a general problem with the AD. In addition, DHP has two
double bonds in its necine base, making it highly reactive. DHP and
other comparable molecules have a very short lifespan and usually
cannot be used in *in vitro* experiments. This might explain the absence
of suitable neighbours in LAZAR.

Furthermore, the probabilities for this substance group need to be
considered, and not only the consolidated prediction. In the LAZAR
model, all DHPs had probabilities for both outcomes (genotoxic and not
genotoxic) mainly below 30%. Additionally, the probabilities for both
outcomes were close together, often within 10% of each other. The fact
that the probabilities for both outcomes were low and close together
indicates a lower confidence in the predictions of the model for DHPs.
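
The reasoning in this paragraph can be written down as a simple rule of thumb.
The following Python sketch is purely illustrative: the function name and the
thresholds (both probabilities below 0.3, difference below 0.1) are assumptions
derived from the values mentioned above, not criteria implemented in LAZAR, and
the example probabilities are invented.

```python
def is_low_confidence(p_genotoxic, p_not_genotoxic, max_p=0.3, max_gap=0.1):
    """Flag a prediction whose two outcome probabilities are low and close together.

    max_p and max_gap are hypothetical thresholds chosen for illustration.
    """
    both_low = p_genotoxic < max_p and p_not_genotoxic < max_p
    close_together = abs(p_genotoxic - p_not_genotoxic) < max_gap
    return both_low and close_together


# Invented probabilities for two hypothetical compounds
ambiguous = is_low_confidence(0.22, 0.27)  # both low, small gap
clear_cut = is_low_confidence(0.05, 0.60)  # one outcome clearly favoured
```

A rule of this kind looks at both outcome probabilities rather than only at the
consolidated prediction, which is the point made above for the DHPs.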

In the DL (R-project) and RF models, *N*-oxides have a far higher
genotoxic potential than tertiary PAs or dehydropyrrolizidines. As PA
*N*-oxides are easily conjugated for excretion, they are generally
considered detoxification products, which are quickly
renally eliminated *in vivo* ([Chen et al. 2010](#_ENREF_26)). On the other hand,
*N*-oxides can also be transformed back to the corresponding tertiary PA
([Wang et al. 2005](#_ENREF_134)). Therefore, it may be questioned
whether *N*-oxides themselves are generally less genotoxic than the
corresponding tertiary PAs. However, among the modifications of
the necine base, dehydropyrrolizidine, the toxic principle of PAs,
should have had the highest genotoxic potential. Taken together, the
predictions for the modifications of the necine base from the LAZAR, RF
and R-generated DL models cannot be considered reliable, in contrast to
those of the Tensorflow DL model.

Overall, when comparing the prediction results for the PAs to current
published knowledge, it can be concluded that the performance of most
models was low to moderate. This might be attributed to the following
issues:

1.  In the LAZAR model, only 26.6% of the PAs were within the stricter AD.
    With the extended AD, 92.3% of the PAs could be included in the
    prediction. Even though the Jaccard distance between the training
    dataset and the PA dataset for the RF, SVM, and DL (R-project and
    Tensorflow) models was small, suggesting a high similarity,
    LAZAR indicated that PAs have only few local neighbours, which might
    adversely affect the prediction of the mutagenic potential of PAs.

2.  All above-mentioned models were used to predict the mutagenicity of
    PAs. PAs are generally considered to be genotoxic, and the mode of
    action is also known. Therefore, the fact that some models predict
    the majority of PAs as not genotoxic seems contradictory. To
    understand this result, the basis of the models, the training
    dataset, has to be considered. The data in the training dataset are
    based on mutagenicity in bacteria. Some studies show
    mutagenicity of PAs in the Ames test ([Chen et al.
    2010](#_ENREF_26)). Also, [Rubiolo et al. (1992)](#_ENREF_116)
    examined several different PAs and several different extracts of
    PA-containing plants in the Ames test. They found that the Ames test
    was indeed able to detect mutagenicity of PAs, but in general
    appeared to have a low sensitivity. The pre-incubation phase for
    metabolic activation of PAs by microsomal enzymes was the
    sensitivity-limiting step. This low sensitivity could very well
    be reflected in the QSAR models.


Conclusions
===========

A new public *Salmonella* mutagenicity training dataset with 8309 compounds was
created and used to train `lazar`, R and Tensorflow models. The best
performance was obtained with `lazar` models using MolPrint2D descriptors, with
prediction accuracies comparable to the interlaboratory variability of the Ames
test. Differences between algorithms (local vs. global models) and/or
descriptors (MolPrint2D vs PaDEL) may be responsible for the different
prediction accuracies. 

In this study, an attempt was made to predict the genotoxic potential of
PAs using five different machine learning techniques (LAZAR, RF, SVM, DL
(R-project and Tensorflow)). The results of all models agreed only partly
with the findings in the literature, with the best results obtained with the
Tensorflow DL model. Therefore, modelling allows statements on the
relative risks of genotoxicity of the different PA groups. Predictions
for individual PAs do not, however, appear reliable on the
basis of the current training dataset.

This study emphasises the importance of critical assessment of
predictions by QSAR models. This includes not only extensive literature
research to assess the plausibility of the predictions, but also a good
knowledge of the metabolism of the test substances and an understanding of
possible mechanisms of toxicity.

In further studies, additional machine learning techniques or a modified
(extended) training dataset should be used for an additional attempt to
predict the genotoxic potential of PAs.


References
==========