---
title: A comparison of nine machine learning models based on an expanded mutagenicity dataset and their application for predicting pyrrolizidine alkaloid mutagenicity
author:
  - Christoph Helma:
      institute: ist
      email: helma@in-silico.ch
      correspondence: "yes"
  - Verena Schöning:
      institute: zeller
  - Philipp Boss:
      institute: sysbio
  - Jürgen Drewe:
      institute: zeller
institute:
  - ist:
      name: in silico toxicology gmbh
      address: "Rastatterstrasse 41, 4057 Basel, Switzerland"
  - zeller:
      name: Zeller AG
      address: "Seeblickstrasse 4, 8590 Romanshorn, Switzerland"
  - sysbio:
      name: Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association
      address: "Robert-Rössle-Strasse 10, Berlin, 13125, Germany"
bibliography: bibliography.bib
keywords: mutagenicity, QSAR, lazar, random forest, support vector machine, linear regression, neural nets, deep learning
documentclass: scrartcl
tblPrefix: Table
figPrefix: Figure
header-includes:
  - \usepackage{lineno, setspace, color, colortbl, longtable}
  - \doublespacing
  - \linenumbers
...

Abstract
========

Random forest, support vector machine, logistic regression, neural network and k-nearest-neighbour (`lazar`) algorithms were applied to a new *Salmonella* mutagenicity dataset with 8309 unique chemical structures. The best prediction accuracies in 10-fold cross-validation were obtained with `lazar` models and MolPrint2D descriptors, which gave accuracies ({{cv.lazar-high-confidence.acc_perc}}%) similar to the interlaboratory variability of the Ames test.

**TODO**: PA results

Introduction
============

**TODO**: rationale for investigation

The main objectives of this study were

- to generate a new mutagenicity training dataset by combining the most comprehensive public datasets,
- to compare the performance of MolPrint2D (*MP2D*) fingerprints with PaDEL descriptors,
- to compare the performance of global QSAR models (random forests (*RF*), support vector machines (*SVM*), logistic regression (*LR*), neural nets (*NN*)) with local models (`lazar`) and
- to apply these models to the prediction of pyrrolizidine alkaloid mutagenicity.

Materials and Methods
=====================

Data
----

### Mutagenicity training data

An identical training dataset was used for all models. It was compiled from the following sources:

- Kazius/Bursi Dataset (4337 compounds, @Kazius2005)
- Hansen Dataset (6513 compounds, @Hansen2009)
- EFSA Dataset (695 compounds, @EFSA2016)

Mutagenicity classifications from the Kazius and Hansen datasets were used without further processing. To achieve consistency with these datasets, EFSA compounds were classified as mutagenic if at least one positive result was found for the *Salmonella* strains TA98 or TA100.

Dataset merges were based on unique SMILES (*Simplified Molecular Input Line Entry Specification*) strings of the compound structures. Duplicated experimental data with the same outcome was merged into a single value, because it is likely that it originated from the same experiment; contradictory results were kept as multiple measurements in the database. A minimal sketch of this merge logic is shown below. The combined training dataset contains 8309 unique structures.

Source code for all data download, extraction and merge operations is publicly available from the git repository under a GPL3 License. The new combined dataset can be found at .
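The merge logic can be illustrated with a minimal R sketch. The data frame layout and column names are assumptions for illustration only; the actual merge code is available from the git repository.

```r
library(dplyr)

# Toy records compiled from the source datasets; 'smiles' is assumed to
# hold canonical SMILES and 'mutagenic' the Ames classification.
records <- data.frame(
  smiles    = c("c1ccccc1N", "c1ccccc1N", "c1ccccc1N", "CCO"),
  mutagenic = c(TRUE, TRUE, FALSE, FALSE)
)

# Duplicates with the same outcome collapse into a single record, while
# contradictory results remain as multiple measurements:
merged <- distinct(records, smiles, mutagenic)
merged
#>      smiles mutagenic
#> 1 c1ccccc1N      TRUE
#> 2 c1ccccc1N     FALSE
#> 3       CCO     FALSE
```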
### Pyrrolizidine alkaloid (PA) dataset

The testing dataset consisted of 602 different PAs. **TODO**: **Verena** Can you briefly summarise the sources and selection criteria? The compilation of the PA dataset is described in detail in Schöning et al. (2017).

Descriptors
-----------

### MolPrint2D (*MP2D*) fingerprints

MolPrint2D fingerprints (@Bender2004) use atom environments as molecular representation. For each atom in a molecule, they determine the atom types of its connected atoms to represent its chemical environment. This basically resembles the chemical concept of functional groups.

In contrast to predefined lists of fragments (e.g. FP3, FP4 or MACCS fingerprints) or descriptors (e.g. PaDEL), they are generated dynamically from chemical structures. This has the advantage that they can capture substructures of toxicological relevance that are not included in other descriptors.

Chemical similarities (e.g. Tanimoto indices) can be calculated very efficiently with MolPrint2D fingerprints. Using them as descriptors for global models leads, however, to huge, sparsely populated matrices that cannot be handled by traditional machine learning algorithms. In our experiments none of the R and Tensorflow algorithms was capable of using them as descriptors.

MolPrint2D fingerprints were calculated with the OpenBabel cheminformatics library (@OBoyle2011a).

### PaDEL descriptors

Molecular 1D and 2D descriptors were calculated with the PaDEL-Descriptors program (version 2.21, @Yap2011).

As the training dataset contained 8309 instances, it was decided to delete instances with missing values during data pre-processing. Furthermore, substances with equivocal outcome were removed. The final training dataset contained 8080 instances with known mutagenic potential.

During feature selection, descriptors with near zero variance were removed using the '*nearZeroVar*' function (package '*caret*'). A descriptor was classified as having near zero variance if the percentage of its most common value was more than 90%, or if the frequency ratio of its most common value to the second most common value was greater than 95:5 (e.g. 95 instances of the most common value and 5 or fewer instances of the second most common value). After that, highly correlated descriptors were removed using the '*findCorrelation*' function (package '*caret*') with a cut-off of 0.9. This resulted in a training dataset with 516 descriptors. These descriptors were scaled to the range between 0 and 1 using the '*preProcess*' function (package '*caret*'). The scaling routine was saved in order to apply the same scaling to the testing dataset. As these three steps did not consider the dependent variable (experimental mutagenicity), they did not need to be included in the cross-validation of the models.

To further reduce the number of features, a LASSO (*least absolute shrinkage and selection operator*) regression was performed using the '*glmnet*' function (package '*glmnet*'). The reduced dataset was used for the generation of the pre-trained models. A sketch of this pre-processing pipeline is given below.

PaDEL descriptors were used in global (RF, SVM, LR, NN) and local (`lazar`) models.
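The following sketch condenses this pre-processing pipeline into the caret and glmnet calls named above. Variable names (`x`, `y`) and all settings other than the stated 95:5 frequency cut, the 0.9 correlation cut-off and the [0, 1] scaling are illustrative assumptions, not the exact project scripts.

```r
library(caret)
library(glmnet)

# x: matrix of PaDEL descriptors, y: factor of mutagenicity classes
# (assumed to be loaded already; names are illustrative)

# 1. remove descriptors with near zero variance
nzv <- nearZeroVar(x, freqCut = 95/5, uniqueCut = 10)
if (length(nzv) > 0) x <- x[, -nzv]

# 2. remove highly correlated descriptors (cut-off 0.9)
high_cor <- findCorrelation(cor(x), cutoff = 0.9)
if (length(high_cor) > 0) x <- x[, -high_cor]

# 3. scale descriptors to [0, 1]; keep the routine for the test set
scaler <- preProcess(x, method = "range")
x <- predict(scaler, x)

# 4. LASSO feature selection (alpha = 1) with cross-validated lambda
fit <- cv.glmnet(as.matrix(x), y, family = "binomial", alpha = 1)
w <- coef(fit, s = "lambda.min")
selected <- rownames(w)[as.vector(w) != 0][-1]  # drop the intercept
x <- x[, selected]
```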
Algorithms
----------

### `lazar`

`lazar` (*lazy structure--activity relationships*) is a modular framework for read-across model development and validation. It follows this basic workflow: for a given chemical structure, `lazar`

- searches in a database for similar structures (neighbours) with experimental data,
- builds a local QSAR model with these neighbours and
- uses this model to predict the unknown activity of the query compound.

This procedure resembles an automated version of read-across predictions in toxicology; in machine learning terms it would be classified as a k-nearest-neighbour algorithm.

Apart from this basic workflow, `lazar` is completely modular and allows the researcher to use arbitrary algorithms for similarity searches and local QSAR (*quantitative structure--activity relationship*) modelling. The algorithms used within this study are described in the following sections, and a minimal sketch of the complete procedure is given at the end of this section.

#### Neighbour identification

Utilizing this modularity, similarity calculations were based both on MolPrint2D fingerprints and on PaDEL descriptors.

For MolPrint2D fingerprints, the chemical similarity between two compounds $a$ and $b$ is expressed as the proportion between the atom environments common to both structures $A \cap B$ and the total number of atom environments $A \cup B$ (Jaccard/Tanimoto index):

$$sim = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$$

For PaDEL descriptors, the chemical similarity between two compounds $a$ and $b$ is expressed as the cosine similarity between their descriptor vectors $A$ and $B$:

$$sim = \frac{A \cdot B}{\lvert A \rvert \lvert B \rvert}$$

Threshold selection is a trade-off between prediction accuracy (high threshold) and the number of predictable compounds (low threshold). As it is in many practical cases desirable to make predictions even in the absence of closely related neighbours, we follow a tiered approach:

- First, a similarity threshold of 0.5 is used to collect neighbours, to create a local QSAR model and to make a prediction for the query compound. These are predictions with *high confidence*.
- If any of these steps fails, the procedure is repeated with a similarity threshold of 0.2 and the prediction is flagged with a warning that it might be out of the applicability domain of the training data (*low confidence*).

The similarity thresholds of 0.5 and 0.2 are the default values chosen by the software developers and remained unchanged during the course of these experiments.

Compounds with the same structure as the query structure are automatically eliminated from the neighbours to obtain unbiased predictions in the presence of duplicates.

#### Local QSAR models and predictions

Only similar compounds (neighbours) above the threshold are used for local QSAR models. In this investigation, we used a weighted majority vote of the neighbours' experimental data for mutagenicity classifications. Probabilities for both classes (mutagenic/non-mutagenic) are calculated according to the following formula, and the class with the higher probability is used as prediction outcome:

$$p_{c} = \frac{\sum \text{sim}_{n,c}}{\sum \text{sim}_{n}}$$

where $p_{c}$ is the probability of class $c$ (e.g. mutagenic or non-mutagenic), $\sum \text{sim}_{n,c}$ the sum of the similarities of neighbours with class $c$, and $\sum \text{sim}_{n}$ the sum of the similarities of all neighbours.

#### Applicability domain

The applicability domain (AD) of `lazar` models is determined by the structural diversity of the training data. If no similar compounds are found in the training data, no predictions will be generated. Warnings are issued if the similarity threshold had to be lowered from 0.5 to 0.2 in order to enable predictions. Predictions without warnings can be considered as close to the applicability domain (*high confidence*) and predictions with warnings as more distant from the applicability domain (*low confidence*). Quantitative applicability domain information can be obtained from the similarities of the individual neighbours.
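The following self-contained R sketch illustrates the tiered neighbour search and the weighted majority vote described above. It is an illustration of the algorithm, not `lazar`'s actual implementation; fingerprints are represented as character vectors of atom environments and all data are toy examples.

```r
# Jaccard/Tanimoto similarity between two sets of atom environments
tanimoto <- function(a, b) length(intersect(a, b)) / length(union(a, b))

lazar_predict <- function(query, fingerprints, labels) {
  # (compounds identical to the query would be removed here first)
  sims <- sapply(fingerprints, tanimoto, b = query)
  for (threshold in c(0.5, 0.2)) {  # tiered thresholds: high/low confidence
    idx <- which(sims >= threshold)
    if (length(idx) > 0) {
      # weighted majority vote: p_c = sum(sim_{n,c}) / sum(sim_n)
      p_mut <- sum(sims[idx][labels[idx] == "mutagenic"]) / sum(sims[idx])
      return(list(
        prediction  = ifelse(p_mut >= 0.5, "mutagenic", "non-mutagenic"),
        probability = max(p_mut, 1 - p_mut),
        confidence  = ifelse(threshold == 0.5, "high", "low")
      ))
    }
  }
  NULL  # no neighbours at all: outside the applicability domain
}

# Toy usage with three training compounds and one query:
train_fps <- list(c("C-C", "C-N", "C=O"), c("C-C", "C-N"), c("C-O"))
labels    <- c("mutagenic", "non-mutagenic", "non-mutagenic")
query     <- c("C-C", "C-N", "C-O")
lazar_predict(query, train_fps, labels)
```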
#### Availability

- `lazar` experiments for this manuscript: (source code, GPL3)
- `lazar` framework: (source code, GPL3)
- `lazar` GUI: (source code, GPL3)
- Public web interface:

### R Random Forest, Support Vector Machines, and Deep Learning

The RF, SVM, and DL models were generated with the R software (R-project for Statistical Computing, version 3.3.1); the specific R packages used are identified for each step in the description below. Condensed sketches of the three model calls are given at the end of this section.

#### Random Forest (*RF*)

For the RF model, the '*randomForest*' function (package '*randomForest*') was used. A forest with 1000 trees and a maximum of 200 terminal nodes was grown for the prediction.

#### Support Vector Machines (*SVM*)

The '*svm*' function (package '*e1071*') with a *radial basis function kernel* was used for the SVM model.

**TODO**: **Verena, Philipp** Should we refer to the DL models, like the Tensorflow models, as neural nets (NN)?

#### Deep Learning (*DL*)

The DL model was generated using the '*h2o.deeplearning*' function (package '*h2o*'). It contained four hidden layers with 70, 50, 50, and 10 neurons, respectively. Other hyperparameters were set as follows: l1 = 1.0E-7, l2 = 1.0E-11, epsilon = 1.0E-10, rho = 0.8 and quantile\_alpha = 0.5. For all other hyperparameters, the default values were used. Weights and biases were determined in a first step with an unsupervised DL model. These values were then used for the actual, supervised DL model.

To validate these models, an internal cross-validation approach was chosen: the training dataset was randomly split into training data, which contained 95% of the data, and validation data, which contained 5% of the data. A feature selection with LASSO was performed on the training data, reducing the number of descriptors to approximately 100. This step was repeated five times. The predictive models were trained on each of the five different training datasets, and their performance was tested with the corresponding validation data. This step was repeated 10 times.

**TODO**: **Verena** Can you please check whether this is still correct and, if necessary, adjust Figure 1?

![Flowchart of the generation and validation of the models generated in R](figures/image1.png){#fig:valid}

#### Applicability domain

**TODO**: **Verena** Which descriptors did you use to calculate the Jaccard index? The Jaccard index requires binary descriptors (e.g. MP2D); with PaDEL descriptors one could calculate e.g. a Euclidean or cosine distance instead.

The AD of the training dataset and the PA dataset was evaluated using the Jaccard distance. A Jaccard distance of 0 indicates that the substances are similar, whereas a value of 1 shows that the substances are different. The Jaccard distance was below 0.2 for all PAs relative to the training dataset. Therefore, the PA dataset is within the AD of the training dataset and the models can be used to predict the genotoxic potential of the PA dataset.

#### Availability

R scripts for these experiments can be found at https://git.in-silico.ch/mutagenicity-paper/scripts/R.
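The three model calls can be sketched as follows, assuming a data frame `train` with the pre-processed PaDEL descriptors and a binary factor `mutagenicity`. Function and argument names follow the cited packages; everything else, including the omission of the unsupervised pre-training step, is an illustrative simplification.

```r
library(randomForest)
library(e1071)
library(h2o)

# Random forest: 1000 trees, at most 200 terminal nodes per tree
rf_model <- randomForest(mutagenicity ~ ., data = train,
                         ntree = 1000, maxnodes = 200)

# Support vector machine with a radial basis function kernel
svm_model <- svm(mutagenicity ~ ., data = train, kernel = "radial")

# Deep learning: four hidden layers (70, 50, 50, 10 neurons) with the
# hyperparameters stated above (the unsupervised pre-training step
# described in the text is omitted here)
h2o.init()
train_h2o <- as.h2o(train)
dl_model <- h2o.deeplearning(
  y = "mutagenicity", training_frame = train_h2o,
  hidden = c(70, 50, 50, 10),
  l1 = 1e-7, l2 = 1e-11, epsilon = 1e-10, rho = 0.8,
  quantile_alpha = 0.5
)
```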
### Tensorflow models

Data pre-processing was done by rank transformation using the '*QuantileTransformer*' procedure. A sequential model with four layers was used: an input layer (12 nodes), two hidden layers (8 nodes each) and one output layer. A sigmoidal activation function was used for the output layer and the ReLU ('*Rectified Linear Unit*') activation function for all other layers. Additionally, an L^2^ penalty of 0.001 was applied to the input layer.

For the training of the model, the ADAM algorithm was used to minimise the cross-entropy loss, using the default parameters of Keras. Training was performed for 100 epochs with a batch size of 64. The model was implemented with Python 3.6 and Keras; a sketch of the architecture is shown below.
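The original implementation used Python 3.6; for consistency with the other examples in this paper, the sketch below uses the equivalent R interface to Keras. `x_train`, `y_train` and `n_features` are assumed to be the pre-processed descriptors, the binary labels and the descriptor count, and the described input layer is interpreted as the first dense layer.

```r
library(keras)

# Sequential model: 12-node input layer with L2 penalty, two 8-node
# hidden layers (ReLU), one sigmoidal output node
model <- keras_model_sequential() %>%
  layer_dense(units = 12, activation = "relu",
              kernel_regularizer = regularizer_l2(0.001),
              input_shape = n_features) %>%
  layer_dense(units = 8, activation = "relu") %>%
  layer_dense(units = 8, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "adam",                 # ADAM with Keras default parameters
  loss      = "binary_crossentropy",  # cross-entropy for two classes
  metrics   = "accuracy"
)

model %>% fit(x_train, y_train, epochs = 100, batch_size = 64)
```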
**TODO**: **Philipp** I have left out the old results with feature selection, is that OK? In that case this paragraph would also have to be deleted, right?

**TODO**: **Philipp** Can you please complete the following paragraphs?

#### Random forests (*RF*)

#### Logistic regression (SGD) (*LR-sgd*)

#### Logistic regression (scikit) (*LR-scikit*)

**TODO**: **Philipp, Verena** DL or NN?

#### Neural Nets (*NN*)

Alternatively, a DL model was established with the Python-based Tensorflow program, using the high-level API Keras to build the models.

Tensorflow models used the same PaDEL descriptors as the R models.

#### Availability

Jupyter notebooks for these experiments can be found at https://git.in-silico.ch/mutagenicity-paper/scripts/tensorflow.

Validation
----------

10-fold cross-validation was used for all Tensorflow models.

Results
=======

10-fold crossvalidations
------------------------

Crossvalidation results are summarized in the following tables: @tbl:lazar shows the `lazar` results with MolPrint2D and PaDEL descriptors, @tbl:R the R results and @tbl:tensorflow the Tensorflow results.

```{#tbl:lazar .table file="tables/lazar-summary.csv" caption="Summary of lazar crossvalidation results (all/high confidence predictions)"}
```

```{#tbl:R .table file="tables/r-summary.csv" caption="Summary of R crossvalidation results"}
```

```{#tbl:tensorflow .table file="tables/tensorflow-summary.csv" caption="Summary of tensorflow crossvalidation results"}
```

@fig:roc depicts the position of all crossvalidation results in receiver operating characteristic (ROC) space.

![ROC plot of crossvalidation results.](figures/roc.png){#fig:roc}

Confusion matrices for all models are available from the git repository https://git.in-silico.ch/mutagenicity-paper/10-fold-crossvalidations/confusion-matrices/; individual predictions can be found at https://git.in-silico.ch/mutagenicity-paper/10-fold-crossvalidations/predictions/.

The most accurate crossvalidation predictions were obtained with standard `lazar` models using MolPrint2D descriptors ({{cv.lazar-high-confidence.acc}} for predictions with high confidence, {{cv.lazar-all.acc}} for all predictions). Models utilizing PaDEL descriptors have generally lower accuracies, ranging from {{cv.R-DL.acc}} (R deep learning) to {{cv.R-RF.acc}} (R/Tensorflow random forests). Sensitivity and specificity are generally well balanced, with the exception of the `lazar`-PaDEL (low sensitivity) and R deep learning (low specificity) models.

Pyrrolizidine alkaloid mutagenicity predictions
-----------------------------------------------

Mutagenicity predictions from all investigated models for 602 pyrrolizidine alkaloids (PAs) are shown in Table 4. A CSV table with all predictions can be downloaded from https://git.in-silico.ch/mutagenicity-paper/tables/pa-table.csv.

**TODO**: **Verena and Philipp** Could you please spot-check the table?

\input{tables/pa-tab.tex}

@tbl:pa-summary summarises the number of positive and negative mutagenicity predictions for all investigated models.

```{#tbl:pa-summary .table file="tables/pa-summary.csv" caption="Summary of pyrrolizidine alkaloid mutagenicity predictions"}
```

For the visualisation of the position of the pyrrolizidine alkaloids with respect to the training dataset, we have applied t-distributed stochastic neighbor embedding (t-SNE, @Maaten2008) to the MolPrint2D and PaDEL descriptors. t-SNE maps each high-dimensional object (chemical) to a two-dimensional point such that similar objects are represented by nearby points and dissimilar objects by distant points. A sketch for reproducing these maps is shown after the figures.

@fig:tsne-mp2d shows the t-SNE of the pyrrolizidine alkaloids (PA) and the mutagenicity training data in MP2D space (Tanimoto/Jaccard similarity).

![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA) in MP2D space](figures/tsne-mp2d.png){#fig:tsne-mp2d}

@fig:tsne-padel shows the t-SNE of the pyrrolizidine alkaloids (PA) and the mutagenicity training data in PaDEL space (Euclidean similarity).

![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA) in PaDEL space](figures/tsne-padel.png){#fig:tsne-padel}
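The t-SNE maps can be reproduced along the following lines with the Rtsne package. This is a generic sketch, assuming a descriptor matrix `x` with one row per compound and a logical vector `is_pa` marking the pyrrolizidine alkaloids; it is not the exact script used for the figures.

```r
library(Rtsne)

# x: numeric matrix (PaDEL descriptors or binary MP2D fingerprints),
# one row per compound; 'is_pa' flags the pyrrolizidine alkaloids
tsne <- Rtsne(x, dims = 2, perplexity = 30, check_duplicates = FALSE)
plot(tsne$Y, col = ifelse(is_pa, "red", "grey"),
     xlab = "t-SNE 1", ylab = "t-SNE 2")
```

For the MP2D map, a precomputed Tanimoto distance matrix can be passed to `Rtsne` instead, together with `is_distance = TRUE`.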
Discussion
==========

Data
----

A new training dataset for *Salmonella* mutagenicity was created from three different sources (@Kazius2005, @Hansen2009, @EFSA2016). It contains 8309 unique chemical structures, which is, to our knowledge, the largest public mutagenicity dataset presently available. The new training data can be downloaded from .

Model performance
-----------------

@tbl:lazar, @tbl:R, @tbl:tensorflow and @fig:roc show that the standard `lazar` algorithm (with MP2D fingerprints) gives the most accurate crossvalidation results. R Random Forest, Support Vector Machine and Tensorflow models have similar accuracies with balanced sensitivity (true positive rate) and specificity (true negative rate). `lazar` models with PaDEL descriptors have low sensitivity, and R Deep Learning models have low specificity.

The accuracy of `lazar` *in-silico* predictions is comparable to the interlaboratory variability of the Ames test (80-85% according to @Benigni1988), especially for predictions with high confidence ({{cv.lazar-high-confidence.acc_perc}}%). This is a clear indication that *in-silico* predictions can be as reliable as the bioassays, if the compounds are close to the applicability domain. This conclusion is also supported by our analysis of `lazar` lowest observed effect level predictions, which are also similar to the experimental variability (@Helma2018).

The lowest number of predictions ({{cv.lazar-padel-high-confidence.n}}) was obtained from `lazar`-PaDEL high confidence predictions; the largest number of predictions comes from Tensorflow models ({{cv.tensorflow-rf.v3.n}}). Standard `lazar` models give a slightly lower number of predictions ({{cv.lazar-all.n}}) than R and Tensorflow models. This is not necessarily a disadvantage, because `lazar` abstains from predictions if the query compound is very dissimilar from the compounds in the training set, and thus avoids making predictions for compounds outside of the applicability domain.

Descriptors
-----------

This study uses two types of descriptors for the characterisation of chemical structures:

*MolPrint2D* fingerprints (MP2D, @Bender2004) use atom environments (i.e. connected atom types for all atoms in a molecule) as molecular representation, which basically resembles the chemical concept of functional groups.

MP2D descriptors are used to determine chemical similarities in the default `lazar` settings, and previous experiments have shown that they give more accurate results than predefined fragments (e.g. MACCS, FP2-4).

In order to investigate if MP2D fingerprints are also suitable for global models, we tried to build R and Tensorflow models, both with and without unsupervised feature selection. Unfortunately, none of the algorithms was capable of dealing with the large and sparsely populated descriptor matrix. Based on this result we can conclude that MolPrint2D descriptors are at the moment unsuitable for standard global machine learning algorithms. `lazar` does not suffer from this size and sparseness problem, because (a) it internally utilizes a much more efficient occurrence-based representation and (b) it uses fingerprints only for similarity calculations and not as model parameters.

PaDEL calculates topological and physical-chemical descriptors.

**TODO**: **Verena** Can you please briefly describe the descriptors again?

*PaDEL* descriptors were used for `lazar`, R and Tensorflow models. All models based on PaDEL descriptors had similar crossvalidation accuracies, which were significantly lower than the `lazar` MolPrint2D results. Direct comparisons are available only for the `lazar` algorithm, and in this case, too, PaDEL accuracies were lower than MolPrint2D accuracies.

Based on the `lazar` results we can conclude that PaDEL descriptors are less suited for chemical similarity calculations than MP2D descriptors. It is also likely that PaDEL descriptors lead to less accurate predictions for global models, but we cannot draw any definitive conclusion in the absence of MP2D models.

Algorithms
----------

`lazar` is formally a *k-nearest-neighbour* algorithm that searches for similar structures for a given compound and calculates the prediction based on the experimental data for these structures. The QSAR literature frequently calls such models *local models*, because models are generated specifically for each query compound. R and Tensorflow models are, in contrast, *global models*, i.e. a single model is used to make predictions for all compounds.

It has been postulated in the past that local models are more accurate, because they can account better for mechanisms that affect only a subset of the training data. Our results seem to support this assumption, because standard `lazar` models with MolPrint2D descriptors perform better than global models. The accuracy of `lazar` models with PaDEL descriptors is, however, substantially lower and comparable to that of global models with the same descriptors.

This observation may lead to the conclusion that the choice of suitable descriptors is more important for predictive accuracy than the modelling algorithm, but we were unable to obtain global MP2D models for direct comparisons. The selection of an appropriate modelling algorithm is still crucial, because it needs the capability to handle the descriptor space. Neighbour (and thus similarity) based algorithms like `lazar` have a clear advantage in this respect over global machine learning algorithms (e.g. RF, SVM, LR, NN), because Tanimoto/Jaccard similarities can be calculated efficiently with simple set operations.
Pyrrolizidine alkaloid mutagenicity predictions
-----------------------------------------------

`lazar` models with MolPrint2D descriptors predicted {{pa.lazar.mp2d.all.n_perc}}% of the pyrrolizidine alkaloids (PAs) ({{pa.lazar.mp2d.high_confidence.n_perc}}% with high confidence); the remaining compounds are not within their applicability domain. All other models predicted 100% of the 602 compounds, indicating that all compounds are within their applicability domain.

Mutagenicity predictions from the different models show little agreement in general (Table 4). 42 of the 602 PAs have non-conflicting predictions (all of them non-mutagenic). Most models predict a predominantly non-mutagenic outcome for PAs, with the exception of the R deep learning (DL) and the Tensorflow scikit logistic regression models ({{pa.tf.dl.mut_perc}}% and {{pa.tf.lr_scikit.mut_perc}}% positive predictions). R RF and SVM models favour non-mutagenic predictions very strongly (only {{pa.r.rf.mut_perc}}% and {{pa.r.svm.mut_perc}}% mutagenic PAs), while Tensorflow models classify approximately half of the PAs as mutagenic (RF {{pa.tf.rf.mut_perc}}%, LR-sgd {{pa.tf.lr_sgd}}%, LR-scikit {{pa.tf.lr_scikit.mut_perc}}%, NN {{pa.tf.nn.mut_perc}}%). `lazar` models predict predominantly non-mutagenicity, but to a lesser extent than the R models (MP2D {{pa.lazar.mp2d.all.mut_perc}}%, PaDEL {{pa.lazar.padel.all.mut_perc}}%).

It is interesting to note that different implementations of the same algorithm show little accordance in their predictions (see e.g. R-RF vs. Tensorflow-RF and LR-sgd vs. LR-scikit in Table 4 and @tbl:pa-summary). **TODO**: **Verena, Philipp** Do you have an explanation for this?

@fig:tsne-mp2d and @fig:tsne-padel show the t-SNE maps of the training data and the pyrrolizidine alkaloids. In @fig:tsne-mp2d the PAs are located closely together at the outer border of the training set. In @fig:tsne-padel they are less clearly separated and spread over the space occupied by the training examples.

This is probably the reason why the PaDEL models predicted all instances and the MP2D model only {{pa.lazar.mp2d.all.n}} PAs. Predicting a large number of instances is, however, not the ultimate goal; we need accurate predictions and an unambiguous estimation of the applicability domain. With PaDEL descriptors *all* PAs are within the applicability domain of the training data, which is unlikely despite the size of the training set. MolPrint2D descriptors provide a clearer separation, which is also reflected in a better separation between high and low confidence predictions in `lazar` MP2D predictions as compared to `lazar` PaDEL predictions. Crossvalidation results with substantially higher accuracies for MP2D models than for PaDEL models also support this argument.

Differences between MP2D and PaDEL descriptors can be explained by their specific properties: PaDEL calculates a fixed set of descriptors for all structures, while MolPrint2D descriptors represent the substructures that are present in a compound. For this reason there is no fixed number of MP2D descriptors; the descriptor space consists of all unique substructures of the training set. If a query compound contains new substructures, this is immediately reflected in a lower similarity to training compounds, which makes applicability domain estimations very straightforward. With PaDEL (or any other predefined descriptors), the same set of descriptors is calculated for every compound, even if a compound comes from a completely new chemical class.
From a practical point of view, we still have to face the question of how to choose model predictions if no experimental data is available (we found two PAs in the training data, but this number is too low to draw any general conclusions). Based on the crossvalidation results and the arguments in favour of MolPrint2D descriptors, we would put the highest trust in `lazar` MolPrint2D predictions, especially in high-confidence predictions. `lazar` predictions have an accuracy comparable to the experimental variability (@Helma2018) for compounds within the applicability domain. They should, however, not be trusted blindly. For practical purposes it is important to study the rationales (i.e. the neighbours and their experimental activities) for each prediction of relevance. A freely accessible GUI for this purpose has been implemented at https://lazar.in-silico.ch.

**TODO**: **Verena** If you want to discuss lazar results in detail, I can compile detailed predictions (with similar compounds and their activities) for individual examples.

Conclusions
===========

A new public *Salmonella* mutagenicity training dataset with 8309 compounds was created and used to train `lazar`, R and Tensorflow models with MolPrint2D and PaDEL descriptors.

The best performance was obtained with `lazar` models using MolPrint2D descriptors, with prediction accuracies ({{cv.lazar-high-confidence.acc_perc}}%) comparable to the interlaboratory variability of the Ames test (80-85%). Models based on PaDEL descriptors had lower accuracies than MolPrint2D models, but only the `lazar` algorithm could use MolPrint2D descriptors.

**TODO**: PA predictions

References
==========