---
title: A comparison of nine machine learning mutagenicity models and their application for predicting pyrrolizidine alkaloids
author:
  - Christoph Helma:
      institute: ist
      email: helma@in-silico.ch
      correspondence: "yes"
  - Verena Schöning:
      institute: insel
  - Jürgen Drewe:
      institute: zeller, unibas
  - Philipp Boss:
      institute: sysbio
institute:
  - ist:
      name: in silico toxicology gmbh
      address: "Rastatterstrasse 41, 4057 Basel, Switzerland"
  - zeller:
      name: Max Zeller Söhne AG
      address: "Seeblickstrasse 4, 8590 Romanshorn, Switzerland"
  - sysbio:
      name: Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association
      address: "Robert-Rössle-Strasse 10, Berlin, 13125, Germany"
  - unibas:
      name: Clinical Pharmacology, Department of Pharmaceutical Sciences, University Hospital Basel, University of Basel
      address: "Petersgraben 4, 4031 Basel, Switzerland"
  - insel:
      name: Clinical Pharmacology and Toxicology, Department of General Internal Medicine, University Hospital Bern, University of Bern
      address: "Inselspital, 3010 Bern, Switzerland"
bibliography: bibliography.bib
keywords: mutagenicity, QSAR, lazar, random forest, support vector machine, linear regression, neural nets, deep learning, pyrrolizidine alkaloids, OpenBabel, CDK
documentclass: scrartcl
tblPrefix: Table
figPrefix: Figure
header-includes:
  - \usepackage{lineno, setspace, color, colortbl, longtable}
  - \doublespacing
  - \linenumbers
...

Abstract
========

Random forest, support vector machine, logistic regression, neural network and k-nearest neighbour (`lazar`) algorithms were applied to a new *Salmonella* mutagenicity dataset with {{cv.n_uniq}} unique chemical structures, utilizing MolPrint2D and Chemistry Development Kit (CDK) descriptors. Crossvalidation accuracies of all investigated models ranged from 80-85%, which is comparable to the interlaboratory variability of the *Salmonella* mutagenicity assay. Pyrrolizidine alkaloid predictions showed a clear distinction between chemical groups, with Otonecines having the highest proportion of positive mutagenicity predictions and Monoesters the lowest.

Introduction
============

**TODO**: rationale for investigation

The main objectives of this study were

- to generate a new mutagenicity training dataset by combining the most comprehensive public datasets
- to compare the performance of MolPrint2D (*MP2D*) fingerprints with Chemistry Development Kit (*CDK*) descriptors
- to compare the performance of global QSAR models (random forests (*RF*), support vector machines (*SVM*), logistic regression (*LR*), neural nets (*NN*)) with local models (`lazar`)
- to apply these models for the prediction of pyrrolizidine alkaloid mutagenicity

Materials and Methods
=====================

Data
----

### Mutagenicity training data

An identical training dataset was used for all models. The training dataset was compiled from the following sources:

- Kazius/Bursi Dataset (4337 compounds, @Kazius2005):
- Hansen Dataset (6513 compounds, @Hansen2009):
- EFSA Dataset (695 compounds, @EFSA2016):

Mutagenicity classifications from the Kazius and Hansen datasets were used without further processing. To achieve consistency with these datasets, EFSA compounds were classified as mutagenic if at least one positive result was found for the TA98 or TA100 *Salmonella* strains.

Dataset merges were based on unique SMILES (*Simplified Molecular Input Line Entry Specification*, @Weininger1989) strings of the compound structures. Duplicated experimental data with the same outcome was merged into a single value, because it is likely that it originated from the same experiment. Contradictory results were kept as multiple measurements in the database.
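The following minimal sketch illustrates this merging strategy. File and column names, and the use of pandas and OpenBabel for SMILES canonicalisation, are illustrative assumptions only; the authors' actual merge code is available from the git repository mentioned below.

```python
# Illustrative sketch of the merge strategy described above (assumed file and
# column names; not the actual pipeline from the linked repository).
import pandas as pd
from openbabel import pybel

def canonical_smiles(smiles: str) -> str:
    """Canonicalise a SMILES string so that identical structures merge on one key."""
    return pybel.readstring("smi", smiles).write("can").split("\t")[0].strip()

# each source provides a SMILES column and a binary mutagenicity outcome
sources = ["kazius.csv", "hansen.csv", "efsa.csv"]          # assumed file names
data = pd.concat([pd.read_csv(f) for f in sources], ignore_index=True)
data["smiles"] = data["smiles"].map(canonical_smiles)

# identical (structure, outcome) pairs are collapsed into a single measurement,
# while contradictory outcomes for the same structure are kept
data = data.drop_duplicates(subset=["smiles", "mutagenicity"])

print(data["smiles"].nunique(), "unique structures,", len(data), "measurements")
```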
The combined training dataset contains {{cv.n_uniq}} unique structures and {{cv.n}} individual measurements. Source code for all data download, extraction and merge operations is publicly available from the git repository under a GPL3 License. The new combined dataset can be found at .

### Pyrrolizidine alkaloid (PA) dataset

The pyrrolizidine alkaloid dataset was created from five independent necine-base substructure searches in PubChem (https://pubchem.ncbi.nlm.nih.gov/) and compared to the PAs listed in the EFSA publication @EFSA2011 and the book by Mattocks @Mattocks1986, to ensure that all major PAs were included. PAs mentioned in these publications which were not found in the downloaded substances were searched individually in PubChem and, if available, downloaded separately. Non-PA substances, duplicates, and isomers were removed from the files, but artificial PAs, even if unlikely to occur in nature, were kept. The resulting PA dataset comprised a total of {{pa.n}} different PAs.

The PAs in the dataset were classified according to structural features. A total of 9 different structural features were assigned to the necine base, to modifications of the necine base and to the necic acid.

For the necine base, the following structural features were chosen:

- Retronecine-type (1,2-unsaturated necine base, {{pa.groups.Retronecine.n}} compounds)
- Otonecine-type (1,2-unsaturated necine base, {{pa.groups.Otonecine.n}} compounds)
- Platynecine-type (1,2-saturated necine base, {{pa.groups.Platynecine.n}} compounds)

For the modifications of the necine base, the following structural features were chosen:

- N-oxide-type ({{pa.groups.N_oxide.n}} compounds)
- Tertiary-type (PAs which were neither from the N-oxide- nor DHP-type, {{pa.groups.Tertiary_PA.n}} compounds)
- Dehydropyrrolizidine-type (pyrrolic ester, {{pa.groups.Dehydropyrrolizidine.n}} compounds)

For the necic acid, the following structural features were chosen:

- Monoester-type ({{pa.groups.Monoester.n}} compounds)
- Open-ring diester-type ({{pa.groups.Diester.n}} compounds)
- Macrocyclic diester-type ({{pa.groups.Macrocyclic_diester.n}} compounds)

The compilation of the PA dataset is described in detail in @Schoening2017.

Descriptors
-----------

### MolPrint2D (*MP2D*) fingerprints

MolPrint2D fingerprints (@OBoyle2011a) use atom environments as molecular representation. For each atom in a molecule, they encode the atom types of its connected atoms to represent its chemical environment, which basically resembles the chemical concept of functional groups. In contrast to predefined lists of fragments (e.g. FP3, FP4 or MACCS fingerprints) or descriptors (e.g. CDK), they are generated dynamically from chemical structures. This has the advantage that they can capture unknown substructures of toxicological relevance that are not included in other descriptors. In addition, they allow the efficient calculation of chemical similarities (e.g. Tanimoto indices) with simple set operations. MolPrint2D fingerprints were calculated with the OpenBabel cheminformatics library (@OBoyle2011a).
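As an illustration, the following sketch generates MolPrint2D atom environments with OpenBabel's Python bindings. The availability of the `mpd` (MolPrint2D) output format and its whitespace-separated field layout are assumptions about the OpenBabel version in use; the actual scripts are available from the locations listed below.

```python
# Minimal sketch of MolPrint2D fingerprint generation with OpenBabel's Python
# bindings; "mpd" output format and field layout are assumptions.
from openbabel import pybel

def mp2d(smiles: str) -> set[str]:
    """Return the set of MolPrint2D atom environments of a structure."""
    mol = pybel.readstring("smi", smiles)
    mol.title = "query"
    fields = mol.write("mpd").split()   # first field is the title
    return set(fields[1:])

aspirin = mp2d("CC(=O)Oc1ccccc1C(=O)O")
salicylic_acid = mp2d("Oc1ccccc1C(=O)O")
print(len(aspirin), "atom environments,", len(aspirin & salicylic_acid), "shared")
```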
Precomputed MolPrint2D fingerprints can be obtained from the following locations:

*Training data:*

- sparse representation ()
- descriptor matrix ()

*Pyrrolizidine alkaloids:*

- sparse representation ()
- descriptor matrix ()

### Chemistry Development Kit (*CDK*) descriptors

Molecular 1D and 2D descriptors were calculated with the PaDEL-Descriptors program (version 2.21, @Yap2011). PaDEL uses the Chemistry Development Kit (*CDK*, ) library for descriptor calculations.

As the training dataset contained {{cv.n_uniq}} instances, it was decided to delete instances with missing values during data pre-processing. Furthermore, substances with equivocal outcome were removed. The final training dataset contained {{cv.cdk.n_descriptors}} descriptors for {{cv.cdk.n_compounds}} compounds. CDK training data can be obtained from .

The same procedure was applied for the pyrrolizidine dataset, yielding {{pa.cdk.n_descriptors}} descriptors for {{pa.cdk.n_compounds}} compounds. CDK features for pyrrolizidine alkaloids are available at .

Algorithms
----------

### `lazar`

`lazar` (*lazy structure activity relationships*) is a modular framework for read-across model development and validation. It follows this basic workflow: for a given chemical structure `lazar`

- searches in a database for similar structures (neighbours) with experimental data,
- builds a local QSAR model with these neighbours and
- uses this model to predict the unknown activity of the query compound.

This procedure resembles an automated version of read-across predictions in toxicology; in machine learning terms it would be classified as a k-nearest-neighbour algorithm.

Apart from this basic workflow, `lazar` is completely modular and allows the researcher to use arbitrary algorithms for similarity searches and local QSAR (*Quantitative structure--activity relationship*) modelling. Algorithms used within this study are described in the following sections.

#### Feature preprocessing

MolPrint2D features were used without preprocessing. Near zero variance and strongly correlated CDK descriptors were removed and the remaining descriptor values were centered and scaled. Preprocessing was performed with the R `caret` preProcess function using the methods "nzv", "corr", "center" and "scale" with default settings.

#### Neighbour identification

Utilizing this modularity, similarity calculations were based both on MolPrint2D fingerprints and on CDK descriptors.

For MolPrint2D fingerprints the chemical similarity between two compounds $a$ and $b$ is expressed as the proportion between the atom environments common to both structures $A \cap B$ and the total number of atom environments $A \cup B$ (Jaccard/Tanimoto index):

$$sim = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$$

For CDK descriptors the chemical similarity between two compounds $a$ and $b$ is expressed as the cosine similarity between the descriptor vectors $A$ for $a$ and $B$ for $b$:

$$sim = \frac{A \cdot B}{\lvert A \rvert \lvert B \rvert}$$
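As a concrete illustration, the following sketch implements both similarity measures; function and variable names are ours and not part of `lazar`.

```python
# Both similarity measures used for neighbour identification, as defined above.
import numpy as np

def tanimoto(a: set, b: set) -> float:
    """Jaccard/Tanimoto index between two MolPrint2D atom-environment sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two (preprocessed) CDK descriptor vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(tanimoto({"env_1", "env_2"}, {"env_1", "env_3"}))     # 1 shared / 3 total
print(cosine(np.array([0.1, 1.2, -0.3]), np.array([0.2, 1.0, -0.1])))
```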
Threshold selection is a trade-off between prediction accuracy (high threshold) and the number of predictable compounds (low threshold). As it is in many practical cases desirable to make predictions even in the absence of closely related neighbours, we follow a tiered approach:

- First a similarity threshold of 0.5 (MP2D/Tanimoto) or 0.9 (CDK/Cosine) is used to collect neighbours, to create a local QSAR model and to make a prediction for the query compound. These are predictions with *high confidence*.
- If any of these steps fails, the procedure is repeated with a similarity threshold of 0.2 (MP2D/Tanimoto) or 0.7 (CDK/Cosine) and the prediction is flagged with a warning that it might be out of the applicability domain of the training data (*low confidence*).
- These similarity thresholds are the default values chosen by the software developers and remained unchanged during the course of these experiments.

Compounds with the same structure as the query structure are automatically eliminated from the neighbours to obtain unbiased predictions in the presence of duplicates.

#### Local QSAR models and predictions

Only similar compounds (neighbours) above the threshold are used for local QSAR models. In this investigation, we are using a weighted majority vote from the neighbours' experimental data for mutagenicity classifications. Probabilities for both classes (mutagenic/non-mutagenic) are calculated according to the following formula and the class with the higher probability is used as prediction outcome:

$$p_{c} = \frac{\sum \text{sim}_{n,c}}{\sum \text{sim}_{n}}$$

$p_{c}$: probability of class $c$ (e.g. mutagenic or non-mutagenic)\
$\sum \text{sim}_{n,c}$: sum of similarities of neighbours with class $c$\
$\sum \text{sim}_{n}$: sum of similarities of all neighbours
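The following sketch illustrates the tiered neighbour selection and the weighted majority vote defined above; it is a simplified illustration with our own names, not the actual `lazar` implementation.

```python
# Minimal sketch of neighbour selection and the weighted majority vote.
def neighbours(similarities, thresholds=(0.5, 0.2)):
    """similarities: (similarity, experimental_class) tuples for all training
    compounds. Returns the neighbours above the first threshold that yields
    any hits, plus a high-confidence flag (False means the lower threshold
    had to be used, i.e. a low-confidence prediction)."""
    for t in thresholds:
        hits = [(sim, cls) for sim, cls in similarities if sim >= t]
        if hits:
            return hits, t == thresholds[0]
    return [], False

def weighted_majority_vote(nbrs):
    """p_c = sum of similarities of neighbours with class c / sum of all similarities."""
    total = sum(sim for sim, _ in nbrs)
    p = {c: sum(sim for sim, cls in nbrs if cls == c) / total
         for c in ("mutagenic", "non-mutagenic")}
    return max(p, key=p.get), p

nbrs, high_confidence = neighbours(
    [(0.8, "mutagenic"), (0.6, "mutagenic"), (0.3, "non-mutagenic")]
)
print(weighted_majority_vote(nbrs), high_confidence)
```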
#### Applicability domain

The applicability domain (AD) of `lazar` models is determined by the structural diversity of the training data. If no similar compounds are found in the training data, no predictions will be generated. Warnings are issued if the similarity threshold had to be lowered from 0.5 to 0.2 in order to enable predictions. Predictions without warnings can be considered as close to the applicability domain (*high confidence*) and predictions with warnings as more distant from the applicability domain (*low confidence*). Quantitative applicability domain information can be obtained from the similarities of individual neighbours.

#### Validation

10-fold cross-validation was performed for model evaluation.

#### Pyrrolizidine alkaloid predictions

For the prediction of pyrrolizidine alkaloids, models were generated with the MP2D and CDK training datasets. The complete feature set was used for MP2D predictions; for CDK predictions the intersection between training and pyrrolizidine alkaloid features was used.

#### Availability

- Source code for this manuscript (GPL3):
- Crossvalidation experiments (GPL3):
- Pyrrolizidine alkaloid predictions (GPL3):
- Public web interface:

### Tensorflow models

#### Feature preprocessing

For preprocessing of the CDK features we used a quantile transformation to a uniform distribution. MP2D features were not preprocessed.

#### Random forests (*RF*)

For the random forest classifier we used the parameters `n_estimators=1000` and `max_leaf_nodes=200`. For the other parameters we used the scikit-learn default values.

#### Logistic regression (SGD) (*LR-sgd*)

For the logistic regression we used an ensemble of five trained models. For each model we used a batch size of 64 and trained for 50 epochs. ADAM was chosen as the optimizer. For the other parameters we used the Tensorflow default values.

#### Logistic regression (scikit) (*LR-scikit*)

For the logistic regression we used the scikit-learn default values for all parameters.

#### Neural nets (*NN*)

For the neural network we used an ensemble of five trained models. For each model we used a batch size of 64 and trained for 50 epochs. ADAM was chosen as the optimizer. The neural network had 4 hidden layers with 64 nodes each and a ReLU activation function. For the other parameters we used the Tensorflow default values.

#### Support vector machines (*SVM*)

We used the SVM implemented in scikit-learn with the parameters `kernel='rbf'` and `gamma='scale'`. For the other parameters we used the scikit-learn default values.

#### Validation

10-fold cross-validation was used for all Tensorflow models.
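The following sketch summarises how these global models could be set up with the stated hyperparameters. It is a simplified illustration, not the actual notebooks linked below; in particular, the way the five-model ensembles combine their outputs (here by averaging predicted probabilities) is our assumption.

```python
# Sketch of the global models with the hyperparameters stated above.
import numpy as np
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import QuantileTransformer
from sklearn.svm import SVC

# CDK features: quantile transformation to a uniform distribution
# (MP2D fingerprints are used as a raw binary matrix)
preprocess = QuantileTransformer(output_distribution="uniform")

rf = RandomForestClassifier(n_estimators=1000, max_leaf_nodes=200)
lr = LogisticRegression()                 # LR-scikit, default parameters
svm = SVC(kernel="rbf", gamma="scale")

def build_nn(n_features: int) -> tf.keras.Model:
    """Four hidden layers with 64 ReLU nodes each and a sigmoid output."""
    model = tf.keras.Sequential(
        [tf.keras.Input(shape=(n_features,))]
        + [tf.keras.layers.Dense(64, activation="relu") for _ in range(4)]
        + [tf.keras.layers.Dense(1, activation="sigmoid")]
    )
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# X: descriptor matrix, y: binary mutagenicity labels (data loading not shown)
# scores = cross_val_score(rf, X, y, cv=10)              # 10-fold crossvalidation
# ensemble = [build_nn(X.shape[1]) for _ in range(5)]    # ensemble of five nets
# for net in ensemble:
#     net.fit(X_train, y_train, batch_size=64, epochs=50, verbose=0)
# p = np.mean([net.predict(X_test) for net in ensemble], axis=0)  # averaged probabilities
```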
#### Pyrrolizidine alkaloid predictions

For the prediction of pyrrolizidine alkaloids we trained the models described above on the training data. For training and prediction only those features were used that occur both in the training data and in the pyrrolizidine alkaloid dataset.

#### Availability

Jupyter notebooks for these experiments can be found at the following locations:

*Crossvalidation:*

- MolPrint2D fingerprints:
- CDK descriptors:

*Pyrrolizidine alkaloids:*

- MolPrint2D fingerprints:
- CDK descriptors:

Results
=======

10-fold crossvalidations
------------------------

Crossvalidation results are summarized in the following tables: @tbl:cv-mp2d shows results with MolPrint2D descriptors and @tbl:cv-cdk with CDK descriptors.

| | lazar-HC | lazar-all | RF | LR-sgd | LR-scikit | NN | SVM |
|:-|----------|-----------|----|--------|-----------|----|-----|
| Accuracy | {{cv.mp2d_lazar_high_confidence.acc_perc}} | {{cv.mp2d_lazar_all.acc_perc}} | {{cv.mp2d_rf.acc_perc}} | {{cv.mp2d_lr.acc_perc}} | {{cv.mp2d_lr2.acc_perc}} | {{cv.mp2d_nn.acc_perc}} | {{cv.mp2d_svm.acc_perc}} |
| True positive rate | {{cv.mp2d_lazar_high_confidence.tpr_perc}} | {{cv.mp2d_lazar_all.tpr_perc}} | {{cv.mp2d_rf.tpr_perc}} | {{cv.mp2d_lr.tpr_perc}} | {{cv.mp2d_lr2.tpr_perc}} | {{cv.mp2d_nn.tpr_perc}} | {{cv.mp2d_svm.tpr_perc}} |
| True negative rate | {{cv.mp2d_lazar_high_confidence.tnr_perc}} | {{cv.mp2d_lazar_all.tnr_perc}} | {{cv.mp2d_rf.tnr_perc}} | {{cv.mp2d_lr.tnr_perc}} | {{cv.mp2d_lr2.tnr_perc}} | {{cv.mp2d_nn.tnr_perc}} | {{cv.mp2d_svm.tnr_perc}} |
| Positive predictive value | {{cv.mp2d_lazar_high_confidence.ppv_perc}} | {{cv.mp2d_lazar_all.ppv_perc}} | {{cv.mp2d_rf.ppv_perc}} | {{cv.mp2d_lr.ppv_perc}} | {{cv.mp2d_lr2.ppv_perc}} | {{cv.mp2d_nn.ppv_perc}} | {{cv.mp2d_svm.ppv_perc}} |
| Negative predictive value | {{cv.mp2d_lazar_high_confidence.npv_perc}} | {{cv.mp2d_lazar_all.npv_perc}} | {{cv.mp2d_rf.npv_perc}} | {{cv.mp2d_lr.npv_perc}} | {{cv.mp2d_lr2.npv_perc}} | {{cv.mp2d_nn.npv_perc}} | {{cv.mp2d_svm.npv_perc}} |
| Nr. predictions | {{cv.mp2d_lazar_high_confidence.n}} | {{cv.mp2d_lazar_all.n}} | {{cv.mp2d_rf.n}} | {{cv.mp2d_lr.n}} | {{cv.mp2d_lr2.n}} | {{cv.mp2d_nn.n}} | {{cv.mp2d_svm.n}} |

: Summary of crossvalidation results with MolPrint2D descriptors (lazar-HC: lazar with high confidence, lazar-all: all lazar predictions, RF: random forests, LR-sgd: logistic regression (stochastic gradient descent), LR-scikit: logistic regression (scikit), NN: neural networks, SVM: support vector machines) {#tbl:cv-mp2d}

| | lazar-HC | lazar-all | RF | LR-sgd | LR-scikit | NN | SVM |
|:-|----------|-----------|----|--------|-----------|----|-----|
| Accuracy | {{cv.cdk_lazar_high_confidence.acc_perc}} | {{cv.cdk_lazar_all.acc_perc}} | {{cv.cdk_rf.acc_perc}} | {{cv.cdk_lr.acc_perc}} | {{cv.cdk_lr2.acc_perc}} | {{cv.cdk_nn.acc_perc}} | {{cv.cdk_svm.acc_perc}} |
| True positive rate | {{cv.cdk_lazar_high_confidence.tpr_perc}} | {{cv.cdk_lazar_all.tpr_perc}} | {{cv.cdk_rf.tpr_perc}} | {{cv.cdk_lr.tpr_perc}} | {{cv.cdk_lr2.tpr_perc}} | {{cv.cdk_nn.tpr_perc}} | {{cv.cdk_svm.tpr_perc}} |
| True negative rate | {{cv.cdk_lazar_high_confidence.tnr_perc}} | {{cv.cdk_lazar_all.tnr_perc}} | {{cv.cdk_rf.tnr_perc}} | {{cv.cdk_lr.tnr_perc}} | {{cv.cdk_lr2.tnr_perc}} | {{cv.cdk_nn.tnr_perc}} | {{cv.cdk_svm.tnr_perc}} |
| Positive predictive value | {{cv.cdk_lazar_high_confidence.ppv_perc}} | {{cv.cdk_lazar_all.ppv_perc}} | {{cv.cdk_rf.ppv_perc}} | {{cv.cdk_lr.ppv_perc}} | {{cv.cdk_lr2.ppv_perc}} | {{cv.cdk_nn.ppv_perc}} | {{cv.cdk_svm.ppv_perc}} |
| Negative predictive value | {{cv.cdk_lazar_high_confidence.npv_perc}} | {{cv.cdk_lazar_all.npv_perc}} | {{cv.cdk_rf.npv_perc}} | {{cv.cdk_lr.npv_perc}} | {{cv.cdk_lr2.npv_perc}} | {{cv.cdk_nn.npv_perc}} | {{cv.cdk_svm.npv_perc}} |
| Nr. predictions | {{cv.cdk_lazar_high_confidence.n}} | {{cv.cdk_lazar_all.n}} | {{cv.cdk_rf.n}} | {{cv.cdk_lr.n}} | {{cv.cdk_lr2.n}} | {{cv.cdk_nn.n}} | {{cv.cdk_svm.n}} |

: Summary of crossvalidation results with CDK descriptors (lazar-HC: lazar with high confidence, lazar-all: all lazar predictions, RF: random forests, LR-sgd: logistic regression (stochastic gradient descent), LR-scikit: logistic regression (scikit), NN: neural networks, SVM: support vector machines) {#tbl:cv-cdk}

@fig:roc depicts the position of all crossvalidation results in receiver operating characteristic (ROC) space.

![ROC plot of crossvalidation results (lazar-HC: lazar with high confidence, lazar-all: all lazar predictions, RF: random forests, LR-sgd: logistic regression (stochastic gradient descent), LR-scikit: logistic regression (scikit), NN: neural networks, SVM: support vector machines).](figures/roc.png){#fig:roc}

Confusion matrices for all models are available from the git repository https://git.in-silico.ch/mutagenicity-paper/tree/crossvalidations/confusion-matrices/; individual predictions can be found in https://git.in-silico.ch/mutagenicity-paper/tree/crossvalidations/predictions/.

All investigated algorithm/descriptor combinations give accuracies between 80 and 85%, which is equivalent to the experimental variability of the *Salmonella typhimurium* mutagenicity bioassay (80-85%, @Benigni1988). Sensitivities and specificities are balanced in all of these models.
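For reference, the metrics reported in @tbl:cv-mp2d and @tbl:cv-cdk follow the standard definitions derived from a 2x2 confusion matrix; the sketch below uses arbitrary example counts.

```python
# Standard definitions of the reported metrics, computed from a 2x2 confusion
# matrix (tp/fp/tn/fn: true/false positives and negatives).
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "true_positive_rate": tp / (tp + fn),      # sensitivity
        "true_negative_rate": tn / (tn + fp),      # specificity
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
    }

print(metrics(tp=420, fp=80, tn=400, fn=100))   # arbitrary example counts
```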
Pyrrolizidine alkaloid mutagenicity predictions
-----------------------------------------------

Mutagenicity predictions of {{pa.n}} pyrrolizidine alkaloids (PAs) from all investigated models can be downloaded from . A visual representation of all PA predictions can be found at .

@fig:pa-groups displays the proportion of positive mutagenicity predictions from all models for the different pyrrolizidine alkaloid groups. Tensorflow models predicted all {{pa.n}} pyrrolizidine alkaloids, `lazar` MP2D models predicted {{pa.mp2d_lazar_all.n}} compounds ({{pa.mp2d_lazar_high_confidence.n}} with high confidence) and `lazar` CDK models {{pa.cdk_lazar_all.n}} compounds ({{pa.cdk_lazar_high_confidence.n}} with high confidence).

![Summary of pyrrolizidine alkaloid predictions](figures/pa-groups.png){#fig:pa-groups}

For the visualisation of the position of pyrrolizidine alkaloids with respect to the training dataset we have applied t-distributed stochastic neighbor embedding (t-SNE, @Maaten2008) to MolPrint2D and CDK descriptors. t-SNE maps each high-dimensional object (chemical) to a two-dimensional point, approximately maintaining the high-dimensional distances of the objects: similar objects are represented by nearby points and dissimilar objects by distant points. t-SNE coordinates were calculated with the R `Rtsne` package using the default settings (perplexity = 30, theta = 0.5, max_iter = 1000).

@fig:tsne-mp2d shows the t-SNE of pyrrolizidine alkaloids (PA) and the mutagenicity training data in MP2D space (Tanimoto/Jaccard similarity), which basically reflects the structural diversity of the investigated compounds.

![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA) in MP2D space](figures/tsne-mp2d-mutagenicity.png){#fig:tsne-mp2d}

@fig:tsne-cdk shows the t-SNE of pyrrolizidine alkaloids (PA) and the mutagenicity training data in CDK space (Euclidean distance), which basically reflects the physical-chemical properties of the investigated compounds.

![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA) in CDK space](figures/tsne-cdk-mutagenicity.png){#fig:tsne-cdk}

@fig:tsne-mp2d-rf and @fig:tsne-cdk-lazar-all depict two example pyrrolizidine alkaloid mutagenicity predictions in the context of the training data. t-SNE visualisations of all investigated models can be downloaded from .

![t-SNE visualisation of MP2D random forest predictions](figures/tsne-mp2d-rf-classifications.png){#fig:tsne-mp2d-rf}

![t-SNE visualisation of all CDK lazar predictions](figures/tsne-cdk-lazar-all-classifications.png){#fig:tsne-cdk-lazar-all}

Discussion
==========

Data
----

A new training dataset for *Salmonella* mutagenicity was created from three different sources (@Kazius2005, @Hansen2009, @EFSA2016). It contains {{cv.n_uniq}} unique chemical structures, which is, to our knowledge, the largest public mutagenicity dataset presently available. The new training data can be downloaded from .

Algorithms
----------

`lazar` is formally a *k-nearest-neighbour* algorithm that searches for similar structures for a given compound and calculates the prediction based on the experimental data for these structures. The QSAR literature frequently calls such models *local models*, because models are generated specifically for each query compound. The investigated Tensorflow models are, in contrast, *global models*, i.e. a single model is used to make predictions for all compounds. It has been postulated in the past that local models are more accurate, because they can better account for mechanisms that affect only a subset of the training data.
@tbl:cv-mp2d, @tbl:cv-cdk and @fig:roc show that the crossvalidation accuracies of all models are comparable to the experimental variability of the *Salmonella typhimurium* mutagenicity bioassay (80-85% according to @Benigni1988). All of these models have balanced sensitivity (true positive rate) and specificity (true negative rate) and provide highly significant concordance with experimental data (as determined by McNemar's test). This is a clear indication that *in-silico* predictions can be as reliable as the bioassays. Given that the variability of experimental data is similar to model variability, it is impossible to decide which model gives the most accurate predictions, as models with higher accuracies might just approximate experimental errors better than more robust models.

Our results do not support the assumption that local models are superior to global models for classification purposes. For regression models (lowest observed effect level) we have found, however, that local models may outperform global models (@Helma2018), with accuracies similar to experimental variability.

As all investigated algorithms give similar accuracies, the selection will depend more on practical considerations than on intrinsic properties. Nearest neighbour algorithms like `lazar` have the practical advantage that the rationales for individual predictions can be presented in a straightforward manner that is understandable without a background in statistics or machine learning (@fig:lazar). This allows a critical examination of individual predictions and prevents blind trust in models that are not transparent to users with a toxicological background.

![Lazar screenshot of 12,21-Dihydroxy-4-methyl-4,8-secosenecinonan-8,11,16-trione mutagenicity prediction](figures/lazar-screenshot.png){#fig:lazar}

Descriptors
-----------

This study uses two types of descriptors for the characterisation of chemical structures:

*MolPrint2D* fingerprints (MP2D, @Bender2004) use atom environments (i.e. connected atom types for all atoms in a molecule) as molecular representation, which basically resembles the chemical concept of functional groups. MP2D descriptors are used to determine chemical similarities in the default `lazar` settings, and previous experiments have shown that they give more accurate results than predefined fingerprints (e.g. MACCS, FP2-4).

*Chemistry Development Kit* (CDK, @Willighagen2017) descriptors were calculated with the PaDEL graphical interface (@Yap2011). They include 1D and 2D topological descriptors as well as physical-chemical properties.

All investigated algorithms obtained models within the experimental variability for both types of descriptors (@tbl:cv-mp2d, @tbl:cv-cdk, @fig:roc). Given that similar predictive accuracies are obtainable with both types of descriptors, the choice depends once more on practical considerations:

MolPrint2D fragments can be calculated very efficiently for every well-defined chemical structure with OpenBabel (@OBoyle2011a). CDK descriptor calculations are, in contrast, much more resource intensive and may fail for a significant number of compounds ({{cv.cdk.n_failed}} from {{cv.n_uniq}}).

MolPrint2D fragments are generated dynamically from chemical structures and can be used to determine if a compound contains structural features that are absent in the training data. This feature can be used to determine applicability domains. CDK descriptors, in contrast, comprise a predefined set of descriptors with unknown toxicological relevance.

MolPrint2D fingerprints can be represented very compactly as sets of the features that are present in a given compound, which makes similarity calculations very efficient. Due to the large number of substructures present in the training compounds, they lead, however, to large and sparsely populated datasets if they have to be expanded to a binary matrix (e.g. as input for Tensorflow models). CDK descriptors, in contrast, always yield matrices with {{cv.cdk.n_descriptors}} columns, which can cause substantial computational overhead.
Pyrrolizidine alkaloid mutagenicity predictions
-----------------------------------------------

@fig:pa-groups shows a clear differentiation between the different pyrrolizidine alkaloid groups. The largest proportion of mutagenic predictions was observed for Otonecines ({{pa.groups.Otonecine.mut_perc}}%, {{pa.groups.Otonecine.mut}}/{{pa.groups.Otonecine.n_pred}}), the lowest for Monoesters ({{pa.groups.Monoester.mut_perc}}%, {{pa.groups.Monoester.mut}}/{{pa.groups.Monoester.n_pred}}) and N-oxides ({{pa.groups.N_oxide.mut_perc}}%, {{pa.groups.N_oxide.mut}}/{{pa.groups.N_oxide.n_pred}}).

Although most of the models show similar accuracies, sensitivities and specificities in crossvalidation experiments, some of the models (MP2D-RF, CDK-RF and CDK-SVM) predict a lower number of mutagens ({{pa.cdk_rf.mut_perc}}-{{pa.mp2d_rf.mut_perc}}%) than the majority of the models ({{pa.mp2d_svm.mut_perc}}-{{pa.mp2d_lazar_high_confidence.mut_perc}}%, @fig:pa-groups). `lazar`-CDK, on the other hand, predicts the largest number of mutagens for all groups with the exception of Otonecines. These differences between predictions from different algorithms and descriptors were not expected based on crossvalidation results.

In order to investigate if any of the investigated models show systematic errors in the vicinity of pyrrolizidine alkaloids, we have performed a detailed t-SNE analysis of all models (see @fig:tsne-mp2d-rf and @fig:tsne-cdk-lazar-all for two examples; all visualisations can be found at ). Nevertheless, none of the models showed obvious deviations from their expected behaviour, so the reason for the disagreement between some of the models remains unclear at the moment. It is, however, perfectly possible that some systematic errors are covered up by the conversion of high-dimensional spaces to two coordinates and are thus invisible in the t-SNE visualisations.

Conclusions
===========

A new public *Salmonella* mutagenicity training dataset with 8309 compounds was created and used to train `lazar` and Tensorflow models with MolPrint2D and CDK descriptors.

References
==========