---
title: A comparison of random forest, support vector machine, deep learning and lazar algorithms for predicting mutagenicity
#subtitle: Performance comparison with a new expanded dataset
author:
  - Christoph Helma:
      institute: ist
      email: helma@in-silico.ch
      correspondence: "yes"
  - Verena Schöning:
      institute: zeller
  - Philipp Boss:
      institute: sysbio
  - Jürgen Drewe:
      institute: zeller
institute:
  - ist:
      name: in silico toxicology gmbh
      address: "Rastatterstrasse 41, 4057 Basel, Switzerland"
  - zeller: 
      name: Zeller AG
      address: "Seeblickstrasse 4, 8590 Romanshorn, Switzerland"
  - sysbio:
      name: Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association
      address: "Robert-Rössle-Strasse 10, Berlin, 13125, Germany"
bibliography: bibliography.bib
keywords: mutagenicity, QSAR, lazar, random forest, support vector machine, deep learning

documentclass: scrartcl
tblPrefix: Table
figPrefix: Figure
header-includes:
    - \usepackage{setspace}
    - \doublespacing
    - \usepackage{lineno}
    - \linenumbers
...

Abstract
========

Random forest, support vector machine, deep learning and k-nearest-neighbour
(`lazar`) algorithms were applied to a new *Salmonella* mutagenicity dataset
with 8309 unique chemical structures. The best prediction accuracies in
10-fold crossvalidation were obtained with `lazar` models, which gave accuracies
similar to the interlaboratory variability of the Ames test.

Introduction
============

TODO

The main objectives of this study were

  - to generate a new training dataset, by combining the most comprehensive public mutagenicity datasets
  - to compare the performance of global models (RF, SVM, Neural Nets) with local models (`lazar`)

Materials and Methods
=====================

Data
----

An identical training dataset was used for all models. The
training dataset was compiled from the following sources:

-   Kazius/Bursi Dataset (4337 compounds, @Kazius2005): <http://cheminformatics.org/datasets/bursi/cas_4337.zip>

-   Hansen Dataset (6513 compounds, @Hansen2009): <http://doc.ml.tu-berlin.de/toxbenchmark/Mutagenicity_N6512.csv>

-   EFSA Dataset (695 compounds, @EFSA2016): <https://data.europa.eu/euodp/data/storage/f/2017-0719T142131/GENOTOX%20data%20and%20dictionary.xls>

Mutagenicity classifications from the Kazius and Hansen datasets were used
without further processing. To achieve consistency with these
datasets, EFSA compounds were classified as mutagenic if at least one
positive result was found for the TA98 or TA100 *Salmonella* strains.

Dataset merges were based on unique SMILES (*Simplified Molecular Input
Line Entry Specification*) strings of the compound structures.
Duplicated experimental data with the same outcome was merged into a
single value, because it is likely that it originated from the same
experiment. Contradictory results were kept as multiple measurements in
the database. The combined training dataset contains 8309 unique
structures.

Source code for all data download, extraction and merge operations is publicly
available from the git repository <https://git.in-silico.ch/mutagenicity-paper>
under a GPL3 License. The new combined dataset can be found at
<https://git.in-silico.ch/mutagenicity-paper/data/mutagenicity.csv>.

Algorithms
----------

### `lazar`

`lazar` (*lazy structure activity relationships*) is a modular framework
for read-across model development and validation. It follows this basic
workflow: for a given chemical structure, `lazar`:

-   searches in a database for similar structures (neighbours) with
    experimental data,

-   builds a local QSAR model with these neighbours and

-   uses this model to predict the unknown activity of the query
    compound.

This procedure resembles an automated version of read-across predictions
in toxicology; in machine learning terms it would be classified as a
k-nearest-neighbour algorithm.

Apart from this basic workflow, `lazar` is completely modular and allows
the researcher to use any algorithm for similarity searches and local
QSAR (*Quantitative structure--activity relationship*) modelling.
Algorithms used within this study are described in the following
sections.

#### Neighbour identification

Similarity calculations were based on MolPrint2D fingerprints (*MP2D*,
@Bender2004) from the OpenBabel cheminformatics library (@OBoyle2011a). The
MolPrint2D fingerprint uses atom environments as the molecular representation,
which basically resembles the chemical concept of functional groups. For each
atom in a molecule, it represents the chemical environment using the atom types
of connected atoms.

MolPrint2D fingerprints are generated dynamically from chemical
structures and do not rely on predefined lists of fragments (such as
OpenBabel FP3, FP4 or MACCS fingerprints or lists of
toxicophores/toxicophobes). This has the advantage that they may capture
substructures of toxicological relevance that are not included in other
fingerprints.

From MolPrint2D fingerprints a feature vector with all atom environments
of a compound can be constructed that can be used to calculate chemical
similarities.

The chemical similarity between two compounds $a$ and $b$ is expressed as
the ratio of the number of atom environments common to both structures
($A \cap B$) to the total number of atom environments ($A \cup B$)
(Jaccard/Tanimoto index).

$$sim = \frac{\left| A \cap B \right|}{\left| A \cup B \right|}$$
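As a minimal sketch, the Jaccard/Tanimoto index can be computed directly on sets of atom environments (the environment identifiers below are hypothetical placeholders, not actual MolPrint2D output):

```python
def tanimoto(a, b):
    """Jaccard/Tanimoto similarity between two sets of atom environments."""
    a, b = set(a), set(b)
    if not (a | b):  # two empty fingerprints: define similarity as 0
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical atom-environment identifiers for two compounds
env_a = {"C;C2", "C;CN", "N;C2", "O;C"}
env_b = {"C;C2", "C;CN", "O;C", "O;CO"}
print(tanimoto(env_a, env_b))  # 3 shared of 5 total environments -> 0.6
```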

Threshold selection is a trade-off between prediction accuracy (high
threshold) and the number of predictable compounds (low threshold). As
it is in many practical cases desirable to make predictions even in the
absence of closely related neighbours, we follow a tiered approach:

-   First a similarity threshold of 0.5 is used to collect neighbours,
    to create a local QSAR model and to make a prediction for the query
    compound. These are predictions with *high confidence*.

-   If any of these steps fails, the procedure is repeated with a
    similarity threshold of 0.2 and the prediction is flagged with a
    warning that it might be out of the applicability domain of the
    training data (*low confidence*).

-   Similarity thresholds of 0.5 and 0.2 are the default values chosen
    by the software developers and remained unchanged during the
    course of these experiments.

Compounds with the same structure as the query structure are
automatically eliminated from neighbours to obtain unbiased predictions
in the presence of duplicates.
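The tiered procedure can be sketched as follows (a simplified illustration of the workflow described above, not the actual `lazar` implementation; `similarity` stands for any similarity function such as the Tanimoto index):

```python
def collect_neighbours(query, training, similarity):
    """Tiered neighbour search: try the high-confidence threshold (0.5)
    first, then fall back to the low-confidence threshold (0.2).
    `training` holds (fingerprint, activity) pairs."""
    for threshold, confidence in ((0.5, "high"), (0.2, "low")):
        neighbours = []
        for fingerprint, activity in training:
            if fingerprint == query:  # exclude duplicates of the query structure
                continue
            sim = similarity(query, fingerprint)
            if sim >= threshold:
                neighbours.append((sim, activity))
        if neighbours:
            return neighbours, confidence
    return [], None  # no neighbours found: outside the applicability domain
```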

#### Local QSAR models and predictions

Only similar compounds (neighbours) above the threshold are used for
local QSAR models. In this investigation, we use a weighted
majority vote from the neighbours' experimental data for mutagenicity
classifications. Probabilities for both classes
(mutagenic/non-mutagenic) are calculated according to the following
formula and the class with the higher probability is used as prediction
outcome.

$$p_{c} = \frac{\sum \text{sim}_{n,c}}{\sum \text{sim}_{n}}$$

$p_{c}$ Probability of class $c$ (e.g. mutagenic or non-mutagenic)\
$\sum \text{sim}_{n,c}$ Sum of similarities of neighbours with
class $c$\
$\sum \text{sim}_{n}$ Sum of similarities of all neighbours
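A minimal sketch of this weighted majority vote (illustrative only, not the actual `lazar` code; neighbours are given as similarity/class pairs):

```python
from collections import defaultdict

def weighted_vote(neighbours):
    """Class probabilities p_c = (sum of similarities of neighbours in class c)
    / (sum of similarities of all neighbours); returns the winning class."""
    total = sum(sim for sim, _ in neighbours)
    class_sims = defaultdict(float)
    for sim, label in neighbours:
        class_sims[label] += sim
    probabilities = {label: s / total for label, s in class_sims.items()}
    prediction = max(probabilities, key=probabilities.get)
    return prediction, probabilities

# Three hypothetical neighbours with their similarities and experimental classes
pred, probs = weighted_vote([(0.8, "mutagenic"), (0.6, "mutagenic"),
                             (0.7, "non-mutagenic")])
# p_mutagenic = 1.4 / 2.1, p_non-mutagenic = 0.7 / 2.1
```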

#### Applicability domain

The applicability domain (AD) of `lazar` models is determined by the
structural diversity of the training data. If no similar compounds are
found in the training data no predictions will be generated. Warnings
are issued if the similarity threshold had to be lowered from 0.5 to 0.2
in order to enable predictions. Predictions without warnings can be
considered as close to the applicability domain (*high confidence*) and predictions with
warnings as more distant from the applicability domain (*low confidence*). Quantitative
applicability domain information can be obtained from the similarities
of individual neighbours.

#### Availability

-   `lazar` experiments for this manuscript:
    <https://git.in-silico.ch/mutagenicity-paper>
    (source code, GPL3)

-   `lazar` framework:
    <https://git.in-silico.ch/lazar>
    (source code, GPL3)

-   `lazar` GUI:
    <https://git.in-silico.ch/lazar-gui>
    (source code, GPL3)

-   Public web interface:
    <https://lazar.in-silico.ch>

### R Random Forest, Support Vector Machines, and Deep Learning

#### PaDEL descriptors

For the Random Forest (RF), Support Vector Machine (SVM), and Deep
Learning (DL) models, molecular 1D and 2D descriptors of the training
dataset were calculated with the PaDEL-Descriptors program
(<http://www.yapcwsoft.com> version 2.21, @Yap2011). The same descriptors were
used for the TensorFlow models.

TODO: **Verena** could you please describe the PaDEL descriptors in a bit more detail (which types, number, meaning etc.)

As the training dataset contained 8309 instances, it was decided to
delete instances with missing values during data pre-processing.
Furthermore, substances with equivocal outcome were removed. The final
training dataset contained 8080 instances with known mutagenic
potential. The RF, SVM, and DL models were generated using the R
software (R Project for Statistical Computing,
<https://www.r-project.org/>; version 3.3.1); specific R packages used
are identified for each step in the description below. During feature
selection, descriptors with near-zero variance were removed using the
'*nearZeroVar*' function (package 'caret'). A descriptor was classified
as having near-zero variance if the percentage of the most common value
was more than 90% or if the frequency ratio of the most common value to
the second most common value was greater than 95:5 (e.g. 95 instances of
the most common value and only 5 or fewer instances of the second most
common value). After that, highly correlated descriptors were removed
using the '*findCorrelation*' function (package 'caret') with a cut-off
of 0.9. This resulted in a training dataset with 516 descriptors. These
descriptors were scaled to the range between 0 and 1 using the
'*preProcess*' function (package 'caret'). The scaling routine was saved
in order to apply the same scaling to the testing dataset. As these
three steps did not consider the outcome, it was decided that they do
not need to be included in the cross-validation of the model. To further
reduce the number of features, a LASSO (*least absolute shrinkage and
selection operator*) regression was performed using the '*glmnet*'
function (package '*glmnet*'). The reduced dataset was used for the
generation of the pre-trained models.
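The descriptor-filtering rules described above can be sketched in Python with NumPy (an illustrative analogue of the caret functions, using the thresholds stated in the text; not the original R code):

```python
import numpy as np

def near_zero_variance(column, freq_cut=95 / 5, unique_cut=0.90):
    """Flag a descriptor column whose most common value covers more than 90%
    of all instances, or whose most-common/second-most-common frequency
    ratio exceeds 95:5."""
    counts = np.sort(np.unique(column, return_counts=True)[1])[::-1]
    if len(counts) == 1:  # constant column
        return True
    return (counts[0] / len(column) > unique_cut
            or counts[0] / counts[1] > freq_cut)

def drop_correlated(X, cutoff=0.9):
    """Greedily drop descriptors whose absolute pairwise correlation with an
    already-kept descriptor exceeds the cutoff (simplified findCorrelation)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= cutoff for k in keep):
            keep.append(j)
    return X[:, keep], keep

def min_max_scale(X):
    """Scale descriptors to [0, 1] and return the parameters so the identical
    scaling can be reapplied to a test set."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo), (lo, hi)
```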

#### Random Forest

For the RF model, the '*randomForest*' function (package
'*randomForest*') was used. A forest of 1000 trees with a maximum of 200
terminal nodes per tree was grown for the predictions.

#### Support Vector Machines

The '*svm*' function (package 'e1071') with a *radial basis function
kernel* was used for the SVM model.

#### Deep Learning

The DL model was generated using the '*h2o.deeplearning*' function
(package '*h2o*'). The DL model contained four hidden layers with 70, 50, 50,
and 10 neurons, respectively. Other hyperparameters were set as follows:
l1=1.0E-7, l2=1.0E-11, epsilon = 1.0E-10, rho = 0.8, and quantile\_alpha
= 0.5. For all other hyperparameters, the default values were used.
Weights and biases were first determined with an unsupervised
DL model. These values were then used for the actual, supervised DL
model.

TODO: **Verena** could you please check whether this is still correct and adapt Figure 1 if necessary

To validate these models, an internal cross-validation approach was
chosen. The training dataset was randomly split into training data, which
contained 95% of the data, and validation data, which contained 5% of the
data. A feature selection with LASSO was performed on the training data,
reducing the number of descriptors to approximately 100. This step was
repeated five times. Based on each of the five different training datasets,
the predictive models were trained and their performance tested with the
validation data. This step was repeated 10 times.

![Flowchart of the generation and validation of the models generated in R-project](figures/image1.png){#fig:valid}

#### Applicability domain

The AD of the training dataset and the PA dataset was evaluated using
the Jaccard distance. A Jaccard distance of '0' indicates that the
substances are similar, whereas a value of '1' shows that the substances
are different. The Jaccard distance was below 0.2 for all PAs relative
to the training dataset. Therefore, the PA dataset is within the AD of the
training dataset and the models can be used to predict the genotoxic
potential of the PA dataset.

### TensorFlow Deep Learning

Alternatively, a DL model was established with the Python-based TensorFlow
framework (<https://www.tensorflow.org/>) using the high-level API Keras
(<https://www.tensorflow.org/guide/keras>) to build the models.

TensorFlow models used the same PaDEL descriptors as the R models.

Data pre-processing was done by rank transformation using the
'*QuantileTransformer*' procedure. A sequential model with four layers
was used: an input layer, two hidden layers and one output layer (with
12, 8 and 8 nodes for the input and hidden layers, respectively). For
the output layer a sigmoidal activation function was used, and for all
other layers the ReLU ('*Rectified Linear Unit*') activation function.
Additionally, an L^2^-penalty of 0.001 was applied to the input layer.
For training of the model, the Adam algorithm was used to minimise the
cross-entropy loss using the default parameters of Keras. Training was
performed for 100 epochs with a batch size of 64. The model was
implemented with Python 3.6 and Keras. For model validation, 10-fold
cross-validation was used.

TODO: **Philipp** could you please check whether this description is still correct
and whether Verena's workflow (Figure 1) also applies to your models

Validation
----------

Results
=======

TODO: **Verena** and **Philipp**: could you please cross-check that I have no transposed digits in the results

R Models
--------

### Random Forest

10-fold crossvalidation of the R-RF model gave an accuracy of
{{R-RF.acc_perc}}%, a sensitivity of {{R-RF.tpr_perc}}% and a specificity of
{{R-RF.tnr_perc}}%.  The confusion matrix for {{R-RF.n}}
predictions is provided in @tbl:R-RF.

```{#tbl:R-RF .table file="tables/R-RF.csv" caption="Confusion matrix for R Random Forest predictions"}
```

### Support Vector Machines

10-fold crossvalidation of the R-SVM model gave an accuracy of
{{R-SVM.acc_perc}}%, a sensitivity of {{R-SVM.tpr_perc}}% and a specificity of
{{R-SVM.tnr_perc}}%.  The confusion matrix for {{R-SVM.n}}
predictions is provided in @tbl:R-SVM.

```{#tbl:R-SVM .table file="tables/R-SVM.csv" caption="Confusion matrix for R Support Vector Machine predictions"}
```

### Deep Learning

10-fold crossvalidation of the R-DL model gave an accuracy of
{{R-DL.acc_perc}}%, a sensitivity of {{R-DL.tpr_perc}}% and a specificity of
{{R-DL.tnr_perc}}%.  The confusion matrix for {{R-DL.n}}
predictions is provided in @tbl:R-DL.

```{#tbl:R-DL .table file="tables/R-DL.csv" caption="Confusion matrix for R Deep Learning predictions"}
```

TensorFlow Models
-----------------

### Without feature selection

10-fold crossvalidation of the TensorFlow DL model gave an accuracy of
{{tensorflow-all.acc_perc}}%, a sensitivity of {{tensorflow-all.tpr_perc}}% and a specificity of
{{tensorflow-all.tnr_perc}}%.  The confusion matrix for {{tensorflow-all.n}}
predictions is provided in @tbl:tensorflow-all.

```{#tbl:tensorflow-all .table file="tables/tensorflow-all.csv" caption="Confusion matrix for TensorFlow predictions without feature selection"}
```

### With feature selection

10-fold crossvalidation of the TensorFlow model with feature selection gave an accuracy of
{{tensorflow-selected.acc_perc}}%, a sensitivity of {{tensorflow-selected.tpr_perc}}% and a specificity of
{{tensorflow-selected.tnr_perc}}%.  The confusion matrix for {{tensorflow-selected.n}}
predictions is provided in @tbl:tensorflow-selected.

```{#tbl:tensorflow-selected .table file="tables/tensorflow-selected.csv" caption="Confusion matrix for TensorFlow predictions with feature selection"}
```

`lazar` Models
--------------

### MolPrint2D Descriptors

10-fold crossvalidation of the lazar model with MolPrint2D descriptors gave an accuracy of
{{lazar-all.acc_perc}}%, a sensitivity of {{lazar-all.tpr_perc}}% and a specificity of
{{lazar-all.tnr_perc}}%. 
The confusion matrix for {{lazar-all.n}}
predictions is provided in @tbl:lazar-all.

```{#tbl:lazar-all .table file="tables/lazar-all.csv" caption="Confusion matrix for lazar predictions with MolPrint2D descriptors"}
```

Predictions with high confidence had an accuracy of
{{lazar-high-confidence.acc_perc}}%, a sensitivity of {{lazar-high-confidence.tpr_perc}}% and a specificity of
{{lazar-high-confidence.tnr_perc}}%. 
The confusion matrix for {{lazar-high-confidence.n}}
predictions is provided in @tbl:lazar-high-confidence.


```{#tbl:lazar-high-confidence .table file="tables/lazar-high-confidence.csv" caption="Confusion matrix for high confidence lazar predictions with MolPrint2D descriptors"}
```

### PaDEL Descriptors

10-fold crossvalidation of the lazar model with PaDEL descriptors gave an accuracy of
{{lazar-padel-all.acc_perc}}%, a sensitivity of {{lazar-padel-all.tpr_perc}}% and a specificity of
{{lazar-padel-all.tnr_perc}}%. 
The confusion matrix for {{lazar-padel-all.n}}
predictions is provided in @tbl:lazar-padel-all.

```{#tbl:lazar-padel-all .table file="tables/lazar-padel-all.csv" caption="Confusion matrix for lazar predictions with PaDEL descriptors" }
```

Predictions with high confidence had an accuracy of
{{lazar-padel-high-confidence.acc_perc}}%, a sensitivity of {{lazar-padel-high-confidence.tpr_perc}}% and a specificity of
{{lazar-padel-high-confidence.tnr_perc}}%. 
The confusion matrix for {{lazar-padel-high-confidence.n}}
predictions is provided in @tbl:lazar-padel-high-confidence.

```{#tbl:lazar-padel-high-confidence .table file="tables/lazar-padel-high-confidence.csv" caption="Confusion matrix for high confidence lazar predictions with PaDEL descriptors"}
```

Summary
-------

The results of all crossvalidation experiments are summarized in @tbl:summary.

| |R-RF | R-SVM | R-DL | TF | TF-FS | L | L-HC | L-P | L-P-HC|
|-|-----|-------|------|----|-------|---|------|------|--------|
|Accuracy|{{R-RF.acc}}|{{R-SVM.acc}}|{{R-DL.acc}}|{{tensorflow-all.acc}}|{{tensorflow-selected.acc}}|{{lazar-all.acc}}|{{lazar-high-confidence.acc}}|{{lazar-padel-all.acc}}|{{lazar-padel-high-confidence.acc}}|
|Sensitivity|{{R-RF.tpr}}|{{R-SVM.tpr}}|{{R-DL.tpr}}|{{tensorflow-all.tpr}}|{{tensorflow-selected.tpr}}|{{lazar-all.tpr}}|{{lazar-high-confidence.tpr}}|{{lazar-padel-all.tpr}}|{{lazar-padel-high-confidence.tpr}}|
|Specificity|{{R-RF.tnr}}|{{R-SVM.tnr}}|{{R-DL.tnr}}|{{tensorflow-all.tnr}}|{{tensorflow-selected.tnr}}|{{lazar-all.tnr}}|{{lazar-high-confidence.tnr}}|{{lazar-padel-all.tnr}}|{{lazar-padel-high-confidence.tnr}}|
|PPV|{{R-RF.ppv}}|{{R-SVM.ppv}}|{{R-DL.ppv}}|{{tensorflow-all.ppv}}|{{tensorflow-selected.ppv}}|{{lazar-all.ppv}}|{{lazar-high-confidence.ppv}}|{{lazar-padel-all.ppv}}|{{lazar-padel-high-confidence.ppv}}|
|NPV|{{R-RF.npv}}|{{R-SVM.npv}}|{{R-DL.npv}}|{{tensorflow-all.npv}}|{{tensorflow-selected.npv}}|{{lazar-all.npv}}|{{lazar-high-confidence.npv}}|{{lazar-padel-all.npv}}|{{lazar-padel-high-confidence.npv}}|
|Nr. predictions|{{R-RF.n}}|{{R-SVM.n}}|{{R-DL.n}}|{{tensorflow-all.n}}|{{tensorflow-selected.n}}|{{lazar-all.n}}|{{lazar-high-confidence.n}}|{{lazar-padel-all.n}}|{{lazar-padel-high-confidence.n}}|

: Summary of crossvalidation results. *R-RF*: R Random Forests, *R-SVM*: R Support Vector Machines, *R-DL*: R Deep Learning, *TF*: TensorFlow without feature selection, *TF-FS*: TensorFlow with feature selection, *L*: lazar, *L-HC*: lazar high confidence predictions, *L-P*: lazar with PaDEL descriptors, *L-P-HC*: lazar PaDEL high confidence predictions, *PPV*: Positive predictive value (Precision), *NPV*: Negative predictive value {#tbl:summary}
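For reference, the statistics in the summary table can be derived from a confusion matrix as follows (the counts below are made up for illustration, not taken from the tables above):

```python
def validation_statistics(tp, fp, tn, fn):
    """Derive the summary statistics from confusion matrix counts
    (tp/fp/tn/fn = true/false positives and negatives)."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv":         tp / (tp + fp),   # positive predictive value (precision)
        "npv":         tn / (tn + fn),   # negative predictive value
        "n":           tp + fp + tn + fn,
    }

# Illustrative counts only
stats = validation_statistics(tp=400, fp=50, tn=450, fn=100)
```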

@fig:roc shows the position of crossvalidation results in receiver operating characteristic (ROC) space.

![ROC plot of crossvalidation results. *R-RF*: R Random Forests, *R-SVM*: R Support Vector Machines, *R-DL*: R Deep Learning, *TF*: TensorFlow without feature selection, *TF-FS*: TensorFlow with feature selection, *L*: lazar, *L-HC*: lazar high confidence predictions, *L-P*: lazar with PaDEL descriptors, *L-P-HC*: lazar PaDEL high confidence predictions (overlaps with L-P)](figures/roc.png){#fig:roc}

Discussion
==========

Data
----

A new training dataset for *Salmonella* mutagenicity was created from three
different sources (@Kazius2005, @Hansen2009, @EFSA2016). It contains 8309
unique chemical structures, which is to our knowledge the largest
public mutagenicity dataset presently available. The new training data can be
downloaded from
<https://git.in-silico.ch/mutagenicity-paper/data/mutagenicity.csv>.

Model performance
-----------------

@tbl:summary and @fig:roc show that the standard `lazar` algorithm (with MP2D
fingerprints) gives the most accurate crossvalidation results. R Random Forest,
Support Vector Machine and TensorFlow models have similar accuracies with
balanced sensitivity (true positive rate) and specificity (true negative rate).
`lazar` models with PaDEL descriptors have low sensitivity, and R Deep Learning
models have low specificity.

The accuracy of `lazar` *in-silico* predictions is comparable to the
interlaboratory variability of the Ames test (80-85% according to
@Benigni1988), especially for predictions with high confidence
({{lazar-high-confidence.acc_perc}}%). This is a clear indication that
*in-silico* predictions can be as reliable as the bioassays, if the compounds
are close to the applicability domain. This conclusion is also supported by our
analysis of `lazar` lowest observed effect level predictions, which are also
similar to the experimental variability (@Helma2018).

The lowest number of predictions ({{lazar-padel-high-confidence.n}}) was
obtained from `lazar`/PaDEL high confidence predictions; the largest number of
predictions comes from TensorFlow models ({{tensorflow-all.n}}). Standard
`lazar` gives a slightly lower number of predictions ({{lazar-all.n}}) than the R
and TensorFlow models. This is not necessarily a disadvantage, because `lazar`
abstains from predictions if the query compound is very dissimilar from the
compounds in the training set and thus avoids making predictions for compounds
that do not fall into its applicability domain. 

There are two major differences between `lazar` and R/TensorFlow models, which
might explain the different prediction accuracies:

- `lazar` uses MolPrint2D fingerprints, while all other models use PaDEL descriptors
- `lazar` creates local models for each query compound and the other models use a single global model for all predictions

We will discuss both options in the following sections.

Descriptors
-----------

This study uses two types of descriptors to characterize chemical structures.

MolPrint2D fingerprints (MP2D, @Bender2004) use atom environments (i.e.
connected atoms for all atoms in a molecule) as the molecular representation,
which basically resembles the chemical concept of functional groups. MP2D
descriptors are used to determine chemical similarities in `lazar`, and previous
experiments have shown that they give more accurate results than predefined
descriptors (e.g. MACCS, FP2-4) for all investigated endpoints.

PaDEL calculates topological and physical-chemical descriptors.

TODO: **Verena** could you please briefly describe the descriptors again

PaDEL descriptors were used for the R and TensorFlow models. In addition, we
have used PaDEL descriptors to calculate cosine similarities for the `lazar`
algorithm and compared the results with standard MP2D similarities, which led
to a significant decrease of `lazar` prediction accuracies. Based on this
result we can conclude that PaDEL descriptors are less suited for similarity
calculations than MP2D descriptors.

In order to investigate whether MP2D fingerprints are also a better option for
global models, we have tried to build R and TensorFlow models both with and
without unsupervised feature selection. Unfortunately, none of the algorithms
was capable of dealing with the large and sparsely populated descriptor matrix.
Based on this result we can conclude that MP2D descriptors are at the moment
unsuitable for standard global machine learning algorithms. Please note that
`lazar` does not suffer from the sparseness problem, because (a) it internally
utilizes a much more efficient occurrence-based representation and (b) it
uses fingerprints only for similarity calculations and not as model parameters.

Based on these results we can conclude that PaDEL descriptors are less suited
for similarity calculations than MP2D fingerprints and that current standard
machine learning algorithms are not capable of utilizing chemical fingerprints.

Algorithms
----------

`lazar` is formally a *k-nearest-neighbour* algorithm that searches for similar
structures for a given compound and calculates the prediction based on the
experimental data for these structures. The QSAR literature frequently calls
such models *local models*, because models are generated specifically for each
query compound. R and TensorFlow models are in contrast *global models*, i.e. a
single model is used to make predictions for all compounds. It has been
postulated in the past that local models are more accurate, because they can
better account for mechanisms that affect only a subset of the training data.
Our results seem to support this assumption, because `lazar` models perform
better than global models. However, both types of models use different
descriptors, and for this reason we cannot draw a definitive conclusion whether
the model algorithm or the descriptor type is the reason for the observed
differences. In order to answer this question, we would have to use global
modelling algorithms that are capable of handling large, sparse binary matrices.

Conclusions
===========

A new public *Salmonella* mutagenicity training dataset with 8309 compounds was
created and used to train `lazar`, R and TensorFlow models. The best
performance was obtained with `lazar` models using MolPrint2D descriptors, with
prediction accuracies comparable to the interlaboratory variability of the Ames
test. Differences between algorithms (local vs. global models) and/or
descriptors (MolPrint2D vs. PaDEL) may be responsible for the different
prediction accuracies. 

References
==========