--- title: A comparison of nine machine learning models based on an expanded mutagenicity dataset and their application for predicting pyrrolizidine alkaloid mutagenicity author: - Christoph Helma: institute: ist email: helma@in-silico.ch correspondence: "yes" - Verena Schöning: institute: insel - Philipp Boss: institute: sysbio - Jürgen Drewe: institute: zeller institute: - ist: name: in silico toxicology gmbh address: "Rastatterstrasse 41, 4057 Basel, Switzerland" - zeller: name: Zeller AG address: "Seeblickstrasse 4, 8590 Romanshorn, Switzerland" - sysbio: name: Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association address: "Robert-Rössle-Strasse 10, Berlin, 13125, Germany" - insel: name: Clinical Pharmacology and Toxicology, Department of General Internal Medicine, Bern University Hospital, University of Bern address: "Inselspital, 3010 Bern, Switzerland" bibliography: bibliography.bib keywords: mutagenicity, QSAR, lazar, random forest, support vector machine, linear regression, neural nets, deep learning, pyrrolizidine alkaloids, OpenBabel, CDK documentclass: scrartcl tblPrefix: Table figPrefix: Figure header-includes: - \usepackage{lineno, setspace, color, colortbl, longtable} - \doublespacing - \linenumbers ... Abstract ======== Random forest, support vector machine, logistic regression, neural networks and k-nearest neighbor (`lazar`) algorithms, were applied to new *Salmonella* mutagenicity dataset with {{cv.n_uniq}} unique chemical structures. **TODO**: PA results Introduction ============ **TODO**: rationale for investigation The main objectives of this study were - to generate a new mutagenicity training dataset, by combining the most comprehensive public datasets - to compare the performance of MolPrint2D (*MP2D*) fingerprints with Chemistry Development Kit (*CDK*) descriptors - to compare the performance of global QSAR models (random forests (*RF*), support vector machines (*SVM*), logistic regression (*LR*), neural nets (*NN*)) with local models (`lazar`) - to apply these models for the prediction of pyrrolizidine alkaloid mutagenicity Materials and Methods ===================== Data ---- ### Mutagenicity training data An identical training dataset was used for all models. The training dataset was compiled from the following sources: - Kazius/Bursi Dataset (4337 compounds, @Kazius2005): - Hansen Dataset (6513 compounds, @Hansen2009): - EFSA Dataset (695 compounds @EFSA2016): Mutagenicity classifications from Kazius and Hansen datasets were used without further processing. To achieve consistency with these datasets, EFSA compounds were classified as mutagenic, if at least one positive result was found for TA98 or T100 Salmonella strains. Dataset merges were based on unique SMILES (*Simplified Molecular Input Line Entry Specification*) strings of the compound structures. Duplicated experimental data with the same outcome was merged into a single value, because it is likely that it originated from the same experiment. Contradictory results were kept as multiple measurements in the database. The combined training dataset contains {{cv.n_uniq}} unique structures and {{cv.n}} individual measurements. Source code for all data download, extraction and merge operations is publicly available from the git repository under a GPL3 License. The new combined dataset can be found at . ### Pyrrolizidine alkaloid (PA) dataset The pyrrolizidine alkaloid dataset was created from five independent, necine base substructure searches in PubChem (https://pubchem.ncbi.nlm.nih.gov/) and compared to the PAs listed in the EFSA publication @EFSA2011 and the book by Mattocks @Mattocks1986, to ensure, that all major PAs were included. PAs mentioned in these publications which were not found in the downloaded substances were searched individually in PubChem and, if available, downloaded separately. Non-PA substances, duplicates, and isomers were removed from the files, but artificial PAs, even if unlikely to occur in nature, were kept. The resulting PA dataset comprised a total of {{pa.n}} different PAs. The PAs in the dataset were classified according to structural features. A total of 9 different structural features were assigned to the necine base, modifications of the necine base and to the necic acid: For the necine base, the following structural features were chosen: - Retronecine-type (1,2-unstaturated necine base, {{pa.groups.Retronecine.n}} compounds) - Otonecine-type (1,2-unstaturated necine base, {{pa.groups.Otonecine.n}} compounds) - Platynecine-type (1,2-saturated necine base, {{pa.groups.Platynecine.n}} compounds) For the modifications of the necine base, the following structural features were chosen: - N-oxide-type ({{pa.groups.N_oxide.n}} compounds) - Tertiary-type (PAs which were neither from the N-oxide- nor DHP-type, {{pa.groups.Tertiary_PA.n}} compounds) - Dehydropyrrolizidine-type (pyrrolic ester, {{pa.groups.Dehydropyrrolizidine.n}} compounds) For the necic acid, the following structural features were chosen: - Monoester-type ({{pa.groups.Monoester.n}} compounds) - Open-ring diester-type ({{pa.groups.Diester.n}} compounds) - Macrocyclic diester-type ({{pa.groups.Macrocyclic_diester.n}} compounds) The compilation of the PA dataset is described in detail in @Schoening2017. Descriptors ----------- ### MolPrint2D (*MP2D*) fingerprints MolPrint2D fingerprints (@OBoyle2011a) use atom environments as molecular representation. They determine for each atom in a molecule, the atom types of its connected atoms to represent their chemical environment. This resembles basically the chemical concept of functional groups. In contrast to predefined lists of fragments (e.g. FP3, FP4 or MACCs fingerprints) or descriptors (e.g CDK) they are generated dynamically from chemical structures. This has the advantage that they can capture unknown substructures of toxicological relevance that are not included in other descriptors. In addition they allow the efficient calculation of chemical similarities (e.g. Tanimoto indices) with simple set operations. MolPrint2D fingerprints were calculated with the OpenBabel cheminformatics library (@OBoyle2011a). They can be obtained from the following locations: *Training data:* - sparse representation () - descriptor matrix () *Pyrrolizidine alkaloids:* - sparse representation () - descriptor matrix () #### Chemistry Development Kit (*CDK*) descriptors Molecular 1D and 2D descriptors were calculated with the PaDEL-Descriptors program ( version 2.21, @Yap2011). PaDEL uses the Chemistry Development Kit (*CDK*, ) library for descriptor calculations. As the training dataset contained {{cv.n_uniq}} instances, it was decided to delete instances with missing values during data pre-processing. Furthermore, substances with equivocal outcome were removed. The final training dataset contained {{cv.cdk.n_descriptors}} descriptors for {{cv.cdk.n_compounds}} compounds. CDK training data can be obtained from . The same procedure was applied for the pyrrolizidine dataset yielding {{pa.cdk.n_descriptors}} descriptors for {{pa.cdk.n_compounds}} compounds. CDK features for pyrrolizidine alkaloids are available at . Algorithms ---------- ### `lazar` `lazar` (*lazy structure activity relationships*) is a modular framework for read-across model development and validation. It follows the following basic workflow: For a given chemical structure `lazar`: - searches in a database for similar structures (neighbours) with experimental data, - builds a local QSAR model with these neighbours and - uses this model to predict the unknown activity of the query compound. This procedure resembles an automated version of read across predictions in toxicology, in machine learning terms it would be classified as a k-nearest-neighbour algorithm. Apart from this basic workflow, `lazar` is completely modular and allows the researcher to use arbitrary algorithms for similarity searches and local QSAR (*Quantitative structure--activity relationship*) modelling. Algorithms used within this study are described in the following sections. #### Neighbour identification Utilizing this modularity, similarity calculations were based both on MolPrint2D fingerprints and on CDK descriptors. For MolPrint2D fingerprints chemical similarity between two compounds $a$ and $b$ is expressed as the proportion between atom environments common in both structures $A \cap B$ and the total number of atom environments $A \cup B$ (Jaccard/Tanimoto index). $$sim = \frac{\lvert A\ \cap B \rvert}{\lvert A\ \cup B \rvert}$$ For CDK descriptors chemical similarity between two compounds $a$ and $b$ is expressed as the cosine similarity between the descriptor vectors $A$ for $a$ and $B$ for $b$. $$sim = \frac{A \cdot B}{\lvert A \rvert \lvert B \rvert}$$ Threshold selection is a trade-off between prediction accuracy (high threshold) and the number of predictable compounds (low threshold). As it is in many practical cases desirable to make predictions even in the absence of closely related neighbours, we follow a tiered approach: - First a similarity threshold of 0.5 is used to collect neighbours, to create a local QSAR model and to make a prediction for the query compound. This are predictions with *high confidence*. - If any of these steps fails, the procedure is repeated with a similarity threshold of 0.2 and the prediction is flagged with a warning that it might be out of the applicability domain of the training data (*low confidence*). - Similarity thresholds of 0.5 and 0.2 are the default values chosen by the software developers and remained unchanged during the course of these experiments. Compounds with the same structure as the query structure are automatically eliminated from neighbours to obtain unbiased predictions in the presence of duplicates. #### Local QSAR models and predictions Only similar compounds (neighbours) above the threshold are used for local QSAR models. In this investigation, we are using a weighted majority vote from the neighbour's experimental data for mutagenicity classifications. Probabilities for both classes (mutagenic/non-mutagenic) are calculated according to the following formula and the class with the higher probability is used as prediction outcome. $$p_{c} = \ \frac{\sum_{}^{}\text{sim}_{n,c}}{\sum_{}^{}\text{sim}_{n}}$$ $p_{c}$ Probability of class c (e.g. mutagenic or non-mutagenic)\ $\sum_{}^{}\text{sim}_{n,c}$ Sum of similarities of neighbours with class c\ $\sum_{}^{}\text{sim}_{n}$ Sum of all neighbours #### Applicability domain The applicability domain (AD) of `lazar` models is determined by the structural diversity of the training data. If no similar compounds are found in the training data no predictions will be generated. Warnings are issued if the similarity threshold had to be lowered from 0.5 to 0.2 in order to enable predictions. Predictions without warnings can be considered as close to the applicability domain (*high confidence*) and predictions with warnings as more distant from the applicability domain (*low confidence*). Quantitative applicability domain information can be obtained from the similarities of individual neighbours. #### Availability - Source code for this manuscript (GPL3): - Crossvalidation experiments (GPL3): - Pyrrolizidine alkaloid predictions (GPL3): - Public web interface: ### Tensorflow models **TODO**: **Philipp** Kannst Du bitte die folgenden Absaetze ergaenzen und die Vorgangsweise fuer MP2D/CDK bzw CV/PA Vorhersagen beschreiben. #### Random forests (*RF*) #### Logistic regression (SGD) (*LR-sgd*) #### Logistic regression (scikit) (*LR-scikit*) #### Neural Nets (*NN*) #### Support vector machines (*SVM*) Validation ---------- 10-fold cross-validation was used for all Tensorflow models. #### Availability Jupyter notebooks for these experiments can be found at the following locations *Crossvalidation:* - MolPrint2D fingerprints: - CDK descriptors: *Pyrrolizidine alkaloids:* - MolPrint2D fingerprints: - CDK descriptors: - CDK desc Results ======= 10-fold crossvalidations ------------------------ Crossvalidation results are summarized in the following tables: @tbl:cv-mp2d shows results with MolPrint2D descriptors and @tbl:cv-cdk with CDK descriptors. | | lazar-HC | lazar-all | RF | LR-sgd | LR-scikit | NN | SVM | |:-|----------|-----------|----|--------|-----------|----|-----| Accuracy | {{cv.mp2d_lazar_high_confidence.acc_perc}} | {{cv.mp2d_lazar_all.acc_perc}} | {{cv.mp2d_rf.acc_perc}} | {{cv.mp2d_lr.acc_perc}} | {{cv.mp2d_lr2.acc_perc}} | {{cv.mp2d_nn.acc_perc}} | {{cv.mp2d_svm.acc_perc}} | True positive rate | {{cv.mp2d_lazar_high_confidence.tpr_perc}} | {{cv.mp2d_lazar_all.tpr_perc}} | {{cv.mp2d_rf.tpr_perc}} | {{cv.mp2d_lr.tpr_perc}} | {{cv.mp2d_lr2.tpr_perc}} | {{cv.mp2d_nn.tpr_perc}} | {{cv.mp2d_svm.tpr_perc}} | True negative rate | {{cv.mp2d_lazar_high_confidence.tnr_perc}} | {{cv.mp2d_lazar_all.tnr_perc}} | {{cv.mp2d_rf.tnr_perc}} | {{cv.mp2d_lr.tnr_perc}} | {{cv.mp2d_lr2.tnr_perc}} | {{cv.mp2d_nn.tnr_perc}} | {{cv.mp2d_svm.tnr_perc}} | Positive predictive value | {{cv.mp2d_lazar_high_confidence.ppv_perc}} | {{cv.mp2d_lazar_all.ppv_perc}} | {{cv.mp2d_rf.ppv_perc}} | {{cv.mp2d_lr.ppv_perc}} | {{cv.mp2d_lr2.ppv_perc}} | {{cv.mp2d_nn.ppv_perc}} | {{cv.mp2d_svm.ppv_perc}} | Negative predictive value | {{cv.mp2d_lazar_high_confidence.npv_perc}} | {{cv.mp2d_lazar_all.npv_perc}} | {{cv.mp2d_rf.npv_perc}} | {{cv.mp2d_lr.npv_perc}} | {{cv.mp2d_lr2.npv_perc}} | {{cv.mp2d_nn.npv_perc}} | {{cv.mp2d_svm.npv_perc}} | Nr. predictions | {{cv.mp2d_lazar_high_confidence.n}} | {{cv.mp2d_lazar_all.n}} | {{cv.mp2d_rf.n}} | {{cv.mp2d_lr.n}} | {{cv.mp2d_lr2.n}} | {{cv.mp2d_nn.n}} | {{cv.mp2d_svm.n}} | : Summary of crossvalidation results with MolPrint2D descriptors (lazar-HC: lazar with high confidence, lazar-all: all lazar predictions, RF: random forests, LR-sgd: logistic regression (stochastic gradient descent), LR-scikit: logistic regression (scikit), NN: neural networks, SVM: support vector machines) {#tbl:cv-mp2d} | | lazar-HC | lazar-all | RF | LR-sgd | LR-scikit | NN | SVM | |:-|----------|-----------|----|--------|-----------|----|-----| Accuracy | {{cv.cdk_lazar_high_confidence.acc_perc}} | {{cv.cdk_lazar_all.acc_perc}} | {{cv.cdk_rf.acc_perc}} | {{cv.cdk_lr.acc_perc}} | {{cv.cdk_lr2.acc_perc}} | {{cv.cdk_nn.acc_perc}} | {{cv.cdk_svm.acc_perc}} | True positive rate | {{cv.cdk_lazar_high_confidence.tpr_perc}} | {{cv.cdk_lazar_all.tpr_perc}} | {{cv.cdk_rf.tpr_perc}} | {{cv.cdk_lr.tpr_perc}} | {{cv.cdk_lr2.tpr_perc}} | {{cv.cdk_nn.tpr_perc}} | {{cv.cdk_svm.tpr_perc}} | True negative rate | {{cv.cdk_lazar_high_confidence.tnr_perc}} | {{cv.cdk_lazar_all.tnr_perc}} | {{cv.cdk_rf.tnr_perc}} | {{cv.cdk_lr.tnr_perc}} | {{cv.cdk_lr2.tnr_perc}} | {{cv.cdk_nn.tnr_perc}} | {{cv.cdk_svm.tnr_perc}} | Positive predictive value | {{cv.cdk_lazar_high_confidence.ppv_perc}} | {{cv.cdk_lazar_all.ppv_perc}} | {{cv.cdk_rf.ppv_perc}} | {{cv.cdk_lr.ppv_perc}} | {{cv.cdk_lr2.ppv_perc}} | {{cv.cdk_nn.ppv_perc}} | {{cv.cdk_svm.ppv_perc}} | Negative predictive value | {{cv.cdk_lazar_high_confidence.npv_perc}} | {{cv.cdk_lazar_all.npv_perc}} | {{cv.cdk_rf.npv_perc}} | {{cv.cdk_lr.npv_perc}} | {{cv.cdk_lr2.npv_perc}} | {{cv.cdk_nn.npv_perc}} | {{cv.cdk_svm.npv_perc}} | Nr. predictions | {{cv.cdk_lazar_high_confidence.n}} | {{cv.cdk_lazar_all.n}} | {{cv.cdk_rf.n}} | {{cv.cdk_lr.n}} | {{cv.cdk_lr2.n}} | {{cv.cdk_nn.n}} | {{cv.cdk_svm.n}} | : Summary of crossvalidation results with CDK descriptors (lazar-HC: lazar with high confidence, lazar-all: all lazar predictions, RF: random forests, LR-sgd: logistic regression (stochastic gradient descent), LR-scikit: logistic regression (scikit), NN: neural networks, SVM: support vector machines) {#tbl:cv-cdk} @fig:roc depicts the position of all crossvalidation results in receiver operating characteristic (ROC) space. ![ROC plot of crossvalidation results (lazar-HC: lazar with high confidence, lazar-all: all lazar predictions, RF: random forests, LR-sgd: logistic regression (stochastic gradient descent), LR-scikit: logistic regression (scikit), NN: neural networks, SVM: support vector machines).](figures/roc.png){#fig:roc} Confusion matrices for all models are available from the git repository https://git.in-silico.ch/mutagenicity-paper/tree/crossvalidations/confusion-matrices/, individual predictions can be found in https://git.in-silico.ch/mutagenicity-paper/tree/crossvalidations/predictions/. With exception of lazar/CDK all investigated algorithm/descriptor combinations give accuracies between (80 and 85%) which is equivalent to the experimental variability of the *Salmonella typhimurium* mutagenicity bioassay (80-85%, @Benigni1988). Sensitivities and specificities are balanced in all of these models. Pyrrolizidine alkaloid mutagenicity predictions ----------------------------------------------- Mutagenicity predictions from all investigated models for {{pa.n}} pyrrolizidine alkaloids (PAs) can be downloaded from . A visual representation of all PA predictions can be found at . @tbl:pa-mp2d and @tbl:pa-cdk summarise the outcome of pyrrolizidine alkaloid predictions from all models with MolPrint2D and CDK descriptors. | Model | mutagenic | non-mutagenic | Nr. predictions | |-------:|-----------|---------------|-----------------| | lazar-all | {{pa.mp2d_lazar_all.mut_perc}}% ({{pa.mp2d_lazar_all.mut}}) | {{pa.mp2d_lazar_all.non_mut_perc}}% ({{pa.mp2d_lazar_all.non_mut}}) | {{pa.mp2d_lazar_all.n_perc}}% ({{pa.mp2d_lazar_all.n}}) | | lazar-HC | {{pa.mp2d_lazar_high_confidence.mut_perc}}% ({{pa.mp2d_lazar_high_confidence.mut}}) | {{pa.mp2d_lazar_high_confidence.non_mut_perc}}% ({{pa.mp2d_lazar_high_confidence.non_mut}}) | {{pa.mp2d_lazar_high_confidence.n_perc}}% ({{pa.mp2d_lazar_high_confidence.n}}) | | RF | {{pa.mp2d_rf.mut_perc}}% ({{pa.mp2d_rf.mut}}) | {{pa.mp2d_rf.non_mut_perc}}% ({{pa.mp2d_rf.non_mut}}) | {{pa.mp2d_rf.n_perc}}% ({{pa.mp2d_rf.n}}) | | LR-sgd | {{pa.mp2d_lr.mut_perc}}% ({{pa.mp2d_lr.mut}}) | {{pa.mp2d_lr.non_mut_perc}}% ({{pa.mp2d_lr.non_mut}}) | {{pa.mp2d_lr.n_perc}}% ({{pa.mp2d_lr.n}}) | | LR-scikit | {{pa.mp2d_lr2.mut_perc}}% ({{pa.mp2d_lr2.mut}}) | {{pa.mp2d_lr2.non_mut_perc}}% ({{pa.mp2d_lr2.non_mut}}) | {{pa.mp2d_lr2.n_perc}}% ({{pa.mp2d_lr2.n}}) | | NN | {{pa.mp2d_nn.mut_perc}}% ({{pa.mp2d_nn.mut}}) | {{pa.mp2d_nn.non_mut_perc}}% ({{pa.mp2d_nn.non_mut}}) | {{pa.mp2d_nn.n_perc}}% ({{pa.mp2d_nn.n}}) | | SVM | {{pa.mp2d_svm.mut_perc}}% ({{pa.mp2d_svm.mut}}) | {{pa.mp2d_svm.non_mut_perc}}% ({{pa.mp2d_svm.non_mut}}) | {{pa.mp2d_svm.n_perc}}% ({{pa.mp2d_svm.n}}) | : Summary of MolPrint2D pyrrolizidine alkaloid predictions {#tbl:pa-mp2d} | Model | mutagenic | non-mutagenic | Nr. predictions | |-------:|-----------|---------------|-----------------| | lazar-all | {{pa.cdk_lazar_all.mut_perc}}% ({{pa.cdk_lazar_all.mut}}) | {{pa.cdk_lazar_all.non_mut_perc}}% ({{pa.cdk_lazar_all.non_mut}}) | {{pa.cdk_lazar_all.n_perc}}% ({{pa.cdk_lazar_all.n}}) | | lazar-HC | {{pa.cdk_lazar_high_confidence.mut_perc}}% ({{pa.cdk_lazar_high_confidence.mut}}) | {{pa.cdk_lazar_high_confidence.non_mut_perc}}% ({{pa.cdk_lazar_high_confidence.non_mut}}) | {{pa.cdk_lazar_high_confidence.n_perc}}% ({{pa.cdk_lazar_high_confidence.n}}) | | RF | {{pa.cdk_rf.mut_perc}}% ({{pa.cdk_rf.mut}}) | {{pa.cdk_rf.non_mut_perc}}% ({{pa.cdk_rf.non_mut}}) | {{pa.cdk_rf.n_perc}}% ({{pa.cdk_rf.n}}) | | LR-sgd | {{pa.cdk_lr.mut_perc}}% ({{pa.cdk_lr.mut}}) | {{pa.cdk_lr.non_mut_perc}}% ({{pa.cdk_lr.non_mut}}) | {{pa.cdk_lr.n_perc}}% ({{pa.cdk_lr.n}}) | | LR-scikit | {{pa.cdk_lr2.mut_perc}}% ({{pa.cdk_lr2.mut}}) | {{pa.cdk_lr2.non_mut_perc}}% ({{pa.cdk_lr2.non_mut}}) | {{pa.cdk_lr2.n_perc}}% ({{pa.cdk_lr2.n}}) | | NN | {{pa.cdk_nn.mut_perc}}% ({{pa.cdk_nn.mut}}) | {{pa.cdk_nn.non_mut_perc}}% ({{pa.cdk_nn.non_mut}}) | {{pa.cdk_nn.n_perc}}% ({{pa.cdk_nn.n}}) | | SVM | {{pa.cdk_svm.mut_perc}}% ({{pa.cdk_svm.mut}}) | {{pa.cdk_svm.non_mut_perc}}% ({{pa.cdk_svm.non_mut}}) | {{pa.cdk_svm.n_perc}}% ({{pa.cdk_svm.n}}) | : Summary of CDK pyrrolizidine alkaloid predictions {#tbl:pa-cdk} @fig:dhp - @fig:tert display the proportion of positive mutagenicity predictions from all models for the different pyrrolizidine alkaloid groups. ![Summary of Dehydropyrrolizidine predictions](figures/Dehydropyrrolizidine.png){#fig:dhp} ![Summary of Diester predictions](figures/Diester.png){#fig:die} ![Summary of Macrocyclic-diester predictions](figures/Macrocyclic.diester.png){#fig:mcdie} ![Summary of Monoester predictions](figures/Monoester.png){#fig:me} ![Summary of N-oxide predictions](figures/N.oxide.png){#fig:nox} ![Summary of Otonecine predictions](figures/Otonecine.png){#fig:oto} ![Summary of Platynecine predictions](figures/Platynecine.png){#fig:plat} ![Summary of Retronecine predictions](figures/Retronecine.png){#fig:ret} ![Summary of Tertiary PA predictions](figures/Tertiary.PA.png){#fig:tert} For the visualisation of the position of pyrrolizidine alkaloids in respect to the training data set we have applied t-distributed stochastic neighbor embedding (t-SNE, @Maaten2008) for MolPrint2D and CDK descriptors. t-SNE maps each high-dimensional object (chemical) to a two-dimensional point, maintaining the high-dimensional distances of the objects. Similar objects are represented by nearby points and dissimilar objects are represented by distant points. @fig:tsne-mp2d shows the t-SNE of pyrrolizidine alkaloids (PA) and the mutagenicity training data in MP2D space (Tanimoto/Jaccard similarity). ![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA)](figures/tsne-mp2d-mutagenicity.png){#fig:tsne-mp2d} @fig:tsne-cdk shows the t-SNE of pyrrolizidine alkaloids (PA) and the mutagenicity training data in CDK space (Euclidean similarity). ![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA)](figures/tsne-cdk-mutagenicity.png){#fig:tsne-cdk} Discussion ========== Data ---- A new training dataset for *Salmonella* mutagenicity was created from three different sources (@Kazius2005, @Hansen2009, @EFSA2016). It contains {{cv.n_uniq}} unique chemical structures, which is according to our knowledge the largest public mutagenicity dataset presently available. The new training data can be downloaded from . Algorithms ---------- `lazar` is formally a *k-nearest-neighbor* algorithm that searches for similar structures for a given compound and calculates the prediction based on the experimental data for these structures. The QSAR literature calls such models frequently *local models*, because models are generated specifically for each query compound. The investigated tensorflow models are in contrast *global models*, i.e. a single model is used to make predictions for all compounds. It has been postulated in the past, that local models are more accurate, because they can account better for mechanisms, that affect only a subset of the training data. @tbl:cv-mp2d, @tbl:cv-cdk and @fig:roc show that all models with the exception of lazar-CDK have similar crossvalidation accuracies that are comparable to the experimental variability of the *Salmonella typhimurium* mutagenicity bioassay (80-85% according to @Benigni1988). All of these models have balanced sensitivity (true position rate) and specificity (true negative rate) and provide highly significant concordance with experimental data (as determined by McNemar's Test). This is a clear indication that *in-silico* predictions can be as reliable as the bioassays. Given that the variability of experimental data is similar to model variability it is impossible to decide which model gives the most accurate predictions, as models with higher accuracies (e.g. NN-CDK) might just approximate experimental errors better than more robust models. `lazar` predictions with CDK descriptors are a notable exception, as it has a much lower overall accuracy ({{lazar_all_cdk.acc}}) than all other models. `lazar` uses basically a k-nearest-neighbor (with variable k) and it seems that CDK descriptors are not very well suited for chemical similarity calculations. We have confirmed this independently by validating k-nn models from the `R caret` package, which give also sub-par accuracies (data not shown). @fig:tsne-cdk is another indication that similarity calculations with CDK descriptors are not as useful as fingerprint based similarities, because it shows a less clearer separation between chemical classes and mutagens/non-mutagens than @fig:tsne-mp2d. It seems that more complex models than simple k-nn are required to utilize CDK descriptors efficiently. Our results do not support the assumption that local models are superior to global models for classification purposes. For regression models (lowest observed effect level) we have found however that local models may outperform global models (@Helma2018) with accuracies similar to experimental variability. Descriptors ----------- This study uses two types of descriptors for the characterisation of chemical structures: *MolPrint2D* fingerprints (MP2D, @Bender2004) use atom environments (i.e. connected atom types for all atoms in a molecule) as molecular representation, which resembles basically the chemical concept of functional groups. MP2D descriptors are used to determine chemical similarities in the default `lazar` settings, and previous experiments have shown, that they give more accurate results than predefined fingerprints (e.g. MACCS, FP2-4). *Chemistry Development Kit* (CDK, @Willighagen2017) descriptors were calculated with the PaDEL graphical interface (@Yap2011). They include 1D and 2D topological descriptors as well as physical-chemical properties. With exception of `lazar` all investigated algorithms obtained models within the experimental variability for both types of descriptors. As discussed before CDK descriptors seem to be less suitable for chemical similarity calculations than MolPrint2D descriptors. Given that similar predictive accuracies are obtainable from both types of descriptors the choice depends more on practical considerations: MolPrint2D fragments can be calculated very efficiently for every well defined chemical structure with OpenBabel (@OBoyle2011a). CDK descriptor calculations are in contrast much more resource intensive and may fail for a significant number of compounds ({{cv.cdk.n_failed}} from {{cv.n_uniq}}). MolPrint2D fragments are generated dynamically from chemical structures and can be used to determine if a compound contains structural features that are absent in training data. This feature can be used to determine applicability domains. CDK descriptors contain in contrast a predefined set of descriptors with unknown toxicological relevance. MolPrint2D fingerprints can be represented very efficiently as sets of features that are present in a given compound which makes similarity calculations very efficient. Due to the large number of substructures present in training compounds, they lead however to large and sparsely populated datasets, if they have to be expanded to a binary matrix (e.g. as input for tensorflow models). CDK descriptors contain in contrast in every case matrices with {{cv.cdk.n_descriptors}} columns. Pyrrolizidine alkaloid mutagenicity predictions ----------------------------------------------- @fig:dhp - @fig:tert show a clear differentiation between the different pyrrolizidine alkaloid groups. The largest proportion of mutagenic predictions was observed for Otonecines {{pa.groups.Otonecine.mut_perc}}% ({{pa.groups.Otonecine.mut}}/{{pa.groups.Otonecine.n_pred}}), the lowest for Monoesters {{pa.groups.Monoester.mut_perc}}% ({{pa.groups.Monoester.mut}}/{{pa.groups.Monoester.n_pred}}) and N-Oxides {{pa.groups.N_oxide.mut_perc}}% ({{pa.groups.N_oxide.mut}}/{{pa.groups.N_oxide.n_pred}}). Although most of the models show similar accuracies, sensitivities and specificities in crossvalidation experiments some of the models (MPD-RF, CDK-RF and CDK-SVM) predict a lower number of mutagens ({{pa.cdk_rf.mut_perc}}-{{pa.mp2d_rf.mut_perc}}%) than the majority of the models ({{pa.mp2d_svm.mut_perc}}-{{pa.mp2d_lazar_high_confidence.mut_perc}}% @tbl:pa-mp2d, @tbl:pa-cdk, @fig:dhp - @fig:tert). From a practical point we still have to face the question, how to choose model predictions, if no experimental data is available (we found two PAs in the training data, but this number is too low, to draw any general conclusions). **TODO**: **Verena** Hier ist ein alter Text von Dir zum Recylen: Necic acid The rank order of the necic acid is comparable in the four models considered (LAZAR, RF and DL (R-project and Tensorflow). PAs from the monoester type had the lowest genotoxic potential, followed by PAs from the open-ring diester type. PAs with macrocyclic diesters had the highest genotoxic potential. The result fit well with current state of knowledge: in general, PAs, which have a macrocyclic diesters as necic acid, are considered more toxic than those with an open-ring diester or monoester [EFSA 2011](#_ENREF_36)[Fu et al. 2004](#_ENREF_45)[Ruan et al. 2014b](#_ENREF_115)(; ; ). Necine base The rank order of necine base is comparable in LAZAR, RF, and DL (R-project) models: with platynecine being less or as genotoxic as retronecine, and otonecine being the most genotoxic. In the Tensorflow-generate DL model, platynecine also has the lowest genotoxic probability, but are then followed by the otonecines and last by retronecine. These results partly correspond to earlier published studies. Saturated PAs of the platynecine-type are generally accepted to be less or non-toxic and have been shown in *in vitro* experiments to form no DNA-adducts [Xia et al. 2013](#_ENREF_139)(). Therefore, it is striking, that 1,2-unsaturated PAs of the retronecine-type should have an almost comparable genotoxic potential in the LAZAR and DL (R-project) model. In literature, otonecine-type PAs were shown to be more toxic than those of the retronecine-type [Li et al. 2013](#_ENREF_80)(). Modifications of necine base The group-specific results of the Tensorflow-generated DL model appear to reflect the expected relationship between the groups: the low genotoxic potential of *N*-oxides and the highest potential of dehydropyrrolizidines [Chen et al. 2010](#_ENREF_26)(). In the LAZAR model, the genotoxic potential of dehydropyrrolizidines (DHP) (using the extended AD) is comparable to that of tertiary PAs. Since, DHP is regarded as the toxic principle in the metabolism of PAs, and known to produce protein- and DNA-adducts [Chen et al. 2010](#_ENREF_26)(), the LAZAR model did not meet this expectation it predicted the majority of DHP as being not genotoxic. However, the following issues need to be considered. On the one hand, all DHP were outside of the stricter AD of 0.5. This indicates that in general, there might be a problem with the AD. In addition, DHP has two unsaturated double bounds in its necine base, making it highly reactive. DHP and other comparable molecules have a very short lifespan, and usually cannot be used in *in vitro* experiments. This might explain the absence of suitable neighbours in LAZAR. Furthermore, the probabilities for this substance groups needs to be considered, and not only the consolidated prediction. In the LAZAR model, all DHPs had probabilities for both outcomes (genotoxic and not genotoxic) mainly below 30%. Additionally, the probabilities for both outcomes were close together, often within 10% of each other. The fact that for both outcomes, the probabilities were low and close together, indicates a lower confidence in the prediction of the model for DHPs. In the DL (R-project) and RF model, *N*-oxides have a by far more genotoxic potential that tertiary PAs or dehydropyrrolizidines. As PA *N*-oxides are easily conjugated for extraction, they are generally considered as detoxification products, which are *in vivo* quickly renally eliminated [Chen et al. 2010](#_ENREF_26)(). On the other hand, *N*-oxides can be also back-transformed to the corresponding tertiary PA [Wang et al. 2005](#_ENREF_134)(). Therefore, it may be questioned, whether *N*-oxides themselves are generally less genotoxic than the corresponding tertiary PAs. However, in the groups of modification of the necine base, dehydropyrrolizidine, the toxic principle of PAs, should have had the highest genotoxic potential. Taken together, the predictions of the modifications of the necine base from the LAZAR, RF and R-generated DL model cannot - in contrast to the Tensorflow DL model - be considered as reliable. Overall, when comparing the prediction results of the PAs to current published knowledge, it can be concluded that the performance of most models was low to moderate. This might be contributed to the following issues: 1. In the LAZAR model, only 26.6% PAs were within the stricter AD. With the extended AD, 92.3% of the PAs could be included in the prediction. Even though the Jaccard distance between the training dataset and the PA dataset for the RF, SVM, and DL (R-project and Tensorflow) models was small, suggesting a high similarity, the LAZAR indicated that PAs have only few local neighbours, which might adversely affect the prediction of the mutagenic potential of PAs. 2. All above-mentioned models were used to predict the mutagenicity of PAs. PAs are generally considered to be genotoxic, and the mode of action is also known. Therefore, the fact that some models predict the majority of PAs as not genotoxic seems contradictory. To understand this result, the basis, the training dataset, has to be considered. The mutagenicity of in the training dataset are based on data of mutagenicity in bacteria. There are some studies, which show mutagenicity of PAs in the AMES test [Chen et al. 2010](#_ENREF_26)(). Also, [Rubiolo et al. (1992)](#_ENREF_116) examined several different PAs and several different extracts of PA-containing plants in the AMES test. They found that the AMES test was indeed able to detect mutagenicity of PAs, but in general, appeared to have a low sensitivity. The pre-incubation phase for metabolic activation of PAs by microsomal enzymes was the sensitivity-limiting step. This could very well mean that this is also reflected in the QSAR models. Conclusions =========== A new public *Salmonella* mutagenicity training dataset with 8309 compounds was created and used it to train `lazar` and Tensorflow models with MolPrint2D and CDK descriptors. References ==========