1 files changed, 532 insertions, 0 deletions
diff --git a/paper/mutagenicity.md b/paper/mutagenicity.md
new file mode 100644
index 0000000..c316543
--- /dev/null
+++ b/paper/mutagenicity.md
@@ -0,0 +1,532 @@
+---
+title: A comparison of random forest, support vector machine, deep learning and lazar algorithms for predicting mutagenicity
+#subtitle: Performance comparison with a new expanded dataset
+author:
+  - Christoph Helma:
+      institute: ist
+      email: helma@in-silico.ch
+      correspondence: "yes"
+  - Verena Schöning:
+      institute: zeller
+  - Philipp Boss:
+      institute: zeller
+  - Jürgen Drewe:
+      institute: zeller
+institute:
+  - ist:
+      name: in silico toxicology gmbh
+      address: "Rastatterstrasse 41, 4057 Basel, Switzerland"
+  - zeller: 
+      name: Zeller AG
+      address: "Seeblickstrasse 4, 8590 Romanshorn, Switzerland"
+bibliography: bibliography.bib
+keywords: mutagenicity, (Q)SAR, lazar, random forest, support vector machine, deep learning
+documentclass: scrartcl
+...
+
+Abstract
+========
+
+k-nearest neighbor (`lazar`), random forest, support vector machine and deep
+learning algorithms were applied to a new *Salmonella* mutagenicity dataset
+with 8281 unique chemical structures. Algorithm performance was evaluated
+using 5-fold crossvalidation. 
+TODO
+- results
+- conclusion
+
+Introduction
+============
+
+TODO: algo history
+
+TODO: dataset history
+
+TODO: open problems
+
+The main objective of this study was 
+
+  - to generate a new training dataset, by combining the most comprehensive public mutagenicity datasets
+  - to compare the performance of global models (RF, SVM, Neural Nets) with local models (`lazar`)
+
+Materials and Methods
+=====================
+
+Data
+----
+
+For all methods, the same training dataset was used. The
+training dataset was compiled from the following sources:
+
+-   Kazius/Bursi Dataset (4337 compounds, @Kazius2005): <http://cheminformatics.org/datasets/bursi/cas_4337.zip>
+
+-   Hansen Dataset (6513 compounds, @Hansen2009): <http://doc.ml.tu-berlin.de/toxbenchmark/Mutagenicity_N6512.csv>
+
+-   EFSA Dataset (695 compounds): <https://data.europa.eu/euodp/data/storage/f/2017-0719T142131/GENOTOX%20data%20and%20dictionary.xls>
+
+Mutagenicity classifications from Kazius and Hansen datasets were used
+without further processing. To achieve consistency with these
+datasets, EFSA compounds were classified as mutagenic, if at least one
+positive result was found for TA98 or T100 Salmonella strains.
+
+Dataset merges were based on unique SMILES (*Simplified Molecular Input
+Line Entry Specification*) strings of the compound structures.
+Duplicated experimental data with the same outcome was merged into a
+single value, because it is likely that it originated from the same
+experiment. Contradictory results were kept as multiple measurements in
+the database. The combined training dataset contains 8281 unique
+structures.
+
+Source code for all data download, extraction and merge operations is
+publicly available from the git repository
+<https://git.in-silico.ch/mutagenicity-paper> under a GPL3 License.
+
+TODO: check/fix git repo
+
+Algorithms
+----------
+
+### `lazar`
+
+`lazar` (*lazy structure activity relationships*) is a modular framework
+for read-across model development and validation. It follows the
+following basic workflow: For a given chemical structure `lazar`:
+
+-   searches in a database for similar structures (neighbours) with
+    experimental data,
+
+-   builds a local QSAR model with these neighbours and
+
+-   uses this model to predict the unknown activity of the query
+    compound.
+
+This procedure resembles an automated version of read across predictions
+in toxicology, in machine learning terms it would be classified as a
+k-nearest-neighbour algorithm.
+
+Apart from this basic workflow, `lazar` is completely modular and allows
+the researcher to use any algorithm for similarity searches and local
+QSAR (*Quantitative structure--activity relationship*) modelling.
+Algorithms used within this study are described in the following
+sections.
+
+#### Neighbour identification
+
+Similarity calculations were based on MolPrint2D fingerprints (@Bender2004) from the OpenBabel cheminformatics library
+(@OBoyle2011a). The MolPrint2D fingerprint uses
+atom environments as molecular representation, which resembles basically
+the chemical concept of functional groups. For each atom in a molecule,
+it represents the chemical environment using the atom types of connected
+atoms.
+
+MolPrint2D fingerprints are generated dynamically from chemical
+structures and do not rely on predefined lists of fragments (such as
+OpenBabel FP3, FP4 or MACCs fingerprints or lists of
+toxicophores/toxicophobes). This has the advantage that they may capture
+substructures of toxicological relevance that are not included in other
+fingerprints.
+
+From MolPrint2D fingerprints a feature vector with all atom environments
+of a compound can be constructed that can be used to calculate chemical
+similarities.
+
+The chemical similarity between two compounds a and b is expressed as
+the proportion between atom environments common in both structures A ∩ B
+and the total number of atom environments A U B (Jaccard/Tanimoto
+index).
+
+$$sim = \frac{\left| A\  \cap B \right|}{\left| A\  \cup B \right|}$$
+
+Threshold selection is a trade-off between prediction accuracy (high
+threshold) and the number of predictable compounds (low threshold). As
+it is in many practical cases desirable to make predictions even in the
+absence of closely related neighbours, we follow a tiered approach:
+
+-   First a similarity threshold of 0.5 is used to collect neighbours,
+    to create a local QSAR model and to make a prediction for the query
+    compound.
+
+-   If any of these steps fails, the procedure is repeated with a
+    similarity threshold of 0.2 and the prediction is flagged with a
+    warning that it might be out of the applicability domain of the
+    training data.
+
+-   Similarity thresholds of 0.5 and 0.2 are the default values chosen
+    > by the software developers and remained unchanged during the
+    > course of these experiments.
+
+Compounds with the same structure as the query structure are
+automatically eliminated from neighbours to obtain unbiased predictions
+in the presence of duplicates.
+
+#### Local QSAR models and predictions
+
+Only similar compounds (neighbours) above the threshold are used for
+local QSAR models. In this investigation, we are using a weighted
+majority vote from the neighbour's experimental data for mutagenicity
+classifications. Probabilities for both classes
+(mutagenic/non-mutagenic) are calculated according to the following
+formula and the class with the higher probability is used as prediction
+outcome.
+
+$$p_{c} = \ \frac{\sum_{}^{}\text{sim}_{n,c}}{\sum_{}^{}\text{sim}_{n}}$$
+
+$p_{c}$ Probability of class c (e.g. mutagenic or non-mutagenic)\
+$\sum_{}^{}\text{sim}_{n,c}$ Sum of similarities of neighbours with
+class c\
+$\sum_{}^{}\text{sim}_{n}$ Sum of all neighbours
+
+#### Applicability domain
+
+The applicability domain (AD) of `lazar` models is determined by the
+structural diversity of the training data. If no similar compounds are
+found in the training data no predictions will be generated. Warnings
+are issued if the similarity threshold had to be lowered from 0.5 to 0.2
+in order to enable predictions. Predictions without warnings can be
+considered as close to the applicability domain and predictions with
+warnings as more distant from the applicability domain. Quantitative
+applicability domain information can be obtained from the similarities
+of individual neighbours.
+
+#### Availability
+
+-   `lazar` experiments for this manuscript:
+    <https://git.in-silico.ch/mutagenicity-paper>
+    (source code, GPL3)
+
+-   `lazar` framework:
+    <https://git.in-silico.ch/lazar>
+    (source code, GPL3)
+
+-   `lazar` GUI:
+    <https://git.in-silico.ch/lazar-gui>
+    (source code, GPL3)
+
+-   Public web interface:
+    <https://lazar.in-silico.ch>
+
+### Random Forest, Support Vector Machines, and Deep Learning in R-project
+
+For the Random Forest (RF), Support Vector Machines (SVM), and Deep
+Learning (DL) models, molecular descriptors were calculated
+with the PaDEL-Descriptors program (<http://www.yapcwsoft.com> version 2.21, @Yap2011).
+
+TODO: sentence ??
+
+From these descriptors were
+chosen, which were actually used for the generation of the DL model.
+
+
+In comparison to `lazar`, three other models (Random Forest (RF), Support
+Vector Machines (SVM), and Deep Learning (DL)) were evaluated.
+
+For the generation of these models, molecular 1D and 2D descriptors of
+the training dataset were calculated using PaDEL-Descriptors (<http://www.yapcwsoft.com> version
+2.21, @Yap2011).
+
+As the training dataset contained over 8280 instances, it was decided to
+delete instances with missing values during data pre-processing.
+Furthermore, substances with equivocal outcome were removed. The final
+training dataset contained 8080 instances with known mutagenic
+potential. The RF, SVM, and DL models were generated using the R
+software (R-project for Statistical Computing,
+<https://www.r-project.org/>*;* version 3.3.1), specific R packages used
+are identified for each step in the description below. During feature
+selection, descriptor with near zero variance were removed using
+'*NearZeroVar*'-function (package 'caret'). If the percentage of the
+most common value was more than 90% or when the frequency ratio of the
+most common value to the second most common value was greater than 95:5
+(e.g. 95 instances of the most common value and only 5 or less instances
+of the second most common value), a descriptor was classified as having
+a near zero variance. After that, highly correlated descriptors were
+removed using the '*findCorrelation*'-function (package 'caret') with a
+cut-off of 0.9. This resulted in a training dataset with 516
+descriptors. These descriptors were scaled to be in the range between 0
+and 1 using the '*preProcess*'-function (package 'caret'). The scaling
+routine was saved in order to apply the same scaling on the testing
+dataset. As these three steps did not consider the outcome, it was
+decided that they do not need to be included in the cross-validation of
+the model. To further reduce the number of features, a LASSO (*least
+absolute shrinkage and selection operator*) regression was performed
+using the '*glmnet*'-function (package '*glmnet*'). The reduced dataset
+was used for the generation of the pre-trained models.
+
+For the RF model, the '*randomForest*'-function (package
+'*randomForest*') was used. A forest with 1000 trees with maximal
+terminal nodes of 200 was grown for the prediction.
+
+The '*svm*'-function (package 'e1071') with a *radial basis function
+kernel* was used for the SVM model.
+
+The DL model was generated using the '*h2o.deeplearning*'-function
+(package '*h2o*'). The DL contained four hidden layer with 70, 50, 50,
+and 10 neurons, respectively. Other hyperparameter were set as follows:
+l1=1.0E-7, l2=1.0E-11, epsilon = 1.0E-10, rho = 0.8, and quantile\_alpha
+= 0.5. For all other hyperparameter, the default values were used.
+Weights and biases were in a first step determined with an unsupervised
+DL model. These values were then used for the actual, supervised DL
+model.
+
+To validate these models, an internal cross-validation approach was
+chosen. The training dataset was randomly split in training data, which
+contained 95% of the data, and validation data, which contain 5% of the
+data. A feature selection with LASSO on the training data was performed,
+reducing the number of descriptors to approximately 100. This step was
+repeated five times. Based on each of the five different training data,
+the predictive models were trained and the performance tested with the
+validation data. This step was repeated 10 times. Furthermore, a
+y-randomisation using the RF model was performed. During
+y-randomisation, the outcome (y-variable) is randomly permuted. The
+theory is that after randomisation of the outcome, the model should not
+be able to correlate the outcome to the properties (descriptor values)
+of the substances. The performance of the model should therefore
+indicate a by change prediction with an accuracy of about 50%. If this
+is true, it can be concluded that correlation between actual outcome and
+properties of the substances is real and not by chance (@Rücker2007).
+
+![](media/image1.png){width="6.26875in" height="5.486111111111111in"}
+
+Figure 1: Flowchart of the generation and validation of the models
+generated in R-project
+
+#### Applicability domain
+
+The AD of the training dataset and the PA dataset was evaluated using
+the Jaccard distance. A Jaccard distance of '0' indicates that the
+substances are similar, whereas a value of '1' shows that the substances
+are different. The Jaccard distance was below 0.2 for all PAs relative
+to the training dataset. Therefore, PA dataset is within the AD of the
+training dataset and the models can be used to predict the genotoxic
+potential of the PA dataset.
+
+#### y-randomisation
+
+After y-randomisation of the outcome, the accuracy and CCR are around
+50%, indicating a chance in the distribution of the results. This shows,
+that the outcome is actually related to the predictors and not by
+chance.
+
+### Deep Learning in TensorFlow
+
+Alternatively, a DL model was established with Python-based TensorFlow
+program (<https://www.tensorflow.org/>) using the high-level API Keras
+(<https://www.tensorflow.org/guide/keras>) to build the models.
+
+Data pre-processing was done by rank transformation using the
+'*QuantileTransformer*' procedure. A sequential model has been used.
+Four layers have been used: input layer, two hidden layers (with 12, 8
+and 8 nodes, respectively) and one output layer. For the output layer, a
+sigmoidal activation function and for all other layers the ReLU
+('*Rectified Linear Unit*') activation function was used. Additionally,
+a L^2^-penalty of 0.001 was used for the input layer. For training of
+the model, the ADAM algorithm was used to minimise the cross-entropy
+loss using the default parameters of Keras. Training was performed for
+100 epochs with a batch size of 64. The model was implemented with
+Python 3.6 and Keras. For training of the model, a 6-fold
+cross-validation was used. Accuracy was estimated by ROC-AUC and
+confusion matrix.
+
+Validation
+----------
+
+Results
+=======
+
+`lazar`
+-----
+
+Random Forest
+-------------
+
+The validation showed that the RF model has an accuracy of 64%, a
+sensitivity of 66% and a specificity of 63%. The confusion matrix of the
+model, calculated for 8080 instances, is provided in Table 1.
+
+Table 1: Confusion matrix of the RF model
+
+                          Predicted genotoxicity                         
+  ----------------------- ------------------------ ---------- ---------- -------------
+  Measured genotoxicity                            ***PP***   ***PN***   ***Total***
+                          ***TP***                 2274       1163       3437
+                          ***TN***                 1736       2907       4643
+                          ***Total***              4010       4070       8080
+
+PP: Predicted positive; PN: Predicted negative, TP: True positive, TN:
+True negative
+
+Support Vector Machines
+-----------------------
+
+The validation showed that the SVM model has an accuracy of 62%, a
+sensitivity of 65% and a specificity of 60%. The confusion matrix of SVM
+model, calculated for 8080 instances, is provided in Table 2.
+
+Table 2: Confusion matrix of the SVM model
+
+                          Predicted genotoxicity                         
+  ----------------------- ------------------------ ---------- ---------- -------------
+  Measured genotoxicity                            ***PP***   ***PN***   ***Total***
+                          ***TP***                 2057       1107       3164
+                          ***TN***                 1953       2963       4916
+                          ***Total***              4010       4070       8080
+
+PP: Predicted positive; PN: Predicted negative, TP: True positive, TN:
+True negative
+
+Deep Learning (R-project)
+-------------------------
+
+The validation showed that the DL model generated in R has an accuracy
+of 59%, a sensitivity of 89% and a specificity of 30%. The confusion
+matrix of the model, normalised to 8080 instances, is provided in Table
+3.
+
+Table 3: Confusion matrix of the DL model (R-project)
+
+                          Predicted genotoxicity                         
+  ----------------------- ------------------------ ---------- ---------- -------------
+  Measured genotoxicity                            ***PP***   ***PN***   ***Total***
+                          ***TP***                 3575       435        4010
+                          ***TN***                 2853       1217       4070
+                          ***Total***              6428       1652       8080
+
+PP: Predicted positive; PN: Predicted negative, TP: True positive, TN:
+True negative
+
+DL model (TensorFlow)
+---------------------
+
+The validation showed that the DL model generated in TensorFlow has an
+accuracy of 68%, a sensitivity of 70% and a specificity of 46%. The
+confusion matrix of the model, normalised to 8080 instances, is provided
+in Table 4.
+
+Table 4: Confusion matrix of the DL model (TensorFlow)
+
+                          Predicted genotoxicity                         
+  ----------------------- ------------------------ ---------- ---------- -------------
+  Measured genotoxicity                            ***PP***   ***PN***   ***Total***
+                          ***TP***                 2851       1227       4078
+                          ***TN***                 1825       2177       4002
+                          ***Total***              4676       3404       8080
+
+PP: Predicted positive; PN: Predicted negative, TP: True positive, TN:
+True negative
+
+The ROC curves from the 6-fold validation are shown in Figure 7.
+
+![](media/image7.png){width="3.825in"
+height="2.7327045056867894in"}
+
+Figure 7: Six-fold cross-validation of TensorFlow DL model show an
+average area under the ROC-curve (ROC-AUC; measure of accuracy) of 68%.
+
+In summary, the validation results of the four methods are presented in
+the following table.
+
+Table 5 Results of the cross-validation of the four models and after
+y-randomisation
+
+  ----------------------------------------------------------------------
+                          Accuracy   CCR     Sensitivity   Specificity
+  ----------------------- ---------- ------- ------------- -------------
+  RF model                64.1%      64.4%   66.2%         62.6%
+
+  SVM model               62.1%      62.6%   65.0%         60.3%
+
+  DL model\               59.3%      59.5%   89.2%         29.9%
+  (R-project)                                              
+
+  DL model (TensorFlow)   68%        62.2%   69.9%         45.6%
+
+  y-randomisation         50.5%      50.4%   50.3%         50.6%
+  ----------------------------------------------------------------------
+
+CCR (correct classification rate)
+
+Discussion
+==========
+
+General model performance
+
+Based on the results of the cross-validation for all models, `lazar`, RF,
+SVM, DL (R-project) and DL (TensorFlow) it can be state that the
+prediction results are not optimal due to different reasons. The
+accuracy as measured during cross-validation of the four models (RF,
+SVM, DL (R-project and TensorFlow)) was partly low with CCR values
+between 59.3 and 68%, with the R-generated DL model and the
+TensorFlow-generated DL model showing the worst and the best
+performance, respectively. The validation of the R-generated DL model
+revealed a high sensitivity (89.2%) but an unacceptably low specificity
+of 29.9% indicating a high number of false positive estimates. The
+TensorFlow-generated DL model, however, showed an acceptable but not
+optimal accuracy of 68%, a sensitivity of 69.9% and a specificity of
+45.6%. The low specificity indicates that both DL models tends to
+predict too many instances as positive (genotoxic), and therefore have a
+high false positive rate. This allows at least with the TensorFlow
+generated DL model to make group statements, but the confidence for
+estimations of single PAs appears to be insufficiently low.
+
+Several factors have likely contributed to the low to moderate
+performance of the used methods as shown during the cross-validation:
+
+1.  The outcome in the training dataset was based on the results of AMES
+    tests for genotoxicity [ICH 2011](#_ENREF_63)(), an *in vitro* test
+    in different strains of the bacteria *Salmonella typhimurium*. In
+    this test, mutagenicity is evaluated with and without prior
+    metabolic activation of the test substance. Metabolic activation
+    could result in the formation of genotoxic metabolites from
+    non-genotoxic parent compounds. However, no distinction was made in
+    the training dataset between substances that needed metabolic
+    activation before being mutagenic and those that were mutagenic
+    without metabolic activation. `lazar` is able to handle this
+    'inaccuracy' in the training dataset well due to the way the
+    algorithm works: `lazar` predicts the genotoxic potential based on the
+    neighbours of substances with comparable structural features,
+    considering mutagenic and not mutagenic neighbours. Based on the
+    structural similarity, a probability for mutagenicity and no
+    mutagenicity is calculated independently from each other (meaning
+    that the sum of probabilities does not necessarily adds up to 100%).
+    The class with the higher outcome is then the overall outcome for
+    the substance.
+
+> In contrast, the other models need to be trained first to recognise
+> the structural features that are responsible for genotoxicity.
+> Therefore, the mixture of substances being mutagenic with and without
+> metabolic activation in the training dataset may have adversely
+> affected the ability to separate the dataset in two distinct classes
+> and thus explains the relatively low performance of these models.
+
+2.  Machine learning algorithms try to find an optimized solution in a
+    high-dimensional (one dimension per each predictor) space. Sometimes
+    these methods do not find the global optimum of estimates but only
+    local (not optimal) solutions. Strategies to find the global
+    solutions are systematic variation (grid search) of the
+    hyperparameters of the methods, which may be very time consuming in
+    particular in large datasets.
+
+
+Conclusions
+===========
+
+In this study, an attempt was made to predict the genotoxic potential of
+PAs using five different machine learning techniques (`lazar`, RF, SVM, DL
+(R-project and TensorFlow). The results of all models fitted only partly
+to the findings in literature, with best results obtained with the
+TensorFlow DL model. Therefore, modelling allows statements on the
+relative risks of genotoxicity of the different PA groups. Individual
+predictions for selective PAs appear, however, not reliable on the
+current basis of the used training dataset.
+
+This study emphasises the importance of critical assessment of
+predictions by QSAR models. This includes not only extensive literature
+research to assess the plausibility of the predictions, but also a good
+knowledge of the metabolism of the test substances and understanding for
+possible mechanisms of toxicity.
+
+In further studies, additional machine learning techniques or a modified
+(extended) training dataset should be used for an additional attempt to
+predict the genotoxic potential of PAs.
+
+References
+==========