diff options
Diffstat (limited to 'paper/mutagenicity.md')
-rw-r--r-- | paper/mutagenicity.md | 532 |
1 files changed, 532 insertions, 0 deletions
diff --git a/paper/mutagenicity.md b/paper/mutagenicity.md new file mode 100644 index 0000000..c316543 --- /dev/null +++ b/paper/mutagenicity.md @@ -0,0 +1,532 @@ +--- +title: A comparison of random forest, support vector machine, deep learning and lazar algorithms for predicting mutagenicity +#subtitle: Performance comparison with a new expanded dataset +author: + - Christoph Helma: + institute: ist + email: helma@in-silico.ch + correspondence: "yes" + - Verena Schöning: + institute: zeller + - Philipp Boss: + institute: zeller + - Jürgen Drewe: + institute: zeller +institute: + - ist: + name: in silico toxicology gmbh + address: "Rastatterstrasse 41, 4057 Basel, Switzerland" + - zeller: + name: Zeller AG + address: "Seeblickstrasse 4, 8590 Romanshorn, Switzerland" +bibliography: bibliography.bib +keywords: mutagenicity, (Q)SAR, lazar, random forest, support vector machine, deep learning +documentclass: scrartcl +... + +Abstract +======== + +k-nearest neighbor (`lazar`), random forest, support vector machine and deep +learning algorithms were applied to a new *Salmonella* mutagenicity dataset +with 8281 unique chemical structures. Algorithm performance was evaluated +using 5-fold crossvalidation. +TODO +- results +- conclusion + +Introduction +============ + +TODO: algo history + +TODO: dataset history + +TODO: open problems + +The main objective of this study was + + - to generate a new training dataset, by combining the most comprehensive public mutagenicity datasets + - to compare the performance of global models (RF, SVM, Neural Nets) with local models (`lazar`) + +Materials and Methods +===================== + +Data +---- + +For all methods, the same training dataset was used. The +training dataset was compiled from the following sources: + +- Kazius/Bursi Dataset (4337 compounds, @Kazius2005): <http://cheminformatics.org/datasets/bursi/cas_4337.zip> + +- Hansen Dataset (6513 compounds, @Hansen2009): <http://doc.ml.tu-berlin.de/toxbenchmark/Mutagenicity_N6512.csv> + +- EFSA Dataset (695 compounds): <https://data.europa.eu/euodp/data/storage/f/2017-0719T142131/GENOTOX%20data%20and%20dictionary.xls> + +Mutagenicity classifications from Kazius and Hansen datasets were used +without further processing. To achieve consistency with these +datasets, EFSA compounds were classified as mutagenic, if at least one +positive result was found for TA98 or T100 Salmonella strains. + +Dataset merges were based on unique SMILES (*Simplified Molecular Input +Line Entry Specification*) strings of the compound structures. +Duplicated experimental data with the same outcome was merged into a +single value, because it is likely that it originated from the same +experiment. Contradictory results were kept as multiple measurements in +the database. The combined training dataset contains 8281 unique +structures. + +Source code for all data download, extraction and merge operations is +publicly available from the git repository +<https://git.in-silico.ch/mutagenicity-paper> under a GPL3 License. + +TODO: check/fix git repo + +Algorithms +---------- + +### `lazar` + +`lazar` (*lazy structure activity relationships*) is a modular framework +for read-across model development and validation. It follows the +following basic workflow: For a given chemical structure `lazar`: + +- searches in a database for similar structures (neighbours) with + experimental data, + +- builds a local QSAR model with these neighbours and + +- uses this model to predict the unknown activity of the query + compound. + +This procedure resembles an automated version of read across predictions +in toxicology, in machine learning terms it would be classified as a +k-nearest-neighbour algorithm. + +Apart from this basic workflow, `lazar` is completely modular and allows +the researcher to use any algorithm for similarity searches and local +QSAR (*Quantitative structure--activity relationship*) modelling. +Algorithms used within this study are described in the following +sections. + +#### Neighbour identification + +Similarity calculations were based on MolPrint2D fingerprints (@Bender2004) from the OpenBabel cheminformatics library +(@OBoyle2011a). The MolPrint2D fingerprint uses +atom environments as molecular representation, which resembles basically +the chemical concept of functional groups. For each atom in a molecule, +it represents the chemical environment using the atom types of connected +atoms. + +MolPrint2D fingerprints are generated dynamically from chemical +structures and do not rely on predefined lists of fragments (such as +OpenBabel FP3, FP4 or MACCs fingerprints or lists of +toxicophores/toxicophobes). This has the advantage that they may capture +substructures of toxicological relevance that are not included in other +fingerprints. + +From MolPrint2D fingerprints a feature vector with all atom environments +of a compound can be constructed that can be used to calculate chemical +similarities. + +The chemical similarity between two compounds a and b is expressed as +the proportion between atom environments common in both structures A ∩ B +and the total number of atom environments A U B (Jaccard/Tanimoto +index). + +$$sim = \frac{\left| A\ \cap B \right|}{\left| A\ \cup B \right|}$$ + +Threshold selection is a trade-off between prediction accuracy (high +threshold) and the number of predictable compounds (low threshold). As +it is in many practical cases desirable to make predictions even in the +absence of closely related neighbours, we follow a tiered approach: + +- First a similarity threshold of 0.5 is used to collect neighbours, + to create a local QSAR model and to make a prediction for the query + compound. + +- If any of these steps fails, the procedure is repeated with a + similarity threshold of 0.2 and the prediction is flagged with a + warning that it might be out of the applicability domain of the + training data. + +- Similarity thresholds of 0.5 and 0.2 are the default values chosen + > by the software developers and remained unchanged during the + > course of these experiments. + +Compounds with the same structure as the query structure are +automatically eliminated from neighbours to obtain unbiased predictions +in the presence of duplicates. + +#### Local QSAR models and predictions + +Only similar compounds (neighbours) above the threshold are used for +local QSAR models. In this investigation, we are using a weighted +majority vote from the neighbour's experimental data for mutagenicity +classifications. Probabilities for both classes +(mutagenic/non-mutagenic) are calculated according to the following +formula and the class with the higher probability is used as prediction +outcome. + +$$p_{c} = \ \frac{\sum_{}^{}\text{sim}_{n,c}}{\sum_{}^{}\text{sim}_{n}}$$ + +$p_{c}$ Probability of class c (e.g. mutagenic or non-mutagenic)\ +$\sum_{}^{}\text{sim}_{n,c}$ Sum of similarities of neighbours with +class c\ +$\sum_{}^{}\text{sim}_{n}$ Sum of all neighbours + +#### Applicability domain + +The applicability domain (AD) of `lazar` models is determined by the +structural diversity of the training data. If no similar compounds are +found in the training data no predictions will be generated. Warnings +are issued if the similarity threshold had to be lowered from 0.5 to 0.2 +in order to enable predictions. Predictions without warnings can be +considered as close to the applicability domain and predictions with +warnings as more distant from the applicability domain. Quantitative +applicability domain information can be obtained from the similarities +of individual neighbours. + +#### Availability + +- `lazar` experiments for this manuscript: + <https://git.in-silico.ch/mutagenicity-paper> + (source code, GPL3) + +- `lazar` framework: + <https://git.in-silico.ch/lazar> + (source code, GPL3) + +- `lazar` GUI: + <https://git.in-silico.ch/lazar-gui> + (source code, GPL3) + +- Public web interface: + <https://lazar.in-silico.ch> + +### Random Forest, Support Vector Machines, and Deep Learning in R-project + +For the Random Forest (RF), Support Vector Machines (SVM), and Deep +Learning (DL) models, molecular descriptors were calculated +with the PaDEL-Descriptors program (<http://www.yapcwsoft.com> version 2.21, @Yap2011). + +TODO: sentence ?? + +From these descriptors were +chosen, which were actually used for the generation of the DL model. + + +In comparison to `lazar`, three other models (Random Forest (RF), Support +Vector Machines (SVM), and Deep Learning (DL)) were evaluated. + +For the generation of these models, molecular 1D and 2D descriptors of +the training dataset were calculated using PaDEL-Descriptors (<http://www.yapcwsoft.com> version +2.21, @Yap2011). + +As the training dataset contained over 8280 instances, it was decided to +delete instances with missing values during data pre-processing. +Furthermore, substances with equivocal outcome were removed. The final +training dataset contained 8080 instances with known mutagenic +potential. The RF, SVM, and DL models were generated using the R +software (R-project for Statistical Computing, +<https://www.r-project.org/>*;* version 3.3.1), specific R packages used +are identified for each step in the description below. During feature +selection, descriptor with near zero variance were removed using +'*NearZeroVar*'-function (package 'caret'). If the percentage of the +most common value was more than 90% or when the frequency ratio of the +most common value to the second most common value was greater than 95:5 +(e.g. 95 instances of the most common value and only 5 or less instances +of the second most common value), a descriptor was classified as having +a near zero variance. After that, highly correlated descriptors were +removed using the '*findCorrelation*'-function (package 'caret') with a +cut-off of 0.9. This resulted in a training dataset with 516 +descriptors. These descriptors were scaled to be in the range between 0 +and 1 using the '*preProcess*'-function (package 'caret'). The scaling +routine was saved in order to apply the same scaling on the testing +dataset. As these three steps did not consider the outcome, it was +decided that they do not need to be included in the cross-validation of +the model. To further reduce the number of features, a LASSO (*least +absolute shrinkage and selection operator*) regression was performed +using the '*glmnet*'-function (package '*glmnet*'). The reduced dataset +was used for the generation of the pre-trained models. + +For the RF model, the '*randomForest*'-function (package +'*randomForest*') was used. A forest with 1000 trees with maximal +terminal nodes of 200 was grown for the prediction. + +The '*svm*'-function (package 'e1071') with a *radial basis function +kernel* was used for the SVM model. + +The DL model was generated using the '*h2o.deeplearning*'-function +(package '*h2o*'). The DL contained four hidden layer with 70, 50, 50, +and 10 neurons, respectively. Other hyperparameter were set as follows: +l1=1.0E-7, l2=1.0E-11, epsilon = 1.0E-10, rho = 0.8, and quantile\_alpha += 0.5. For all other hyperparameter, the default values were used. +Weights and biases were in a first step determined with an unsupervised +DL model. These values were then used for the actual, supervised DL +model. + +To validate these models, an internal cross-validation approach was +chosen. The training dataset was randomly split in training data, which +contained 95% of the data, and validation data, which contain 5% of the +data. A feature selection with LASSO on the training data was performed, +reducing the number of descriptors to approximately 100. This step was +repeated five times. Based on each of the five different training data, +the predictive models were trained and the performance tested with the +validation data. This step was repeated 10 times. Furthermore, a +y-randomisation using the RF model was performed. During +y-randomisation, the outcome (y-variable) is randomly permuted. The +theory is that after randomisation of the outcome, the model should not +be able to correlate the outcome to the properties (descriptor values) +of the substances. The performance of the model should therefore +indicate a by change prediction with an accuracy of about 50%. If this +is true, it can be concluded that correlation between actual outcome and +properties of the substances is real and not by chance (@Rücker2007). + +![](media/image1.png){width="6.26875in" height="5.486111111111111in"} + +Figure 1: Flowchart of the generation and validation of the models +generated in R-project + +#### Applicability domain + +The AD of the training dataset and the PA dataset was evaluated using +the Jaccard distance. A Jaccard distance of '0' indicates that the +substances are similar, whereas a value of '1' shows that the substances +are different. The Jaccard distance was below 0.2 for all PAs relative +to the training dataset. Therefore, PA dataset is within the AD of the +training dataset and the models can be used to predict the genotoxic +potential of the PA dataset. + +#### y-randomisation + +After y-randomisation of the outcome, the accuracy and CCR are around +50%, indicating a chance in the distribution of the results. This shows, +that the outcome is actually related to the predictors and not by +chance. + +### Deep Learning in TensorFlow + +Alternatively, a DL model was established with Python-based TensorFlow +program (<https://www.tensorflow.org/>) using the high-level API Keras +(<https://www.tensorflow.org/guide/keras>) to build the models. + +Data pre-processing was done by rank transformation using the +'*QuantileTransformer*' procedure. A sequential model has been used. +Four layers have been used: input layer, two hidden layers (with 12, 8 +and 8 nodes, respectively) and one output layer. For the output layer, a +sigmoidal activation function and for all other layers the ReLU +('*Rectified Linear Unit*') activation function was used. Additionally, +a L^2^-penalty of 0.001 was used for the input layer. For training of +the model, the ADAM algorithm was used to minimise the cross-entropy +loss using the default parameters of Keras. Training was performed for +100 epochs with a batch size of 64. The model was implemented with +Python 3.6 and Keras. For training of the model, a 6-fold +cross-validation was used. Accuracy was estimated by ROC-AUC and +confusion matrix. + +Validation +---------- + +Results +======= + +`lazar` +----- + +Random Forest +------------- + +The validation showed that the RF model has an accuracy of 64%, a +sensitivity of 66% and a specificity of 63%. The confusion matrix of the +model, calculated for 8080 instances, is provided in Table 1. + +Table 1: Confusion matrix of the RF model + + Predicted genotoxicity + ----------------------- ------------------------ ---------- ---------- ------------- + Measured genotoxicity ***PP*** ***PN*** ***Total*** + ***TP*** 2274 1163 3437 + ***TN*** 1736 2907 4643 + ***Total*** 4010 4070 8080 + +PP: Predicted positive; PN: Predicted negative, TP: True positive, TN: +True negative + +Support Vector Machines +----------------------- + +The validation showed that the SVM model has an accuracy of 62%, a +sensitivity of 65% and a specificity of 60%. The confusion matrix of SVM +model, calculated for 8080 instances, is provided in Table 2. + +Table 2: Confusion matrix of the SVM model + + Predicted genotoxicity + ----------------------- ------------------------ ---------- ---------- ------------- + Measured genotoxicity ***PP*** ***PN*** ***Total*** + ***TP*** 2057 1107 3164 + ***TN*** 1953 2963 4916 + ***Total*** 4010 4070 8080 + +PP: Predicted positive; PN: Predicted negative, TP: True positive, TN: +True negative + +Deep Learning (R-project) +------------------------- + +The validation showed that the DL model generated in R has an accuracy +of 59%, a sensitivity of 89% and a specificity of 30%. The confusion +matrix of the model, normalised to 8080 instances, is provided in Table +3. + +Table 3: Confusion matrix of the DL model (R-project) + + Predicted genotoxicity + ----------------------- ------------------------ ---------- ---------- ------------- + Measured genotoxicity ***PP*** ***PN*** ***Total*** + ***TP*** 3575 435 4010 + ***TN*** 2853 1217 4070 + ***Total*** 6428 1652 8080 + +PP: Predicted positive; PN: Predicted negative, TP: True positive, TN: +True negative + +DL model (TensorFlow) +--------------------- + +The validation showed that the DL model generated in TensorFlow has an +accuracy of 68%, a sensitivity of 70% and a specificity of 46%. The +confusion matrix of the model, normalised to 8080 instances, is provided +in Table 4. + +Table 4: Confusion matrix of the DL model (TensorFlow) + + Predicted genotoxicity + ----------------------- ------------------------ ---------- ---------- ------------- + Measured genotoxicity ***PP*** ***PN*** ***Total*** + ***TP*** 2851 1227 4078 + ***TN*** 1825 2177 4002 + ***Total*** 4676 3404 8080 + +PP: Predicted positive; PN: Predicted negative, TP: True positive, TN: +True negative + +The ROC curves from the 6-fold validation are shown in Figure 7. + +![](media/image7.png){width="3.825in" +height="2.7327045056867894in"} + +Figure 7: Six-fold cross-validation of TensorFlow DL model show an +average area under the ROC-curve (ROC-AUC; measure of accuracy) of 68%. + +In summary, the validation results of the four methods are presented in +the following table. + +Table 5 Results of the cross-validation of the four models and after +y-randomisation + + ---------------------------------------------------------------------- + Accuracy CCR Sensitivity Specificity + ----------------------- ---------- ------- ------------- ------------- + RF model 64.1% 64.4% 66.2% 62.6% + + SVM model 62.1% 62.6% 65.0% 60.3% + + DL model\ 59.3% 59.5% 89.2% 29.9% + (R-project) + + DL model (TensorFlow) 68% 62.2% 69.9% 45.6% + + y-randomisation 50.5% 50.4% 50.3% 50.6% + ---------------------------------------------------------------------- + +CCR (correct classification rate) + +Discussion +========== + +General model performance + +Based on the results of the cross-validation for all models, `lazar`, RF, +SVM, DL (R-project) and DL (TensorFlow) it can be state that the +prediction results are not optimal due to different reasons. The +accuracy as measured during cross-validation of the four models (RF, +SVM, DL (R-project and TensorFlow)) was partly low with CCR values +between 59.3 and 68%, with the R-generated DL model and the +TensorFlow-generated DL model showing the worst and the best +performance, respectively. The validation of the R-generated DL model +revealed a high sensitivity (89.2%) but an unacceptably low specificity +of 29.9% indicating a high number of false positive estimates. The +TensorFlow-generated DL model, however, showed an acceptable but not +optimal accuracy of 68%, a sensitivity of 69.9% and a specificity of +45.6%. The low specificity indicates that both DL models tends to +predict too many instances as positive (genotoxic), and therefore have a +high false positive rate. This allows at least with the TensorFlow +generated DL model to make group statements, but the confidence for +estimations of single PAs appears to be insufficiently low. + +Several factors have likely contributed to the low to moderate +performance of the used methods as shown during the cross-validation: + +1. The outcome in the training dataset was based on the results of AMES + tests for genotoxicity [ICH 2011](#_ENREF_63)(), an *in vitro* test + in different strains of the bacteria *Salmonella typhimurium*. In + this test, mutagenicity is evaluated with and without prior + metabolic activation of the test substance. Metabolic activation + could result in the formation of genotoxic metabolites from + non-genotoxic parent compounds. However, no distinction was made in + the training dataset between substances that needed metabolic + activation before being mutagenic and those that were mutagenic + without metabolic activation. `lazar` is able to handle this + 'inaccuracy' in the training dataset well due to the way the + algorithm works: `lazar` predicts the genotoxic potential based on the + neighbours of substances with comparable structural features, + considering mutagenic and not mutagenic neighbours. Based on the + structural similarity, a probability for mutagenicity and no + mutagenicity is calculated independently from each other (meaning + that the sum of probabilities does not necessarily adds up to 100%). + The class with the higher outcome is then the overall outcome for + the substance. + +> In contrast, the other models need to be trained first to recognise +> the structural features that are responsible for genotoxicity. +> Therefore, the mixture of substances being mutagenic with and without +> metabolic activation in the training dataset may have adversely +> affected the ability to separate the dataset in two distinct classes +> and thus explains the relatively low performance of these models. + +2. Machine learning algorithms try to find an optimized solution in a + high-dimensional (one dimension per each predictor) space. Sometimes + these methods do not find the global optimum of estimates but only + local (not optimal) solutions. Strategies to find the global + solutions are systematic variation (grid search) of the + hyperparameters of the methods, which may be very time consuming in + particular in large datasets. + + +Conclusions +=========== + +In this study, an attempt was made to predict the genotoxic potential of +PAs using five different machine learning techniques (`lazar`, RF, SVM, DL +(R-project and TensorFlow). The results of all models fitted only partly +to the findings in literature, with best results obtained with the +TensorFlow DL model. Therefore, modelling allows statements on the +relative risks of genotoxicity of the different PA groups. Individual +predictions for selective PAs appear, however, not reliable on the +current basis of the used training dataset. + +This study emphasises the importance of critical assessment of +predictions by QSAR models. This includes not only extensive literature +research to assess the plausibility of the predictions, but also a good +knowledge of the metabolism of the test substances and understanding for +possible mechanisms of toxicity. + +In further studies, additional machine learning techniques or a modified +(extended) training dataset should be used for an additional attempt to +predict the genotoxic potential of PAs. + +References +========== |