more script consolidations

author: Christoph Helma <helma@in-silico.ch> 2021-02-22 23:26:29 +0100
committer: Christoph Helma <helma@in-silico.ch> 2021-02-22 23:26:29 +0100
commit: ed83d4c5347ebf43b2de55782b290b66bada4561 (patch)
tree: ddf3ee1eb6d4f5d250835345798086b5204a23ee /mutagenicity.md
parent: 3af0c3d5c5b7f7d506a4582bbe3dca7d22bbefcc (diff)
1 files changed, 155 insertions, 52 deletions
diff --git a/mutagenicity.md b/mutagenicity.md
index fc58a3d..7ab4b52 100644
--- a/mutagenicity.md
+++ b/mutagenicity.md
@@ -169,17 +169,15 @@ under a GPL3 License. The new combined dataset can be found at
 
 ### Pyrrolizidine alkaloid (PA) dataset
 
-The testing dataset consisted of {{pa.n}} different PAs.
-
-The PA dataset was created from five independent, necine base substructure
-searches in PubChem (https://pubchem.ncbi.nlm.nih.gov/) and compared to the PAs
-listed in the EFSA publication @EFSA2011 and the book by Mattocks
-@Mattocks1986, to ensure, that all major PAs were included. PAs mentioned in
-these publications which were not found in the downloaded substances were
-searched individually in PubChem and, if available, downloaded separately.
-Non-PA substances, duplicates, and isomers were removed from the files, but
-artificial PAs, even if unlikely to occur in nature, were kept. The resulting
-PA dataset comprised a total of {{pa.n}} different PAs.
+The pyrrolizidine alkaloid dataset was created from five independent necine
+base substructure searches in PubChem (https://pubchem.ncbi.nlm.nih.gov/) and
+compared to the PAs listed in the EFSA publication @EFSA2011 and the book by
+Mattocks @Mattocks1986 to ensure that all major PAs were included. PAs
+mentioned in these publications which were not found in the downloaded
+substances were searched individually in PubChem and, if available, downloaded
+separately. Non-PA substances, duplicates, and isomers were removed from the
+files, but artificial PAs, even if unlikely to occur in nature, were kept. The
+resulting PA dataset comprised a total of {{pa.n}} different PAs.
 
 The PAs in the dataset were classified according to structural features. A
 total of 9 different structural features were assigned to the necine base,
@@ -520,24 +518,24 @@ shows results with MolPrint2D descriptors and @tbl:cv-cdk with CDK descriptors.
 
 |  | lazar-HC | lazar-all | RF | LR-sgd | LR-scikit | NN | SVM |
 |:-|----------|-----------|----|--------|-----------|----|-----|
-Accuracy | {{cv.lazar-mp2d-high-confidence.acc_perc}} | {{cv.lazar-mp2d-all.acc_perc}} | {{cv.tensorflow-rf-mp2d.acc_perc}} | {{cv.tensorflow-lr-mp2d.acc_perc}} | {{cv.tensorflow-lr2-mp2d.acc_perc}} | {{cv.tensorflow-nn-mp2d.acc_perc}} | {{cv.tensorflow-svm-mp2d.acc_perc}} |
-True positive rate | {{cv.lazar-mp2d-high-confidence.tpr_perc}} | {{cv.lazar-mp2d-all.tpr_perc}} | {{cv.tensorflow-rf-mp2d.tpr_perc}} | {{cv.tensorflow-lr-mp2d.tpr_perc}} | {{cv.tensorflow-lr2-mp2d.tpr_perc}} | {{cv.tensorflow-nn-mp2d.tpr_perc}} | {{cv.tensorflow-svm-mp2d.tpr_perc}} |
-True negative rate | {{cv.lazar-mp2d-high-confidence.tnr_perc}} | {{cv.lazar-mp2d-all.tnr_perc}} | {{cv.tensorflow-rf-mp2d.tnr_perc}} | {{cv.tensorflow-lr-mp2d.tnr_perc}} | {{cv.tensorflow-lr2-mp2d.tnr_perc}} | {{cv.tensorflow-nn-mp2d.tnr_perc}} | {{cv.tensorflow-svm-mp2d.tnr_perc}} |
-Positive predictive value | {{cv.lazar-mp2d-high-confidence.ppv_perc}} | {{cv.lazar-mp2d-all.ppv_perc}} | {{cv.tensorflow-rf-mp2d.ppv_perc}} | {{cv.tensorflow-lr-mp2d.ppv_perc}} | {{cv.tensorflow-lr2-mp2d.ppv_perc}} | {{cv.tensorflow-nn-mp2d.ppv_perc}} | {{cv.tensorflow-svm-mp2d.ppv_perc}} |
-Negative predictive value | {{cv.lazar-mp2d-high-confidence.npv_perc}} | {{cv.lazar-mp2d-all.npv_perc}} | {{cv.tensorflow-rf-mp2d.npv_perc}} | {{cv.tensorflow-lr-mp2d.npv_perc}} | {{cv.tensorflow-lr2-mp2d.npv_perc}} | {{cv.tensorflow-nn-mp2d.npv_perc}} | {{cv.tensorflow-svm-mp2d.npv_perc}} |
-Nr. predictions | {{cv.lazar-mp2d-high-confidence.n}} | {{cv.lazar-mp2d-all.n}} | {{cv.tensorflow-rf-mp2d.n}} | {{cv.tensorflow-lr-mp2d.n}} | {{cv.tensorflow-lr2-mp2d.n}} | {{cv.tensorflow-nn-mp2d.n}} | {{cv.tensorflow-svm-mp2d.n}} |
+Accuracy | {{cv.mp2d_lazar_high_confidence.acc_perc}} | {{cv.mp2d_lazar_all.acc_perc}} | {{cv.mp2d_rf.acc_perc}} | {{cv.mp2d_lr.acc_perc}} | {{cv.mp2d_lr2.acc_perc}} | {{cv.mp2d_nn.acc_perc}} | {{cv.mp2d_svm.acc_perc}} |
+True positive rate | {{cv.mp2d_lazar_high_confidence.tpr_perc}} | {{cv.mp2d_lazar_all.tpr_perc}} | {{cv.mp2d_rf.tpr_perc}} | {{cv.mp2d_lr.tpr_perc}} | {{cv.mp2d_lr2.tpr_perc}} | {{cv.mp2d_nn.tpr_perc}} | {{cv.mp2d_svm.tpr_perc}} |
+True negative rate | {{cv.mp2d_lazar_high_confidence.tnr_perc}} | {{cv.mp2d_lazar_all.tnr_perc}} | {{cv.mp2d_rf.tnr_perc}} | {{cv.mp2d_lr.tnr_perc}} | {{cv.mp2d_lr2.tnr_perc}} | {{cv.mp2d_nn.tnr_perc}} | {{cv.mp2d_svm.tnr_perc}} |
+Positive predictive value | {{cv.mp2d_lazar_high_confidence.ppv_perc}} | {{cv.mp2d_lazar_all.ppv_perc}} | {{cv.mp2d_rf.ppv_perc}} | {{cv.mp2d_lr.ppv_perc}} | {{cv.mp2d_lr2.ppv_perc}} | {{cv.mp2d_nn.ppv_perc}} | {{cv.mp2d_svm.ppv_perc}} |
+Negative predictive value | {{cv.mp2d_lazar_high_confidence.npv_perc}} | {{cv.mp2d_lazar_all.npv_perc}} | {{cv.mp2d_rf.npv_perc}} | {{cv.mp2d_lr.npv_perc}} | {{cv.mp2d_lr2.npv_perc}} | {{cv.mp2d_nn.npv_perc}} | {{cv.mp2d_svm.npv_perc}} |
+Nr. predictions | {{cv.mp2d_lazar_high_confidence.n}} | {{cv.mp2d_lazar_all.n}} | {{cv.mp2d_rf.n}} | {{cv.mp2d_lr.n}} | {{cv.mp2d_lr2.n}} | {{cv.mp2d_nn.n}} | {{cv.mp2d_svm.n}} |
 
 : Summary of crossvalidation results with MolPrint2D descriptors (lazar-HC: lazar with high confidence, lazar-all: all lazar predictions, RF: random forests, LR-sgd: logistic regression (stochastic gradient descent), LR-scikit: logistic regression (scikit), NN: neural networks, SVM: support vector machines) {#tbl:cv-mp2d}
 
 
 |  | lazar-HC | lazar-all | RF | LR-sgd | LR-scikit | NN | SVM |
 |:-|----------|-----------|----|--------|-----------|----|-----|
-Accuracy | {{cv.lazar-cdk-high-confidence.acc_perc}} | {{cv.lazar-cdk-all.acc_perc}} | {{cv.tensorflow-rf-cdk.acc_perc}} | {{cv.tensorflow-lr-cdk.acc_perc}} | {{cv.tensorflow-lr2-cdk.acc_perc}} | {{cv.tensorflow-nn-cdk.acc_perc}} | {{cv.tensorflow-svm-cdk.acc_perc}} |
-True positive rate | {{cv.lazar-cdk-high-confidence.tpr_perc}} | {{cv.lazar-cdk-all.tpr_perc}} | {{cv.tensorflow-rf-cdk.tpr_perc}} | {{cv.tensorflow-lr-cdk.tpr_perc}} | {{cv.tensorflow-lr2-cdk.tpr_perc}} | {{cv.tensorflow-nn-cdk.tpr_perc}} | {{cv.tensorflow-svm-cdk.tpr_perc}} |
-True negative rate | {{cv.lazar-cdk-high-confidence.tnr_perc}} | {{cv.lazar-cdk-all.tnr_perc}} | {{cv.tensorflow-rf-cdk.tnr_perc}} | {{cv.tensorflow-lr-cdk.tnr_perc}} | {{cv.tensorflow-lr2-cdk.tnr_perc}} | {{cv.tensorflow-nn-cdk.tnr_perc}} | {{cv.tensorflow-svm-cdk.tnr_perc}} |
-Positive predictive value | {{cv.lazar-cdk-high-confidence.ppv_perc}} | {{cv.lazar-cdk-all.ppv_perc}} | {{cv.tensorflow-rf-cdk.ppv_perc}} | {{cv.tensorflow-lr-cdk.ppv_perc}} | {{cv.tensorflow-lr2-cdk.ppv_perc}} | {{cv.tensorflow-nn-cdk.ppv_perc}} | {{cv.tensorflow-svm-cdk.ppv_perc}} |
-Negative predictive value | {{cv.lazar-cdk-high-confidence.npv_perc}} | {{cv.lazar-cdk-all.npv_perc}} | {{cv.tensorflow-rf-cdk.npv_perc}} | {{cv.tensorflow-lr-cdk.npv_perc}} | {{cv.tensorflow-lr2-cdk.npv_perc}} | {{cv.tensorflow-nn-cdk.npv_perc}} | {{cv.tensorflow-svm-cdk.npv_perc}} |
-Nr. predictions | {{cv.lazar-cdk-high-confidence.n}} | {{cv.lazar-cdk-all.n}} | {{cv.tensorflow-rf-cdk.n}} | {{cv.tensorflow-lr-cdk.n}} | {{cv.tensorflow-lr2-cdk.n}} | {{cv.tensorflow-nn-cdk.n}} | {{cv.tensorflow-svm-cdk.n}} |
+Accuracy | {{cv.cdk_lazar_high_confidence.acc_perc}} | {{cv.cdk_lazar_all.acc_perc}} | {{cv.cdk_rf.acc_perc}} | {{cv.cdk_lr.acc_perc}} | {{cv.cdk_lr2.acc_perc}} | {{cv.cdk_nn.acc_perc}} | {{cv.cdk_svm.acc_perc}} |
+True positive rate | {{cv.cdk_lazar_high_confidence.tpr_perc}} | {{cv.cdk_lazar_all.tpr_perc}} | {{cv.cdk_rf.tpr_perc}} | {{cv.cdk_lr.tpr_perc}} | {{cv.cdk_lr2.tpr_perc}} | {{cv.cdk_nn.tpr_perc}} | {{cv.cdk_svm.tpr_perc}} |
+True negative rate | {{cv.cdk_lazar_high_confidence.tnr_perc}} | {{cv.cdk_lazar_all.tnr_perc}} | {{cv.cdk_rf.tnr_perc}} | {{cv.cdk_lr.tnr_perc}} | {{cv.cdk_lr2.tnr_perc}} | {{cv.cdk_nn.tnr_perc}} | {{cv.cdk_svm.tnr_perc}} |
+Positive predictive value | {{cv.cdk_lazar_high_confidence.ppv_perc}} | {{cv.cdk_lazar_all.ppv_perc}} | {{cv.cdk_rf.ppv_perc}} | {{cv.cdk_lr.ppv_perc}} | {{cv.cdk_lr2.ppv_perc}} | {{cv.cdk_nn.ppv_perc}} | {{cv.cdk_svm.ppv_perc}} |
+Negative predictive value | {{cv.cdk_lazar_high_confidence.npv_perc}} | {{cv.cdk_lazar_all.npv_perc}} | {{cv.cdk_rf.npv_perc}} | {{cv.cdk_lr.npv_perc}} | {{cv.cdk_lr2.npv_perc}} | {{cv.cdk_nn.npv_perc}} | {{cv.cdk_svm.npv_perc}} |
+Nr. predictions | {{cv.cdk_lazar_high_confidence.n}} | {{cv.cdk_lazar_all.n}} | {{cv.cdk_rf.n}} | {{cv.cdk_lr.n}} | {{cv.cdk_lr2.n}} | {{cv.cdk_nn.n}} | {{cv.cdk_svm.n}} |
 
 : Summary of crossvalidation results with CDK descriptors (lazar-HC: lazar with high confidence, lazar-all: all lazar predictions, RF: random forests, LR-sgd: logistic regression (stochastic gradient descent), LR-scikit: logistic regression (scikit), NN: neural networks, SVM: support vector machines) {#tbl:cv-cdk}
 
@@ -553,7 +551,7 @@ https://git.in-silico.ch/mutagenicity-paper/tree/crossvalidations/predictions/.
 With exception of lazar/CDK all investigated algorithm/descriptor combinations
 give accuracies between (80 and 85%) which is equivalent to the experimental
 variability of the *Salmonella typhimurium* mutagenicity bioassay (80-85%,
-@Benigni1988). Sensitivities and specificities are well balanced in all of
+@Benigni1988). Sensitivities and specificities are balanced in all of
 these models.
 
 <!--
@@ -576,6 +574,32 @@ pyrrolizidine alkaloids (PAs) can be downloaded from
 A visual representation of all PA predictions can be found at
 <https://git.in-silico.ch/mutagenicity-paper/tree/pyrrolizidine-alkaloids/pa-predictions.pdf>.
 
+@tbl:pa-mp2d and @tbl:pa-cdk summarise the outcome of pyrrolizidine alkaloid predictions from all models with MolPrint2D and CDK descriptors.
+
+| Model  | mutagenic | non-mutagenic | Nr. predictions |
+|-------:|-----------|---------------|-----------------|
+| lazar-all | {{pa.mp2d_lazar_all.mut_perc}}% ({{pa.mp2d_lazar_all.mut}}) | {{pa.mp2d_lazar_all.non_mut_perc}}% ({{pa.mp2d_lazar_all.non_mut}}) | {{pa.mp2d_lazar_all.n_perc}}% ({{pa.mp2d_lazar_all.n}}) |
+| lazar-HC | {{pa.mp2d_lazar_high_confidence.mut_perc}}% ({{pa.mp2d_lazar_high_confidence.mut}}) | {{pa.mp2d_lazar_high_confidence.non_mut_perc}}% ({{pa.mp2d_lazar_high_confidence.non_mut}}) | {{pa.mp2d_lazar_high_confidence.n_perc}}% ({{pa.mp2d_lazar_high_confidence.n}}) |
+| RF | {{pa.mp2d_rf.mut_perc}}% ({{pa.mp2d_rf.mut}}) | {{pa.mp2d_rf.non_mut_perc}}% ({{pa.mp2d_rf.non_mut}}) | {{pa.mp2d_rf.n_perc}}% ({{pa.mp2d_rf.n}}) |
+| LR-sgd | {{pa.mp2d_lr.mut_perc}}% ({{pa.mp2d_lr.mut}}) | {{pa.mp2d_lr.non_mut_perc}}% ({{pa.mp2d_lr.non_mut}}) | {{pa.mp2d_lr.n_perc}}% ({{pa.mp2d_lr.n}}) |
+| LR-scikit | {{pa.mp2d_lr2.mut_perc}}% ({{pa.mp2d_lr2.mut}}) | {{pa.mp2d_lr2.non_mut_perc}}% ({{pa.mp2d_lr2.non_mut}}) | {{pa.mp2d_lr2.n_perc}}% ({{pa.mp2d_lr2.n}}) |
+| NN | {{pa.mp2d_nn.mut_perc}}% ({{pa.mp2d_nn.mut}}) | {{pa.mp2d_nn.non_mut_perc}}% ({{pa.mp2d_nn.non_mut}}) | {{pa.mp2d_nn.n_perc}}% ({{pa.mp2d_nn.n}}) |
+| SVM | {{pa.mp2d_svm.mut_perc}}% ({{pa.mp2d_svm.mut}}) | {{pa.mp2d_svm.non_mut_perc}}% ({{pa.mp2d_svm.non_mut}}) | {{pa.mp2d_svm.n_perc}}% ({{pa.mp2d_svm.n}}) |
+
+: Summary of MolPrint2D pyrrolizidine alkaloid predictions {#tbl:pa-mp2d}
+
+| Model  | mutagenic | non-mutagenic | Nr. predictions |
+|-------:|-----------|---------------|-----------------|
+| lazar-all | {{pa.cdk_lazar_all.mut_perc}}% ({{pa.cdk_lazar_all.mut}}) | {{pa.cdk_lazar_all.non_mut_perc}}% ({{pa.cdk_lazar_all.non_mut}}) | {{pa.cdk_lazar_all.n_perc}}% ({{pa.cdk_lazar_all.n}}) |
+| lazar-HC | {{pa.cdk_lazar_high_confidence.mut_perc}}% ({{pa.cdk_lazar_high_confidence.mut}}) | {{pa.cdk_lazar_high_confidence.non_mut_perc}}% ({{pa.cdk_lazar_high_confidence.non_mut}}) | {{pa.cdk_lazar_high_confidence.n_perc}}% ({{pa.cdk_lazar_high_confidence.n}}) |
+| RF | {{pa.cdk_rf.mut_perc}}% ({{pa.cdk_rf.mut}}) | {{pa.cdk_rf.non_mut_perc}}% ({{pa.cdk_rf.non_mut}}) | {{pa.cdk_rf.n_perc}}% ({{pa.cdk_rf.n}}) |
+| LR-sgd | {{pa.cdk_lr.mut_perc}}% ({{pa.cdk_lr.mut}}) | {{pa.cdk_lr.non_mut_perc}}% ({{pa.cdk_lr.non_mut}}) | {{pa.cdk_lr.n_perc}}% ({{pa.cdk_lr.n}}) |
+| LR-scikit | {{pa.cdk_lr2.mut_perc}}% ({{pa.cdk_lr2.mut}}) | {{pa.cdk_lr2.non_mut_perc}}% ({{pa.cdk_lr2.non_mut}}) | {{pa.cdk_lr2.n_perc}}% ({{pa.cdk_lr2.n}}) |
+| NN | {{pa.cdk_nn.mut_perc}}% ({{pa.cdk_nn.mut}}) | {{pa.cdk_nn.non_mut_perc}}% ({{pa.cdk_nn.non_mut}}) | {{pa.cdk_nn.n_perc}}% ({{pa.cdk_nn.n}}) |
+| SVM | {{pa.cdk_svm.mut_perc}}% ({{pa.cdk_svm.mut}}) | {{pa.cdk_svm.non_mut_perc}}% ({{pa.cdk_svm.non_mut}}) | {{pa.cdk_svm.n_perc}}% ({{pa.cdk_svm.n}}) |
+
+: Summary of CDK pyrrolizidine alkaloid predictions {#tbl:pa-cdk}
+
 @fig:dhp - @fig:tert display the proportion of positive mutagenicity predictions from all models for the different pyrrolizidine alkaloid groups.
 
 ![Summary of Dehydropyrrolizidine predictions](figures/Dehydropyrrolizidine.png){#fig:dhp}
@@ -626,9 +650,49 @@ public mutagenicity dataset presently available. The new training data can be
 downloaded from
 <https://git.in-silico.ch/mutagenicity-paper/tree/mutagenicity/mutagenicity.csv>.
 
-Model performance
------------------
+Algorithms
+----------
+
+`lazar` is formally a *k-nearest-neighbor* algorithm that searches for similar
+structures for a given compound and calculates the prediction based on the
+experimental data for these structures. The QSAR literature frequently calls
+such models *local models*, because models are generated specifically for each
+query compound. The investigated tensorflow models are in contrast *global models*, i.e. a
+single model is used to make predictions for all compounds. It has been
+postulated in the past that local models are more accurate, because they can
+account better for mechanisms that affect only a subset of the training data.
+
+@tbl:cv-mp2d, @tbl:cv-cdk and @fig:roc show that all models with the exception
+of lazar-CDK have similar crossvalidation accuracies that are comparable to the
+experimental variability of the *Salmonella typhimurium* mutagenicity bioassay
+(80-85% according to @Benigni1988). All of these models have balanced
+sensitivity (true positive rate) and specificity (true negative rate) and
+provide highly significant concordance with experimental data (as determined by
+McNemar's Test). This is a clear indication that *in-silico* predictions can be
+as reliable as the bioassays. Given that the variability of experimental data
+is similar to model variability it is impossible to decide which model gives
+the most accurate predictions, as models with higher accuracies (e.g. NN-CDK)
+might just approximate experimental errors better than more robust models.
+
+`lazar` predictions with CDK descriptors are a notable exception, as they have a
+much lower overall accuracy ({{lazar_all_cdk.acc}}) than all other models.
+`lazar` is essentially a k-nearest-neighbor algorithm (with variable k) and it seems that
+CDK descriptors are not very well suited for chemical similarity calculations.
+We have confirmed this independently by validating k-nn models from the `R
+caret` package, which also give sub-par accuracies (data not shown).
+
+@fig:tsne-cdk is another indication that similarity calculations with CDK
+descriptors are not as useful as fingerprint based similarities, because it
+shows a less clear separation between chemical classes and
+mutagens/non-mutagens than @fig:tsne-mp2d. It seems that more complex models
+than simple k-nn are required to utilize CDK descriptors efficiently.
+
+Our results do not support the assumption that local models are superior to
+global models for classification purposes. For regression models (lowest
+observed effect level) we have found however that local models may outperform
+global models (@Helma2018) with accuracies similar to experimental variability.
 
+<!--
 @tbl:lazar, @tbl:R, @tbl:tensorflow and @fig:roc show that the standard `lazar` algorithm (with MP2D
 fingerprints) give the most accurate crossvalidation results. R Random Forests,
 Support Vector Machines and Tensorflow models have similar accuracies with
@@ -639,11 +703,7 @@ models have low specificity.
 The accuracy of `lazar` *in-silico* predictions are comparable to the
 interlaboratory variability of the Ames test (80-85% according to
 @Benigni1988), especially for predictions with high confidence
-({{cv.lazar-high-confidence.acc_perc}}%). This is a clear indication that
-*in-silico* predictions can be as reliable as the bioassays, if the compounds
-are close to the applicability domain. This conclusion is also supported by our
-analysis of `lazar` lowest observed effect level predictions, which are also
-similar to the experimental variability (@Helma2018).
+({{cv.lazar-high-confidence.acc_perc}}%).
 
 The lowest number of predictions ({{cv.lazar-padel-high-confidence.n}}) has been
 obtained from `lazar`-CDK high confidence predictions, the largest number of
@@ -653,6 +713,7 @@ and Tensorflow models. This is not necessarily a disadvantage, because `lazar`
 abstains from predictions, if the query compound is very dissimilar from the
 compounds in the training set and thus avoids to make predictions for compounds
 out of the applicability domain. 
+-->
 
 Descriptors
 -----------
@@ -665,8 +726,9 @@ connected atom types for all atoms in a molecule) as molecular representation,
 which resembles basically the chemical concept of functional groups. MP2D
 descriptors are used to determine chemical similarities in the default `lazar`
 settings, and previous experiments have shown, that they give more accurate
-results than predefined fragments (e.g.  MACCS, FP2-4).
+results than predefined fingerprints (e.g.  MACCS, FP2-4).
 
+<!--
 In order to investigate, if MP2D fingerprints are also suitable for global
 models we have tried to build R and Tensorflow models, both with and without
 unsupervised feature selection. Unfortunately none of the algorithms was
@@ -678,8 +740,40 @@ unsuitable for standard global machine learning algorithms.
 utilizes internally a much more efficient occurrence based representation and
 (b) it uses fingerprints only for similarity calculations and not as model
 parameters.
+-->
+
+*Chemistry Development Kit* (CDK, @Willighagen2017) descriptors 
+were calculated with the PaDEL graphical interface (@Yap2011). They include 
+1D and 2D topological descriptors as well as physical-chemical properties.
+
+With the exception of `lazar` all investigated algorithms obtained models within
+the experimental variability for both types of descriptors. As discussed before
+CDK descriptors seem to be less suitable for chemical similarity calculations
+than MolPrint2D descriptors.
+
+Given that similar predictive accuracies are obtainable from both types of
+descriptors the choice depends more on practical considerations:
+
+MolPrint2D fragments can be calculated very efficiently for every well defined
+chemical structure with OpenBabel (@OBoyle2011a). CDK descriptor calculations
+are in contrast much more resource intensive and may fail for a significant
+number of compounds ({{cv.cdk.n_failed}} out of {{cv.n_uniq}}).
+
+MolPrint2D fragments are generated dynamically from chemical structures and can
+be used to determine if a compound contains structural features that are absent
+in training data. This feature can be used to determine applicability domains.
+CDK descriptors are in contrast a predefined set of descriptors with
+unknown toxicological relevance.
+
+MolPrint2D fingerprints can be represented very efficiently as sets of features
+that are present in a given compound which makes similarity calculations very
+efficient. Due to the large number of substructures present in training
+compounds, they lead however to large and sparsely populated datasets, if they
+have to be expanded to a binary matrix (e.g. as input for tensorflow models).
+CDK descriptors in contrast always yield matrices with
+{{cv.cdk.n_descriptors}} columns.
 
-CDK calculates topological and physical-chemical descriptors.
+<!--
 
 **TODO**: **Verena** could you please briefly describe the descriptors again
 
@@ -694,18 +788,6 @@ suited for chemical similarity calculations than MP2D descriptors. It is also
 likely that CDK descriptors lead to less accurate predictions for global
 models, but we cannot draw any definitive conclusion in the absence of MP2D
 models.
-
-Algorithms
-----------
-
-`lazar` is formally a *k-nearest-neighbor* algorithm that searches for similar
-structures for a given compound and calculates the prediction based on the
-experimental data for these structures. The QSAR literature calls such models
-frequently *local models*, because models are generated specifically for each
-query compound. R and Tensorflow models are in contrast *global models*, i.e. a
-single model is used to make predictions for all compounds. It has been
-postulated in the past, that local models are more accurate, because they can
-account better for mechanisms, that affect only a subset of the training data.
 Our results seem to support this assumption, because standard `lazar` models
 with MolPrint2D descriptors perform better than global models. The accuracy of
 `lazar` models with CDK descriptors is however substantially lower and
@@ -720,10 +802,30 @@ Neighbour (and thus similarity) based algorithms like `lazar` have a clear
 advantage in this respect over global machine learning algorithms (e.g. RF, SVM,
 LR, NN), because Tanimoto/Jaccard similarities can be calculated efficiently
 with simple set operations. 
+-->
 
 Pyrrolizidine alkaloid mutagenicity predictions
 -----------------------------------------------
 
+@fig:dhp - @fig:tert show a clear differentiation between the different
+pyrrolizidine alkaloid groups. The largest proportion of mutagenic predictions
+was observed for Otonecines {{pa.groups.Otonecine.mut_perc}}%
+({{pa.groups.Otonecine.mut}}/{{pa.groups.Otonecine.n_pred}}), the lowest for
+Monoesters {{pa.groups.Monoester.mut_perc}}%
+({{pa.groups.Monoester.mut}}/{{pa.groups.Monoester.n_pred}}) and N-Oxides
+{{pa.groups.N_oxide.mut_perc}}%
+({{pa.groups.N_oxide.mut}}/{{pa.groups.N_oxide.n_pred}}).
+
+Although most of the models show similar accuracies, sensitivities and
+specificities in crossvalidation experiments some of the models (MP2D-RF, CDK-RF
+and CDK-SVM) predict a lower number of mutagens
+({{pa.cdk_rf.mut_perc}}-{{pa.mp2d_rf.mut_perc}}%) than the majority of the
+models ({{pa.mp2d_svm.mut_perc}}-{{pa.mp2d_lazar_high_confidence.mut_perc}}%
+@tbl:pa-mp2d, @tbl:pa-cdk, @fig:dhp - @fig:tert).
+
+From a practical point of view we still have to face the question of how to choose model predictions if no experimental data is available (we found two PAs in the training data, but this number is too low to draw any general conclusions).
+
+<!--
 `lazar` models with MolPrint2D descriptors predicted {{pa.lazar.mp2d.all.n_perc}}%
 of the pyrrolizidine alkaloids (PAs) ({{pa.lazar.mp2d.high_confidence.n_perc}}%
 with high confidence), the remaining compounds are not within its applicability
@@ -737,7 +839,6 @@ PAs, with exception of the R deep learning (DL) and the Tensorflow Scikit
 logistic regression models ({{pa.tf.dl.mut_perc}} and
 {{pa.tf.lr_scikit.mut_perc}}% positive predictions). 
 
-<!--
 non-conflicting CIDs
 43040
 186980
@@ -781,7 +882,6 @@ non-conflicting CIDs
 91749894
 101324794
 118701599
--->
 
 R RF and SVM models favor very strongly non-mutagenic predictions (only {{pa.r.rf.mut_perc}} and {{pa.r.svm.mut_perc}} % mutagenic PAs), while Tensorflow models classify approximately half of the PAs as mutagenic (RF {{pa.tf.rf.mut_perc}}%, LR-sgd {{pa.tf.lr_sgd}}%, LR-scikit:{{pa.tf.lr_scikit.mut_perc}}, LR-NN:{{pa.tf.nn.mut_perc}}%). `lazar` models predict predominately non-mutagenicity, but to a lesser extend than R models (MP2D:{{pa.lazar.mp2d.all.mut_perc}}, CDK:{{pa.lazar.padel.all.mut_perc}}).
 
@@ -800,13 +900,15 @@ From a practical point we still have to face the question, how to choose model p
 
 **TODO**: **Verena** If you want to discuss lazar results specifically, I can compile detailed predictions (with similar compounds and their activities) for individual examples
 
-<!---
 Due to the low to moderate predictivity of all models, quantitative
 statement on the genotoxicity of single PAs cannot be made with
 sufficient confidence.
 
 The predictions of the SVM model did not fit with the other models or
 literature, and are therefore not further considered in the discussion.
+-->
+
+**TODO**: **Verena** Here is an old text of yours to recycle:
 
 Necic acid
 
@@ -909,14 +1011,16 @@ issues:
     metabolic activation of PAs by microsomal enzymes was the
     sensitivity-limiting step. This could very well mean that this is
     also reflected in the QSAR models.
---->
 
 Conclusions
 ===========
 
 A new public *Salmonella* mutagenicity training dataset with 8309 compounds was
-created and used it to train `lazar`, R and Tensorflow models with MolPrint2D
-and CDK descriptors. The best performance was obtained with `lazar` models
+created and used to train `lazar` and Tensorflow models with MolPrint2D
+and CDK descriptors.
+
+<!---
+The best performance was obtained with `lazar` models
 using MolPrint2D descriptors, with prediction accuracies
 ({{cv.lazar-high-confidence.acc_perc}}%) comparable to the interlaboratory variability
 of the Ames test (80-85%). Models based on CDK descriptors had lower
@@ -925,7 +1029,6 @@ MolPrint2D descriptors.
 
 **TODO**: PA predictions
 
-<!---
 In this study, an attempt was made to predict the genotoxic potential of
 PAs using five different machine learning techniques (LAZAR, RF, SVM, DL
 (R-project and Tensorflow). The results of all models fitted only partly