author    Christoph Helma <helma@in-silico.ch> 2020-10-21 00:13:44 +0200
committer Christoph Helma <helma@in-silico.ch> 2020-10-21 00:13:44 +0200
commit 90e674779943891cc7bfdcebcd6ca9e0017cc01d (patch)
tree   14fcca04028bf055db1ce055ad5ee91452cdc417
parent 94f0aa4ecdb3e137590420c0bbd38d15108acec4 (diff)
PA discussion
-rw-r--r-- Makefile           |   8
-rw-r--r-- mutagenicity.md    |  68
-rwxr-xr-x scripts/summary.rb |   9
-rw-r--r-- summary.yaml       | 273
4 files changed, 355 insertions, 3 deletions
diff --git a/Makefile b/Makefile
index 271f680..75e6e6d 100644
--- a/Makefile
+++ b/Makefile
@@ -24,6 +24,7 @@ CONFUSION_MATRICES = $(CONFUSION_MATRICES_DIR)/lazar-all.csv $(CONFUSION_MATRICE
CV_SUMMARY = 10-fold-crossvalidations/summary.yaml
PA_SUMMARY = pyrrolizidine-alkaloids/summary.yaml
+SUMMARY = summary.yaml
# PA predictions
@@ -39,10 +40,10 @@ PA_PREDICTIONS = $(PA_LAZAR_DIR)/pa-mp2d-predictions.csv $(PA_LAZAR_DIR)/pa-pade
TABLES = tables/lazar-summary.csv tables/r-summary.csv tables/tensorflow-summary.csv tables/pa-tab.tex tables/pa-summary.csv
FIGURES = figures/roc.png figures/tsne-mp2d.png figures/tsne-padel.png
-all: $(TABLES) $(FIGURES) $(CV_SUMMARY) mutagenicity.pdf
+all: $(TABLES) $(FIGURES) $(SUMMARY) mutagenicity.pdf
include $(PANDOC_SCHOLAR_PATH)/Makefile
-mutagenicity.mustache.md: $(CV_SUMMARY) mutagenicity.md $(TABLES) $(FIGURES)
+mutagenicity.mustache.md: $(SUMMARY) mutagenicity.md $(TABLES) $(FIGURES)
mustache $^ > $@
# figures
@@ -85,6 +86,9 @@ tables/r-summary.csv: $(CV_SUMMARY)
tables/tensorflow-summary.csv: $(CV_SUMMARY)
scripts/summary2table.rb tensorflow > $@
+$(SUMMARY): $(PA_SUMMARY) $(CV_SUMMARY)
+ scripts/summary.rb $^ > $@
+
# PA summary
$(PA_SUMMARY): tables/pa-table.csv
diff --git a/mutagenicity.md b/mutagenicity.md
index 6abd497..dd1aa77 100644
--- a/mutagenicity.md
+++ b/mutagenicity.md
@@ -654,7 +654,73 @@ with simple set operations.
Pyrrolizidine alkaloid mutagenicity predictions
-----------------------------------------------
-**TODO**: **Verena** I would leave most of the discussion here to you. If you want to discuss lazar results in detail, I can compile detailed predictions (with similar compounds and their activities) for individual examples
+`lazar` models with MolPrint2D descriptors predicted {{pa.lazar.mp2d.all.n_perc}}% of the pyrrolizidine alkaloids (PAs) ({{pa.lazar.mp2d.high_confidence.n_perc}}% with high confidence); the remaining compounds are outside their applicability domain. All other models provided predictions for 100% of the 602 compounds, indicating that all compounds are within their applicability domains.
+
+Mutagenicity predictions from the different models show little agreement in general (Table 4). Only 42 of the 602 PAs received non-conflicting predictions (all of them non-mutagenic).
+Most models predict a predominantly non-mutagenic outcome for PAs, with the exception of the R deep learning (DL) and the Tensorflow Scikit logistic regression models ({{pa.r.dl.mut_perc}}% and {{pa.tf.lr_scikit.mut_perc}}% positive predictions).
+
+<!--
+non-conflicting CIDs
+43040
+186980
+187805
+610955
+3033169
+6429355
+10095536
+10251171
+10577975
+10838897
+10992912
+10996028
+11618501
+11827237
+11827238
+16687858
+73893122
+91747608
+91749688
+91751314
+91752877
+100979630
+100979631
+101648301
+102478913
+148322
+194088
+21626760
+91747610
+91747612
+91749428
+91749448
+102596226
+6440436
+4483893
+5315247
+46930232
+67189194
+91747354
+91749894
+101324794
+118701599
+-->
+
+R RF and SVM models very strongly favor non-mutagenic predictions (only {{pa.r.rf.mut_perc}}% and {{pa.r.svm.mut_perc}}% mutagenic PAs), while Tensorflow models classify approximately half of the PAs as mutagenic (RF: {{pa.tf.rf.mut_perc}}%, LR-sgd: {{pa.tf.lr_sgd.mut_perc}}%, LR-scikit: {{pa.tf.lr_scikit.mut_perc}}%, NN: {{pa.tf.nn.mut_perc}}%). `lazar` models predict predominantly non-mutagenicity, but to a lesser extent than the R models (MP2D: {{pa.lazar.mp2d.all.mut_perc}}%, PaDEL: {{pa.lazar.padel.all.mut_perc}}%).
+
+It is interesting to note that different implementations of the same algorithm show little agreement in their predictions (see e.g. R-RF vs. Tensorflow-RF and LR-sgd vs. LR-scikit in Table 4 and @tab:pa-summary).
+
+**TODO** **Verena, Philipp**: do you have an explanation for this?
+
+@fig:tsne-mp2d and @fig:tsne-padel show t-SNE embeddings of the training data and the pyrrolizidine alkaloids. In @fig:tsne-mp2d the PAs are located close together at the outer border of the training set. In @fig:tsne-padel they are less clearly separated and spread over the space occupied by the training examples.
+
+This is probably the reason why PaDEL models predicted all instances, while the MP2D model predicted only {{pa.lazar.mp2d.all.n}} PAs. Predicting a large number of instances is however not the ultimate goal; we need accurate predictions and an unambiguous estimation of the applicability domain. With PaDEL descriptors *all* PAs are within the applicability domain of the training data, which is unlikely despite the size of the training set. MolPrint2D descriptors provide a clearer separation, which is also reflected in a better separation between high- and low-confidence predictions in `lazar` MP2D predictions as compared to `lazar` PaDEL predictions. Crossvalidation results with substantially higher accuracies for MP2D models than for PaDEL models also support this argument.
+
+Differences between MP2D and PaDEL descriptors can be explained by their specific properties: PaDEL calculates a fixed set of descriptors for all structures, while MolPrint2D descriptors represent the substructures that are present in a compound. For this reason there is no fixed number of MP2D descriptors; the descriptor space consists of all unique substructures of the training set. If a query compound contains new substructures, this is immediately reflected in a lower similarity to training compounds, which makes applicability domain estimations very straightforward. With PaDEL (or any other set of predefined descriptors), the same descriptors are calculated for every compound, even if a compound comes from a completely new chemical class.
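The applicability-domain argument can be illustrated with a small sketch. This treats MolPrint2D-style descriptors simply as sets of substructures and uses a Tanimoto/Jaccard coefficient; the substructure strings and the similarity function are illustrative assumptions, not `lazar`'s actual implementation:

```ruby
require 'set'

# Tanimoto/Jaccard similarity between two substructure sets.
def tanimoto(a, b)
  shared = (a & b).size.to_f
  shared / (a.size + b.size - shared)
end

# Hypothetical set-based descriptors: each compound is the set of
# substructures it contains, so the descriptor space grows with the data.
training    = Set["c:c:c", "C-O", "C=O"]
query_known = Set["c:c:c", "C-O"]          # only substructures seen in training
query_novel = Set["c:c:c", "N-N", "S=O"]   # contains unseen substructures

# Unseen substructures immediately lower the similarity to training
# compounds, pushing the query towards the border of the applicability domain.
puts tanimoto(training, query_known) # higher similarity (within AD)
puts tanimoto(training, query_novel) # lower similarity (towards AD border)
```

With a fixed descriptor set (the PaDEL case) the same vector is computed for any compound, so novelty is not penalized in this direct way.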
+
+From a practical point of view we still face the question of how to choose among model predictions if no experimental data is available (we found two PAs in the training data, but this number is too low to draw any general conclusions). Based on crossvalidation results and the arguments in favor of MolPrint2D descriptors we would put the highest trust in `lazar` MolPrint2D predictions, especially in high-confidence predictions. `lazar` predictions have an accuracy comparable to experimental variability (@Helma2018) for compounds within the applicability domain. But they should not be trusted blindly. For practical purposes it is important to study the rationales (i.e. neighbors and their experimental activities) for each prediction of relevance. A freely accessible GUI for this purpose has been implemented at https://lazar.in-silico.ch.
+
+
+**TODO**: **Verena** If you want to discuss lazar results in detail, I can compile detailed predictions (with similar compounds and their activities) for individual examples
<!---
Due to the low to moderate predictivity of all models, quantitative
diff --git a/scripts/summary.rb b/scripts/summary.rb
new file mode 100755
index 0000000..7a23e2c
--- /dev/null
+++ b/scripts/summary.rb
@@ -0,0 +1,12 @@
+#!/usr/bin/env ruby
+# Merge the YAML summary files given as arguments into a single YAML
+# document on stdout. Note that Hash#merge! is shallow: identical
+# top-level keys in later files override earlier ones.
+require 'yaml'
+
+summary = {}
+ARGV.each do |f|
+  summary.merge!(YAML.load_file(f))
+end
+
+puts summary.to_yaml
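The merge performed by `scripts/summary.rb` can be sketched as follows, a minimal example with inline YAML standing in for the actual summary files. `YAML.safe_load` with `permitted_classes: [Symbol]` is used here because the summaries use symbol keys like `:pa:`, which Psych 4 (Ruby >= 3.1) rejects by default:

```ruby
require 'yaml'

# Inline stand-ins for the PA and crossvalidation summary files.
pa = YAML.safe_load("---\n:pa:\n  :n: 602\n", permitted_classes: [Symbol])
cv = YAML.safe_load("---\n:cv:\n  lazar-all:\n    :n: 7781\n",
                    permitted_classes: [Symbol])

# Same shallow merge as the script: later files override identical
# top-level keys rather than being deep-merged.
summary = {}
[pa, cv].each { |h| summary.merge!(h) }

puts summary.to_yaml
```

Since the two inputs use disjoint top-level keys (`:pa` and `:cv`), the shallow merge is sufficient here; overlapping keys would require a deep merge.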
diff --git a/summary.yaml b/summary.yaml
new file mode 100644
index 0000000..07bcd56
--- /dev/null
+++ b/summary.yaml
@@ -0,0 +1,273 @@
+---
+:pa:
+ :n: 602
+ :lazar:
+ :mp2d:
+ :all:
+ :n: 560
+ :mut: 111
+ :non_mut: 449
+ :n_perc: 93
+ :mut_perc: 20
+ :non_mut_perc: 80
+ :high_confidence:
+ :n: 301
+ :mut: 76
+ :non_mut: 225
+ :n_perc: 50
+ :mut_perc: 25
+ :non_mut_perc: 75
+ :padel:
+ :all:
+ :n: 600
+ :mut: 83
+ :non_mut: 517
+ :n_perc: 100
+ :mut_perc: 14
+ :non_mut_perc: 86
+ :high_confidence:
+ :n: 0
+ :mut: 0
+ :non_mut: 0
+ :n_perc: 0
+ :mut_perc: 0
+ :non_mut_perc: 0
+ :r:
+ :rf:
+ :n: 602
+ :mut: 18
+ :non_mut: 584
+ :n_perc: 100
+ :mut_perc: 3
+ :non_mut_perc: 97
+ :svm:
+ :n: 602
+ :mut: 11
+ :non_mut: 591
+ :n_perc: 100
+ :mut_perc: 2
+ :non_mut_perc: 98
+ :dl:
+ :n: 602
+ :mut: 521
+ :non_mut: 81
+ :n_perc: 100
+ :mut_perc: 87
+ :non_mut_perc: 13
+ :tf:
+ :rf:
+ :n: 602
+ :mut: 186
+ :non_mut: 416
+ :n_perc: 100
+ :mut_perc: 31
+ :non_mut_perc: 69
+ :lr_sgd:
+ :n: 602
+ :mut: 286
+ :non_mut: 316
+ :n_perc: 100
+ :mut_perc: 48
+ :non_mut_perc: 52
+ :lr_scikit:
+ :n: 602
+ :mut: 395
+ :non_mut: 207
+ :n_perc: 100
+ :mut_perc: 66
+ :non_mut_perc: 34
+ :nn:
+ :n: 602
+ :mut: 295
+ :non_mut: 307
+ :n_perc: 100
+ :mut_perc: 49
+ :non_mut_perc: 51
+:cv:
+ lazar-all:
+ :tp: 3326
+ :fp: 833
+ :tn: 3039
+ :fn: 583
+ :n: 7781
+ :acc: 0.82
+ :tpr: 0.85
+ :fpr: 0.22
+ :tnr: 0.78
+ :ppv: 0.8
+ :npv: 0.84
+ :acc_perc: 82
+ :tpr_perc: 85
+ :tnr_perc: 78
+ :ppv_perc: 80
+ :npv_perc: 84
+ lazar-high-confidence:
+ :tp: 2816
+ :fp: 571
+ :tn: 2138
+ :fn: 365
+ :n: 5890
+ :acc: 0.84
+ :tpr: 0.89
+ :fpr: 0.21
+ :tnr: 0.79
+ :ppv: 0.83
+ :npv: 0.85
+ :acc_perc: 84
+ :tpr_perc: 89
+ :tnr_perc: 79
+ :ppv_perc: 83
+ :npv_perc: 85
+ lazar-padel-all:
+ :tp: 593
+ :fp: 466
+ :tn: 1777
+ :fn: 1253
+ :n: 4089
+ :acc: 0.58
+ :tpr: 0.32
+ :fpr: 0.21
+ :tnr: 0.79
+ :ppv: 0.56
+ :npv: 0.59
+ :acc_perc: 58
+ :tpr_perc: 32
+ :tnr_perc: 79
+ :ppv_perc: 56
+ :npv_perc: 59
+ lazar-padel-high-confidence:
+ :tp: 593
+ :fp: 466
+ :tn: 1771
+ :fn: 1251
+ :n: 4081
+ :acc: 0.58
+ :tpr: 0.32
+ :fpr: 0.21
+ :tnr: 0.79
+ :ppv: 0.56
+ :npv: 0.59
+ :acc_perc: 58
+ :tpr_perc: 32
+ :tnr_perc: 79
+ :ppv_perc: 56
+ :npv_perc: 59
+ R-RF:
+ :tp: 2259
+ :fp: 1173
+ :tn: 2897
+ :fn: 1741
+ :n: 8070
+ :acc: 0.64
+ :tpr: 0.56
+ :fpr: 0.29
+ :tnr: 0.71
+ :ppv: 0.66
+ :npv: 0.62
+ :acc_perc: 64
+ :tpr_perc: 56
+ :tnr_perc: 71
+ :ppv_perc: 66
+ :npv_perc: 62
+ R-SVM:
+ :tp: 2243
+ :fp: 1353
+ :tn: 2717
+ :fn: 1757
+ :n: 8070
+ :acc: 0.61
+ :tpr: 0.56
+ :fpr: 0.33
+ :tnr: 0.67
+ :ppv: 0.62
+ :npv: 0.61
+ :acc_perc: 61
+ :tpr_perc: 56
+ :tnr_perc: 67
+ :ppv_perc: 62
+ :npv_perc: 61
+ R-DL:
+ :tp: 3517
+ :fp: 3099
+ :tn: 971
+ :fn: 483
+ :n: 8070
+ :acc: 0.56
+ :tpr: 0.88
+ :fpr: 0.76
+ :tnr: 0.24
+ :ppv: 0.53
+ :npv: 0.67
+ :acc_perc: 56
+ :tpr_perc: 88
+ :tnr_perc: 24
+ :ppv_perc: 53
+ :npv_perc: 67
+ tensorflow-rf.v3:
+ :tp: 2362
+ :fp: 1243
+ :tn: 2835
+ :fn: 1640
+ :n: 8080
+ :acc: 0.64
+ :tpr: 0.59
+ :fpr: 0.3
+ :tnr: 0.7
+ :ppv: 0.66
+ :npv: 0.63
+ :acc_perc: 64
+ :tpr_perc: 59
+ :tnr_perc: 70
+ :ppv_perc: 66
+ :npv_perc: 63
+ tensorflow-lr.v3:
+ :tp: 2395
+ :fp: 1427
+ :tn: 2651
+ :fn: 1607
+ :n: 8080
+ :acc: 0.62
+ :tpr: 0.6
+ :fpr: 0.35
+ :tnr: 0.65
+ :ppv: 0.63
+ :npv: 0.62
+ :acc_perc: 62
+ :tpr_perc: 60
+ :tnr_perc: 65
+ :ppv_perc: 63
+ :npv_perc: 62
+ tensorflow-lr2.v3:
+ :tp: 2487
+ :fp: 1497
+ :tn: 2581
+ :fn: 1515
+ :n: 8080
+ :acc: 0.63
+ :tpr: 0.62
+ :fpr: 0.37
+ :tnr: 0.63
+ :ppv: 0.62
+ :npv: 0.63
+ :acc_perc: 63
+ :tpr_perc: 62
+ :tnr_perc: 63
+ :ppv_perc: 62
+ :npv_perc: 63
+ tensorflow-nn.v3:
+ :tp: 2452
+ :fp: 1468
+ :tn: 2610
+ :fn: 1550
+ :n: 8080
+ :acc: 0.63
+ :tpr: 0.61
+ :fpr: 0.36
+ :tnr: 0.64
+ :ppv: 0.63
+ :npv: 0.63
+ :acc_perc: 63
+ :tpr_perc: 61
+ :tnr_perc: 64
+ :ppv_perc: 63
+ :npv_perc: 63