discussion started

author: Christoph Helma <helma@in-silico.ch> 2019-10-22 18:24:31 +0200
committer: Christoph Helma <helma@in-silico.ch> 2019-10-22 18:24:31 +0200
commit: ebbd51ce117dc0f5912df6fa25fcce5cb7aaa5fe (patch)
tree: f51933afea3e9433406cbc01c057f1f192be76df
parent: 2e03df94681951a62229b76b52370da094aa1ec6 (diff)
1 files changed, 47 insertions, 0 deletions
diff --git a/mutagenicity.md b/mutagenicity.md
index a9fa116..6c8b7be 100644
--- a/mutagenicity.md
+++ b/mutagenicity.md
@@ -215,6 +215,8 @@ For the Random Forest (RF), Support Vector Machines (SVM), and Deep
 Learning (DL) models, molecular descriptors were calculated
 with the PaDEL-Descriptors program (<http://www.yapcwsoft.com> version 2.21, @Yap2011).
 
+TODO: @Verena PaDEL descriptor description
+
 TODO: sentence ??
 
 From these descriptors were
@@ -436,6 +438,9 @@ predictions is provided in @tbl:lazar-padel-high-confidence.
 ```{#tbl:lazar-padel-high-confidence .table file="tables/lazar-padel-high-confidence.csv" caption="Confusion matrix for high confidence lazar predictions with PaDEL descriptors"}
 ```
 
+Summary
+-------
+
 The results of all crossvalidation experiments are summarized in @tbl:summary.
 
 | |R-RF | R-SVM | R-DL | TF | TF-FS | L | L-HC | L-P | L-P-HC|
@@ -449,10 +454,52 @@ The results of all crossvalidation experiments are summarized in @tbl:summary.
 
 : Summary of crossvalidation results. *R-RF*: R Random Forests, *R-SVM*: R Support Vector Machines, *R-DL*: R Deep Learning, *TF*: TensorFlow without feature selection, *TF-FS*: TensorFlow with feature selection, *L*: lazar, *L-HC*: lazar high confidence predictions, *L-P*: lazar with PaDEL descriptors, *L-P-HC*: lazar PaDEL high confidence predictions, *PPV*: Positive predictive value (Precision), *NPV*: Negative predictive value {#tbl:summary}
 
+TODO ROC curve, also in discussion
 
 Discussion
 ==========
 
+Data
+----
+
+This combined dataset is, to our knowledge, the largest dataset for *Salmonella* mutagenicity. It can be downloaded from TODO
+
+Model performance
+-----------------
+
+lazar best
+slightly fewer predictions (could be a good thing)
+
+There are two major differences between `lazar` and R/TensorFlow models:
+
+- `lazar` uses MolPrint2D fingerprints, while the other models use PaDEL descriptors
+- `lazar` creates local models for each query compound and the other models use a single global model for all predictions
+
+We will discuss both options in the following sections.
+
+Descriptors
+-----------
+
+This study uses two types of descriptors to characterize chemical structures.
+
+MolPrint2D fingerprints (MP2D, @Bender2004) use
+atom environments (i.e. connected atoms for all atoms in a molecule) as
+molecular representation, which basically resembles the chemical concept of
+functional groups. MP2D descriptors are used to determine chemical similarities
+in `lazar`, and previous experiments have shown that they give more accurate results than predefined descriptors (e.g.
+MACCS, FP2-4) for all investigated endpoints.
+
+PaDEL calculates topological and physicochemical descriptors.
+
+TODO: @Verena description
+
+PaDEL descriptors were used for the R and TensorFlow models. In addition, we used PaDEL descriptors to calculate cosine similarities for the `lazar` algorithm and compared the results with standard MP2D similarities. Using PaDEL descriptors led to a significant decrease in `lazar` prediction accuracies. Based on this result we conclude that PaDEL descriptors are less suited for similarity calculations than MP2D descriptors.
+
+To investigate whether MP2D fingerprints are also a better option for global models, we tried to build R and TensorFlow models both with and without unsupervised feature selection. Unfortunately, none of the algorithms was capable of dealing with the large and sparsely populated descriptor matrix. Based on this result we conclude that MP2D descriptors are at the moment unsuitable for standard global machine learning algorithms. Please note that `lazar` does not suffer from the sparseness problem, because (a) it internally utilizes a much more efficient occurrence-based representation and (b) it uses fingerprints only for similarity calculations and not as model parameters.
+
+Algorithms
+----------
+
 General model performance
 
 Based on the results of the cross-validation for all models, `lazar`, RF,
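
Note on the Descriptors section added in this patch: the contrast between an occurrence-based fingerprint representation and a dense descriptor matrix can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not `lazar` code: the feature strings are hypothetical atom-environment identifiers, and Tanimoto (Jaccard) is assumed as the set similarity, which the text above does not confirm. Cosine similarity is the measure the patch names for PaDEL descriptors.

```python
import math


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints stored as sets.

    Assumption: each fingerprint is the set of features that occur in the
    molecule, so absent features cost no memory at all.
    """
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def cosine(u: list, v: list) -> float:
    """Cosine similarity of two dense descriptor vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def to_dense(fingerprint: set, vocabulary: list) -> list:
    """Expand a set fingerprint into one 0/1 column per known feature.

    This is the form a global model needs; with a large vocabulary most
    entries are zero, which is the sparseness problem described above.
    """
    return [1 if feature in fingerprint else 0 for feature in vocabulary]


# Hypothetical atom-environment identifiers, for illustration only.
query = {"C;1-O", "C;2-N", "N;1-C"}
neighbor = {"C;1-O", "N;1-C", "O;1-C", "S;1-C"}

# Only these two molecules define the vocabulary here; a real training set
# would contribute many more features, almost all absent in any one molecule.
vocabulary = sorted(query | neighbor)

set_similarity = tanimoto(query, neighbor)
dense_similarity = cosine(to_dense(query, vocabulary), to_dense(neighbor, vocabulary))
```

The set-based function touches only the features that are present, while `to_dense` grows with the size of the whole feature vocabulary. That is one plausible reading of why similarity calculations remain cheap while a global model on the full matrix does not.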