author | Christoph Helma <helma@in-silico.ch> | 2019-10-21 17:29:52 +0200 |
---|---|---|
committer | Christoph Helma <helma@in-silico.ch> | 2019-10-21 17:29:52 +0200 |
commit | 93f2fb17788b9d02b00935e0d1be7cd1d81ff555 (patch) | |
tree | 95ea869bf48bd41bb0d6d341e6cee7f3e01d2c81 /mutagenicity.md | |
parent | 1035124b854e21998d3fd9de4935780a19a2d3d3 (diff) |
mustache preprocessing
Diffstat (limited to 'mutagenicity.md')
-rw-r--r-- | mutagenicity.md | 139 |
1 files changed, 64 insertions, 75 deletions
diff --git a/mutagenicity.md b/mutagenicity.md
index bf4f6d1..2f80bad 100644
--- a/mutagenicity.md
+++ b/mutagenicity.md
@@ -134,8 +134,8 @@ of a compound can be constructed that can be used to calculate chemical
 similarities.

 The chemical similarity between two compounds a and b is expressed as
-the proportion between atom environments common in both structures A ∩ B
-and the total number of atom environments A U B (Jaccard/Tanimoto
+the proportion between atom environments common in both structures $A \cap B$
+and the total number of atom environments $A \cup B$ (Jaccard/Tanimoto
 index).

 $$sim = \frac{\left| A\ \cap B \right|}{\left| A\ \cup B \right|}$$
@@ -335,117 +335,106 @@ Validation
 Results
 =======

-`lazar`
------
+{{#programs}}
+{{name}} Models
+--------
+{{#algos}}

-Random Forest
--------------
+### {{name}}

-The validation showed that the RF model has an accuracy of 64%, a
-sensitivity of 66% and a specificity of 63%. The confusion matrix of the
+10-fold crossvalidation of the {{abbrev}} model gave an accuracy of
+{{accuracy_perc}}%
+a sensitivity of
+{{true_positive_rate_perc}}%
+and a specificity of
+{{true_negative_rate_perc}}%
+The confusion matrix of the
 model, calculated for 8080 instances, is provided in Table 1.

-Table 1: Confusion matrix of the RF model
+```{.table file="tables/R-RF.csv" caption="Confusion matrix for R Random Forest predictions"}
+```
+{{/algos}}
+{{/programs}}

- Predicted genotoxicity
- ----------------------- ------------------------ ---------- ---------- -------------
- Measured genotoxicity ***PP*** ***PN*** ***Total***
- ***TP*** 2274 1163 3437
- ***TN*** 1736 2907 4643
- ***Total*** 4010 4070 8080
+R Models
+--------

-PP: Predicted positive; PN: Predicted negative, TP: True positive, TN:
-True negative
+### Random Forest

-Support Vector Machines
------------------------
+The validation showed that the RF model has an accuracy of
+{{R-RF.accuracy}}%
+`cat /home/ch/src/mutagenicity-paper/10-fold-crossvalidations/summaries/R-RF.json|jq '.accuracy * 100 | round'`{pipe="sh"}%,
+a sensitivity of
+`cat /home/ch/src/mutagenicity-paper/10-fold-crossvalidations/summaries/R-RF.json|jq '.true_positive_rate * 100 | round'`{pipe="sh"}%,
+and a specificity of
+`cat /home/ch/src/mutagenicity-paper/10-fold-crossvalidations/summaries/R-RF.json|jq '.true_negative_rate * 100 | round'`{pipe="sh"}%,
+The confusion matrix of the
+model, calculated for 8080 instances, is provided in Table 1.
+
+```{.table file="tables/R-RF.csv" caption="Confusion matrix for R Random Forest predictions"}
+```
+
+### Support Vector Machines

 The validation showed that the SVM model has an accuracy of 62%, a
 sensitivity of 65% and a specificity of 60%. The confusion matrix of SVM
 model, calculated for 8080 instances, is provided in Table 2.

-Table 2: Confusion matrix of the SVM model
-
- Predicted genotoxicity
- ----------------------- ------------------------ ---------- ---------- -------------
- Measured genotoxicity ***PP*** ***PN*** ***Total***
- ***TP*** 2057 1107 3164
- ***TN*** 1953 2963 4916
- ***Total*** 4010 4070 8080
-PP: Predicted positive; PN: Predicted negative, TP: True positive, TN:
-True negative
+```{.table file="tables/R-SVM.csv" caption="Confusion matrix for R Support Vector Machine predictions"}
+```

-Deep Learning (R-project)
--------------------------
+### Deep Learning

 The validation showed that the DL model generated in R has an accuracy
 of 59%, a sensitivity of 89% and a specificity of 30%. The confusion
 matrix of the model, normalised to 8080 instances, is provided in
 Table 3.

-Table 3: Confusion matrix of the DL model (R-project)
+```{.table file="tables/R-DL.csv" caption="Confusion matrix for R Deep Learning predictions"}
+```

- Predicted genotoxicity
- ----------------------- ------------------------ ---------- ---------- -------------
- Measured genotoxicity ***PP*** ***PN*** ***Total***
- ***TP*** 3575 435 4010
- ***TN*** 2853 1217 4070
- ***Total*** 6428 1652 8080
+```{.table file="tables/r-summary.csv" caption="Summary of R model validations"}
+```

-PP: Predicted positive; PN: Predicted negative, TP: True positive, TN:
-True negative
-
-DL model (TensorFlow)
----------------------
+TensorFlow Models
+-----------------

 The validation showed that the DL model generated in TensorFlow has an
 accuracy of 68%, a sensitivity of 70% and a specificity of 46%. The
 confusion matrix of the model, normalised to 8080 instances, is provided
 in Table 4.

-Table 4: Confusion matrix of the DL model (TensorFlow)
-
- Predicted genotoxicity
- ----------------------- ------------------------ ---------- ---------- -------------
- Measured genotoxicity ***PP*** ***PN*** ***Total***
- ***TP*** 2851 1227 4078
- ***TN*** 1825 2177 4002
- ***Total*** 4676 3404 8080
-
-PP: Predicted positive; PN: Predicted negative, TP: True positive, TN:
-True negative
-
-The ROC curves from the 6-fold validation are shown in Figure 7.
+```{.table file="tables/tensorflow-all.csv" caption="Confusion matrix for Tensorflow predictions without variable selecetion"}
+```

-![](figures/image7.png){width="3.825in"
-height="2.7327045056867894in"}
+```{.table file="tables/tensorflow-selected.csv" caption="Confusion matrix for Tensorflow predictions with variable selecetion"}
+```

-Figure 7: Six-fold cross-validation of TensorFlow DL model show an
-average area under the ROC-curve (ROC-AUC; measure of accuracy) of 68%.
+```{.table file="tables/tf-summary.csv" caption="Summary of TensorFlow model validations"}
+```

-In summary, the validation results of the four methods are presented in
-the following table.

+`lazar` Models
+--------------

-Table 5 Results of the cross-validation of the four models and after
-y-randomisation
+### MolPrint2D Descriptors

- ----------------------------------------------------------------------
- Accuracy CCR Sensitivity Specificity
- ----------------------- ---------- ------- ------------- -------------
- RF model 64.1% 64.4% 66.2% 62.6%
+```{.table file="tables/lazar-all.csv" caption="Confusion matrix for lazar predictions with MolPrint2D descriptors"}
+```

- SVM model 62.1% 62.6% 65.0% 60.3%
+```{.table file="tables/lazar-high-confidence.csv" caption="Confusion matrix for high confidence lazar predictions with MolPrint2D descriptors"}
+```

- DL model\ 59.3% 59.5% 89.2% 29.9%
- (R-project)
+### PaDEL Descriptors

- DL model (TensorFlow) 68% 62.2% 69.9% 45.6%
+```{.table file="tables/lazar-padel-all.csv" caption="Confusion matrix for lazar predictions with PaDEL descriptors"}
+```

- y-randomisation 50.5% 50.4% 50.3% 50.6%
- ----------------------------------------------------------------------
+```{.table file="tables/lazar-padel-high-confidence.csv" caption="Confusion matrix for high confidence lazar predictions with PaDEL descriptors"}
+```

-CCR (correct classification rate)
+```{.table file="tables/lazar-summary.csv" caption="Summary of lazar model validations"}
+```

 Discussion
 ==========
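
Reviewer note: this commit replaces hard-coded percentages with values rendered from summary files, so it may help to check how those figures relate to the confusion matrices. The sketch below recomputes the RF row of the removed Table 5 from the counts in the removed Table 1. It is not part of the commit. The `ConfusionMatrix` class, its field names and the `percent` helper are hypothetical, and it assumes the table row labelled TP holds the measured positives and the row labelled TN the measured negatives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion matrix (illustrative names, not from the repository)."""

    true_pos: int   # measured positive, predicted positive
    false_neg: int  # measured positive, predicted negative
    false_pos: int  # measured negative, predicted positive
    true_neg: int   # measured negative, predicted negative

    @property
    def total(self) -> int:
        return self.true_pos + self.false_neg + self.false_pos + self.true_neg

    @property
    def accuracy(self) -> float:
        # Share of all instances that were predicted correctly.
        return (self.true_pos + self.true_neg) / self.total

    @property
    def sensitivity(self) -> float:
        # True positive rate: correct predictions among measured positives.
        return self.true_pos / (self.true_pos + self.false_neg)

    @property
    def specificity(self) -> float:
        # True negative rate: correct predictions among measured negatives.
        return self.true_neg / (self.true_neg + self.false_pos)

    @property
    def ccr(self) -> float:
        # Correct classification rate: mean of sensitivity and specificity.
        return (self.sensitivity + self.specificity) / 2


def percent(value: float) -> float:
    """Convert a fraction to a percentage rounded to one decimal place."""
    return round(value * 100, 1)


# Counts taken from the removed "Table 1: Confusion matrix of the RF model".
rf = ConfusionMatrix(true_pos=2274, false_neg=1163, false_pos=1736, true_neg=2907)

if __name__ == "__main__":
    for name in ("accuracy", "ccr", "sensitivity", "specificity"):
        print(f"{name}: {percent(getattr(rf, name))}%")
```

Under these assumptions the script reproduces the RF row of the removed Table 5: accuracy 64.1%, CCR 64.4%, sensitivity 66.2%, specificity 62.6%.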