CDK instead of PaDEL

author: Christoph Helma <helma@in-silico.ch> 2020-12-10 17:14:14 +0100
committer: Christoph Helma <helma@in-silico.ch> 2020-12-10 17:14:14 +0100
commit: ed2625b9b2fde45cfd1739695310d47866b3c0b0 (patch)
tree: 8249000344a9701b52ddf41a36008d9ffb8d940c
parent: ce8db67ce38095e06d2131eced2acfc219661580 (diff)
10 files changed, 47 insertions, 43 deletions
diff --git a/mutagenicity.md b/mutagenicity.md
index 4a7e4b3..b1e576d 100644
--- a/mutagenicity.md
+++ b/mutagenicity.md
@@ -28,7 +28,7 @@ institute:
       address: "Inselspital, 3010 Bern, Switzerland"
 
 bibliography: bibliography.bib
-keywords: mutagenicity, QSAR, lazar, random forest, support vector machine, linear regression, neural nets, deep learning
+keywords: mutagenicity, QSAR, lazar, random forest, support vector machine, linear regression, neural nets, deep learning, pyrrolizidine alkaloids, OpenBabel, CDK
 
 documentclass: scrartcl
 tblPrefix: Table
@@ -42,10 +42,12 @@ header-includes:
 Abstract
 ========
 
-Random forest, support vector machine, logistic regression, neural networks and k-nearest neighbor
-(`lazar`) algorithms, were applied to new *Salmonella* mutagenicity dataset
-with 8309 unique chemical structures. The best prediction accuracies in
-10-fold-crossvalidation were obtained with `lazar` models and MolPrint2D descriptors, that gave accuracies ({{cv.lazar-high-confidence.acc_perc}}%)
+Random forest, support vector machine, logistic regression, neural networks and
+k-nearest neighbor (`lazar`) algorithms, were applied to new *Salmonella*
+mutagenicity dataset with 8309 unique chemical structures. The best prediction
+accuracies in 10-fold-crossvalidation were obtained with `lazar` models and
+MolPrint2D descriptors, that gave accuracies
+({{cv.lazar-high-confidence.acc_perc}}%)
 similar to the interlaboratory variability of the Ames test.
 
 **TODO**: PA results
@@ -123,7 +125,7 @@ pyrrolizidine
 The main objectives of this study were
 
   - to generate a new mutagenicity training dataset, by combining the most comprehensive public datasets
-  - to compare the performance of MolPrint2D (*MP2D*) fingerprints with PaDEL descriptors
+  - to compare the performance of MolPrint2D (*MP2D*) fingerprints with Chemistry Development Kit (*CDK*) descriptors
   - to compare the performance of global QSAR models (random forests (*RF*), support vector machines (*SVM*), logistic regression (*LR*), neural nets (*NN*)) with local models (`lazar`)
   - to apply these models for the prediction of pyrrolizidine alkaloid mutagenicity
 
@@ -211,7 +213,7 @@ its connected atoms to represent their chemical environment.  This resembles
 basically the chemical concept of functional groups.
 
 In contrast to predefined lists of fragments (e.g. FP3, FP4 or MACCs
-fingerprints) or descriptors (e.g PaDEL) they are generated dynamically from
+fingerprints) or descriptors (e.g CDK) they are generated dynamically from
 chemical structures. This has the advantage that they can capture substructures
 of toxicological relevance that are not included in other descriptors. 
 
@@ -224,10 +226,12 @@ of the R and Tensorflow algorithms was capable to use them as descriptors.
 MolPrint2D fingerprints were calculated with the OpenBabel cheminformatics
 library (@OBoyle2011a).
 
-#### PaDEL descriptors
+#### Chemistry Development Kit (*CDK*) descriptors
 
-Molecular 1D and 2D descriptors were calculated
-with the PaDEL-Descriptors program (<http://www.yapcwsoft.com> version 2.21, @Yap2011). 
+Molecular 1D and 2D descriptors were calculated with the PaDEL-Descriptors
+program (<http://www.yapcwsoft.com> version 2.21, @Yap2011).  PaDEL uses the
+Chemistry Development Kit (*CDK*, <https://cdk.github.io/index.html>) library
+for descriptor calculations.
 
 As the training dataset contained over 8309 instances, it was decided to
 delete instances with missing values during data pre-processing.
@@ -253,7 +257,7 @@ number of features, a LASSO (*least absolute shrinkage and selection operator*)
 regression was performed using the '*glmnet*'-function (package '*glmnet*').
 The reduced dataset was used for the generation of the pre-trained models.
 
-PaDEL descriptors were used in global (RF, SVM, LR, NN) and local (`lazar`) models.
+CDK descriptors were used in global (RF, SVM, LR, NN) and local (`lazar`) models.
 
 Algorithms
 ----------
@@ -285,7 +289,7 @@ sections.
 #### Neighbour identification
 
 Utilizing this modularity, similarity calculations were based both on
-MolPrint2D fingerprints and on PaDEL descriptors.
+MolPrint2D fingerprints and on CDK descriptors.
 
 For MolPrint2D fingerprints chemical similarity between two compounds $a$ and
 $b$ is expressed as the proportion between atom environments common in both
@@ -294,7 +298,7 @@ structures $A \cap B$ and the total number of atom environments $A \cup B$
 
 $$sim = \frac{\lvert A\  \cap B \rvert}{\lvert A\  \cup B \rvert}$$
 
-For PaDEL descriptors chemical similarity between two compounds $a$ and $b$ is
+For CDK descriptors chemical similarity between two compounds $a$ and $b$ is
 expressed as the cosine similarity between the descriptor vectors $A$ for $a$
 and $B$ for $b$.
 
@@ -462,7 +466,7 @@ Alternatively, a DL model was established with Python-based Tensorflow
 program (<https://www.tensorflow.org/>) using the high-level API Keras
 (<https://www.tensorflow.org/guide/keras>) to build the models. 
 
-Tensorflow models used the same PaDEL descriptors as the R models.
+Tensorflow models used the same CDK descriptors as the R models.
 
 Validation
 ----------
@@ -480,7 +484,7 @@ Results
 ------------------------
 
 Crossvalidation results are summarized in the following tables: @tbl:lazar
-shows `lazar` results with MolPrint2D and PaDEL descriptors, @tbl:R R results
+shows `lazar` results with MolPrint2D and CDK descriptors, @tbl:R R results
 and @tbl:tensorflow Tensorflow results.
 
 
@@ -505,10 +509,10 @@ https://git.in-silico.ch/mutagenicity-paper/tree/10-fold-crossvalidations/predic
 The most accurate crossvalidation predictions have been obtained with standard
 `lazar` models using MolPrint2D descriptors ({{cv.lazar-high-confidence.acc}}
 for predictions with high confidence, {{cv.lazar-all.acc}} for all
-predictions). Models utilizing PaDEL descriptors have generally lower
+predictions). Models utilizing CDK descriptors have generally lower
 accuracies ranging from {{cv.R-DL.acc}} (R deep learning) to {{cv.R-RF.acc}}
 (R/Tensorflow random forests). Sensitivity and specificity is generally well
-balanced with the exception of `lazar`-PaDEL (low sensitivity) and R deep
+balanced with the exception of `lazar`-CDK (low sensitivity) and R deep
 learning (low specificity) models.
 
 Pyrrolizidine alkaloid mutagenicity predictions 
@@ -529,7 +533,7 @@ downloaded from https://git.in-silico.ch/mutagenicity-paper/tree/tables/pa-table
 
 For the visualisation of the position of pyrrolizidine alkaloids in respect to
 the training data set we have applied t-distributed stochastic neighbor
-embedding (t-SNE, @Maaten2008) for MolPrint2D and PaDEL descriptors.  t-SNE
+embedding (t-SNE, @Maaten2008) for MolPrint2D and CDK descriptors.  t-SNE
 maps each high-dimensional object (chemical) to a two-dimensional point,
 maintaining the high-dimensional distances of the objects. Similar objects are
 represented by nearby points and dissimilar objects are represented by distant
@@ -540,7 +544,7 @@ points.
 
 ![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA)](figures/tsne-mp2d.png){#fig:tsne-mp2d}
 
-@fig:tsne-padel shows the t-SNE of pyrrolizidine alkaloids (PA) and the mutagenicity training data in PaDEL space (Euclidean similarity).
+@fig:tsne-padel shows the t-SNE of pyrrolizidine alkaloids (PA) and the mutagenicity training data in CDK space (Euclidean similarity).
 
 ![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA)](figures/tsne-padel.png){#fig:tsne-padel}
 
@@ -564,7 +568,7 @@ Model performance
 fingerprints) give the most accurate crossvalidation results. R Random Forests,
 Support Vector Machines and Tensorflow models have similar accuracies with
 balanced sensitivity (true position rate) and specificity (true negative rate).
-`lazar` models with PaDEL descriptors have low sensitivity and R Deep Learning
+`lazar` models with CDK descriptors have low sensitivity and R Deep Learning
 models have low specificity.
 
 The accuracy of `lazar` *in-silico* predictions are comparable to the
@@ -577,7 +581,7 @@ analysis of `lazar` lowest observed effect level predictions, which are also
 similar to the experimental variability (@Helma2018).
 
 The lowest number of predictions ({{cv.lazar-padel-high-confidence.n}}) has been
-obtained from `lazar`-PaDEL high confidence predictions, the largest number of
+obtained from `lazar`-CDK high confidence predictions, the largest number of
 predictions comes from Tensorflow models ({{cv.tensorflow-rf.v3.n}}). Standard
 `lazar` give a slightly lower number of predictions ({{cv.lazar-all.n}}) than R
 and Tensorflow models. This is not necessarily a disadvantage, because `lazar`
@@ -610,19 +614,19 @@ utilizes internally a much more efficient occurrence based representation and
 (b) it uses fingerprints only for similarity calculations and not as model
 parameters.
 
-PaDEL calculates topological and physical-chemical descriptors.
+CDK calculates topological and physical-chemical descriptors.
 
 **TODO**: **Verena**, could you please briefly describe the descriptors again
 
-*PaDEL* descriptors were used for `lazar`, R and Tensorflow models.  All models
-based on PaDEL descriptors had similar crossvalidation accuracies that were
+*CDK* descriptors were used for `lazar`, R and Tensorflow models.  All models
+based on CDK descriptors had similar crossvalidation accuracies that were
 significantly lower than `lazar` MolPrint2D results.  Direct comparisons are
-available only for the `lazar` algorithm, and also in this case PaDEL
+available only for the `lazar` algorithm, and also in this case CDK
 accuracies were lower than MolPrint2D accuracies.
 
-Based on `lazar` results we can conclude, that PaDEL descriptors are less
+Based on `lazar` results we can conclude, that CDK descriptors are less
 suited for chemical similarity calculations than MP2D descriptors. It is also
-likely that PaDEL descriptors lead to less accurate predictions for global
+likely that CDK descriptors lead to less accurate predictions for global
 models, but we cannot draw any definitive conclusion in the absence of MP2D
 models.
 
@@ -639,7 +643,7 @@ postulated in the past, that local models are more accurate, because they can
 account better for mechanisms, that affect only a subset of the training data.
 Our results seem to support this assumption, because standard `lazar` models
 with MolPrint2D descriptors perform better than global models. The accuracy of
-`lazar` models with PaDEL descriptors is however substantially lower and
+`lazar` models with CDK descriptors is however substantially lower and
 comparable to global models with the same descriptors.
 
 This observation may lead to the conclusion that the choice of suitable
@@ -714,7 +718,7 @@ non-conflicting CIDs
 118701599
 -->
 
-R RF and SVM models favor very strongly non-mutagenic predictions (only {{pa.r.rf.mut_perc}} and {{pa.r.svm.mut_perc}} % mutagenic PAs), while Tensorflow models classify approximately half of the PAs as mutagenic (RF {{pa.tf.rf.mut_perc}}%, LR-sgd {{pa.tf.lr_sgd}}%, LR-scikit:{{pa.tf.lr_scikit.mut_perc}}, LR-NN:{{pa.tf.nn.mut_perc}}%). `lazar` models predict predominately non-mutagenicity, but to a lesser extend than R models (MP2D:{{pa.lazar.mp2d.all.mut_perc}}, PaDEL:{{pa.lazar.padel.all.mut_perc}}).
+R RF and SVM models favor very strongly non-mutagenic predictions (only {{pa.r.rf.mut_perc}} and {{pa.r.svm.mut_perc}} % mutagenic PAs), while Tensorflow models classify approximately half of the PAs as mutagenic (RF {{pa.tf.rf.mut_perc}}%, LR-sgd {{pa.tf.lr_sgd}}%, LR-scikit:{{pa.tf.lr_scikit.mut_perc}}, LR-NN:{{pa.tf.nn.mut_perc}}%). `lazar` models predict predominately non-mutagenicity, but to a lesser extend than R models (MP2D:{{pa.lazar.mp2d.all.mut_perc}}, CDK:{{pa.lazar.padel.all.mut_perc}}).
 
 It is interesting to note, that different implementations of the same algorithm show little accordance in their prediction (see e.g R-RF vs. Tensorflow-RF and LR-sgd vs. LR-scikit in Table 4 and @tbl:pa-summary).
 
@@ -722,9 +726,9 @@ It is interesting to note, that different implementations of the same algorithm
 
 @fig:tsne-mp2d and @fig:tsne-padel show the t-SNE of training data and pyrrolizidine alkaloids. In @fig:tsne-mp2d the PAs are located closely together at the outer border of the training set. In @fig:tsne-padel they are less clearly separated and spread over the space occupied by the training examples.
 
-This is probably the reason why PaDEL models predicted all instances and the MP2D model only {{pa.lazar.mp2d.all.n}} PAs. Predicting a large number of instances is however not the ultimate goal, we need accurate predictions and an unambiguous estimation of the applicability domain. With PaDEL descriptors *all* PAs are within the applicability domain of the training data, which is unlikely despite the size of the training set. MolPrint2D descriptors provide a clearer separation, which is also reflected in a better separation between high and low confidence predictions in `lazar` MP2D predictions as compared to `lazar` PaDEL predictions. Crossvalidation results with substantially higher accuracies for MP2D models than for PaDEL models also support this argument.
+This is probably the reason why CDK models predicted all instances and the MP2D model only {{pa.lazar.mp2d.all.n}} PAs. Predicting a large number of instances is however not the ultimate goal, we need accurate predictions and an unambiguous estimation of the applicability domain. With CDK descriptors *all* PAs are within the applicability domain of the training data, which is unlikely despite the size of the training set. MolPrint2D descriptors provide a clearer separation, which is also reflected in a better separation between high and low confidence predictions in `lazar` MP2D predictions as compared to `lazar` CDK predictions. Crossvalidation results with substantially higher accuracies for MP2D models than for CDK models also support this argument.
 
-Differences between MP2D and PaDEL descriptors can be explained by their specific properties: PaDEL calculates a fixed set of descriptors for all structures, while MolPrint2D descriptors resemble substructures that are present in a compound. For this reason there is no fixed number of MP2D descriptors, the descriptor space are all unique substructures of the training set. If a query compound contains new substructures, this is immediately reflected in a lower similarity to training compounds, which makes applicability domain estimations very straightforward. With PaDEL (or any other predefined descriptors), the same set of descriptors is calculated for every compound, even if a compound comes from an completely new chemical class. 
+Differences between MP2D and CDK descriptors can be explained by their specific properties: CDK calculates a fixed set of descriptors for all structures, while MolPrint2D descriptors resemble substructures that are present in a compound. For this reason there is no fixed number of MP2D descriptors, the descriptor space are all unique substructures of the training set. If a query compound contains new substructures, this is immediately reflected in a lower similarity to training compounds, which makes applicability domain estimations very straightforward. With CDK (or any other predefined descriptors), the same set of descriptors is calculated for every compound, even if a compound comes from an completely new chemical class. 
 
 From a practical point we still have to face the question, how to choose model predictions, if no experimental data is available (we found two PAs in the training data, but this number is too low, to draw any general conclusions). Based on crossvalidation results and the arguments in favor of MolPrint2D descriptors we would put the highest trust in `lazar` MolPrint2D predictions, especially in high-confidence predictions. `lazar` predictions have a accuracy comparable to experimental variability (@Helma2018) for compounds within the applicability domain. But they should not be trusted blindly. For practical purposes it is important to study the rationales (i.e. neighbors and their experimental activities) for each prediction of relevance. A freely accessible GUI for this purpose has been implemented at https://lazar.in-silico.ch.
 
@@ -847,10 +851,10 @@ Conclusions
 
 A new public *Salmonella* mutagenicity training dataset with 8309 compounds was
 created and used it to train `lazar`, R and Tensorflow models with MolPrint2D
-and PaDEL descriptors. The best performance was obtained with `lazar` models
+and CDK descriptors. The best performance was obtained with `lazar` models
 using MolPrint2D descriptors, with prediction accuracies
 ({{cv.lazar-high-confidence.acc_perc}}%) comparable to the interlaboratory variability
-of the Ames test (80-85%). Models based on PaDEL descriptors had lower
+of the Ames test (80-85%). Models based on CDK descriptors had lower
 accuracies than MolPrint2D models, but only the `lazar` algorithm could use
 MolPrint2D descriptors.
 
diff --git a/mutagenicity.pdf b/mutagenicity.pdf
index 6c258a7..3eac623 100644
--- a/mutagenicity.pdf
+++ b/mutagenicity.pdf
diff --git a/scripts/pa-summary-table.rb b/scripts/pa-summary-table.rb
index 48546bd..049a7ee 100755
--- a/scripts/pa-summary-table.rb
+++ b/scripts/pa-summary-table.rb
@@ -5,8 +5,8 @@ puts "Model,Nr.predictions,mutagenic,non-mutagenic"
 puts "lazar-MP2D (all),#{data[:pa][:lazar][:mp2d][:all][:n]} (#{data[:pa][:lazar][:mp2d][:all][:n_perc]} %),#{data[:pa][:lazar][:mp2d][:all][:mut]} (#{data[:pa][:lazar][:mp2d][:all][:mut_perc]} %),#{data[:pa][:lazar][:mp2d][:all][:non_mut]} (#{data[:pa][:lazar][:mp2d][:all][:non_mut_perc]} %)"
 puts "lazar-MP2D (high-confidence),#{data[:pa][:lazar][:mp2d][:high_confidence][:n]} (#{data[:pa][:lazar][:mp2d][:high_confidence][:n_perc]} %),#{data[:pa][:lazar][:mp2d][:high_confidence][:mut]} (#{data[:pa][:lazar][:mp2d][:high_confidence][:mut_perc]} %),#{data[:pa][:lazar][:mp2d][:high_confidence][:non_mut]} (#{data[:pa][:lazar][:mp2d][:high_confidence][:non_mut_perc]} %)"
 
-puts "lazar-PaDEL (all),#{data[:pa][:lazar][:padel][:all][:n]} (#{data[:pa][:lazar][:padel][:all][:n_perc]} %),#{data[:pa][:lazar][:padel][:all][:mut]} (#{data[:pa][:lazar][:padel][:all][:mut_perc]} %),#{data[:pa][:lazar][:padel][:all][:non_mut]} (#{data[:pa][:lazar][:padel][:all][:non_mut_perc]} %)"
-puts "lazar-PaDEL (high-confidence),#{data[:pa][:lazar][:padel][:high_confidence][:n]} (#{data[:pa][:lazar][:padel][:high_confidence][:n_perc]} %),#{data[:pa][:lazar][:padel][:high_confidence][:mut]} (#{data[:pa][:lazar][:padel][:high_confidence][:mut_perc]} %),#{data[:pa][:lazar][:padel][:high_confidence][:non_mut]} (#{data[:pa][:lazar][:padel][:high_confidence][:non_mut_perc]} %)"
+puts "lazar-CDK (all),#{data[:pa][:lazar][:padel][:all][:n]} (#{data[:pa][:lazar][:padel][:all][:n_perc]} %),#{data[:pa][:lazar][:padel][:all][:mut]} (#{data[:pa][:lazar][:padel][:all][:mut_perc]} %),#{data[:pa][:lazar][:padel][:all][:non_mut]} (#{data[:pa][:lazar][:padel][:all][:non_mut_perc]} %)"
+puts "lazar-CDK (high-confidence),#{data[:pa][:lazar][:padel][:high_confidence][:n]} (#{data[:pa][:lazar][:padel][:high_confidence][:n_perc]} %),#{data[:pa][:lazar][:padel][:high_confidence][:mut]} (#{data[:pa][:lazar][:padel][:high_confidence][:mut_perc]} %),#{data[:pa][:lazar][:padel][:high_confidence][:non_mut]} (#{data[:pa][:lazar][:padel][:high_confidence][:non_mut_perc]} %)"
 
 puts "R-RF,#{data[:pa][:r][:rf][:n]} (#{data[:pa][:r][:rf][:n_perc]} %),#{data[:pa][:r][:rf][:mut]} (#{data[:pa][:r][:rf][:mut_perc]} %),#{data[:pa][:r][:rf][:non_mut]} (#{data[:pa][:r][:rf][:non_mut_perc]} %)"
 puts "R-SVM,#{data[:pa][:r][:svm][:n]} (#{data[:pa][:r][:svm][:n_perc]} %),#{data[:pa][:r][:svm][:mut]} (#{data[:pa][:r][:svm][:mut_perc]} %),#{data[:pa][:r][:svm][:non_mut]} (#{data[:pa][:r][:svm][:non_mut_perc]} %)"
diff --git a/scripts/pa-table.rb b/scripts/pa-table.rb
index 4e5d438..ba7af63 100755
--- a/scripts/pa-table.rb
+++ b/scripts/pa-table.rb
@@ -1,6 +1,6 @@
 #!/usr/bin/env ruby
 
-header = ["ID","CID","SMILES","Canonical SMILES","Measured","lazar-MP2D","lazar-MP2D-high-confidence","lazar-PaDEL","lazar-PaDEL-high-confidence"]
+header = ["ID","CID","SMILES","Canonical SMILES","Measured","lazar-MP2D","lazar-MP2D-high-confidence","lazar-CDK","lazar-CDK-high-confidence"]
 tab = []
 i = 0
 File.read("pyrrolizidine-alkaloids/180920_PA_complete_SMILES.csv").each_line do |l|
diff --git a/scripts/pa-tex-table.rb b/scripts/pa-tex-table.rb
index 840df13..74410b7 100755
--- a/scripts/pa-tex-table.rb
+++ b/scripts/pa-tex-table.rb
@@ -11,7 +11,7 @@ puts '
 \caption{Summary of pyrrolizidine alkaloid predictions: red: mutagen, green: non-mutagen, grey: no prediction, dark red/green: low confidence} \\\\
 \label{tab:pa}
 PubChem   & & \multicolumn{2}{c}{lazar} & \multicolumn{3}{c}{R} & \multicolumn{4}{c}{Tensorflow}\\\\
-CID & Measured & MP2D & PaDEL & DL & RF & SVM & LR-sgd & LR-scikit & NN & RF \\\\
+CID & Measured & MP2D & CDK & DL & RF & SVM & LR-sgd & LR-scikit & NN & RF \\\\
 \hline
 \renewcommand{\arraystretch}{0.075}
 '
diff --git a/scripts/summary2table.rb b/scripts/summary2table.rb
index 267bb97..557dbd4 100755
--- a/scripts/summary2table.rb
+++ b/scripts/summary2table.rb
@@ -12,7 +12,7 @@ when "tensorflow"
   header = ["RF","LR-sgd","LR-scikit","NN"]
   keys = ["rf","lr","lr2","nn"].collect{|n| "tensorflow-"+n+".v3"}
 when "lazar"
-  header = ["MP2D", "PaDEL"]
+  header = ["MP2D", "CDK"]
   mp2dkeys = ["lazar-all","lazar-high-confidence"]
   padelkeys = ["lazar-padel-all","lazar-padel-high-confidence"]
   puts ","+header.join(",")
diff --git a/tables/lazar-summary.csv b/tables/lazar-summary.csv
index 3a0840e..273f710 100644
--- a/tables/lazar-summary.csv
+++ b/tables/lazar-summary.csv
@@ -1,4 +1,4 @@
-,MP2D,PaDEL
+,MP2D,CDK
 Accuracy,0.82/0.84,0.58/0.58
 True positive rate/Sensitivity,0.85/0.89,0.32/0.32
 True negative rate/Specificity,0.78/0.79,0.79/0.79
diff --git a/tables/pa-summary.csv b/tables/pa-summary.csv
index 0bc0e97..6555227 100644
--- a/tables/pa-summary.csv
+++ b/tables/pa-summary.csv
@@ -1,8 +1,8 @@
 Model,Nr.predictions,mutagenic,non-mutagenic
 lazar-MP2D (all),560 (93 %),111 (20 %),449 (80 %)
 lazar-MP2D (high-confidence),301 (50 %),76 (25 %),225 (75 %)
-lazar-PaDEL (all),600 (100 %),83 (14 %),517 (86 %)
-lazar-PaDEL (high-confidence),0 (0 %),0 (0 %),0 (0 %)
+lazar-CDK (all),600 (100 %),83 (14 %),517 (86 %)
+lazar-CDK (high-confidence),0 (0 %),0 (0 %),0 (0 %)
 R-RF,602 (100 %),18 (3 %),584 (97 %)
 R-SVM,602 (100 %),11 (2 %),591 (98 %)
 R-DL,602 (100 %),521 (87 %),81 (13 %)
diff --git a/tables/pa-tab.tex b/tables/pa-tab.tex
index e4fae23..51355b7 100644
--- a/tables/pa-tab.tex
+++ b/tables/pa-tab.tex
@@ -9,7 +9,7 @@
 \caption{Summary of pyrrolizidine alkaloid predictions: red: mutagen, green: non-mutagen, grey: no prediction, dark red/green: low confidence} \\
 \label{tab:pa}
 PubChem   & & \multicolumn{2}{c}{lazar} & \multicolumn{3}{c}{R} & \multicolumn{4}{c}{Tensorflow}\\
-CID & Measured & MP2D & PaDEL & DL & RF & SVM & LR-sgd & LR-scikit & NN & RF \\
+CID & Measured & MP2D & CDK & DL & RF & SVM & LR-sgd & LR-scikit & NN & RF \\
 \hline
 \renewcommand{\arraystretch}{0.075}
 9415 & \cellcolor{grey} & \cellcolor{green} & \cellcolor{darkgreen} & \cellcolor{red} & \cellcolor{red} & \cellcolor{green} & \cellcolor{red} & \cellcolor{red} & \cellcolor{red} & \cellcolor{red} \\
diff --git a/tables/pa-table.csv b/tables/pa-table.csv
index a0a198b..f567931 100644
--- a/tables/pa-table.csv
+++ b/tables/pa-table.csv
@@ -1,4 +1,4 @@
-ID,CID,SMILES,Canonical SMILES,Measured,lazar-MP2D,lazar-MP2D-high-confidence,lazar-PaDEL,lazar-PaDEL-high-confidence,R-DL,R-RF,R-SVM,TF-LR-sgd,TF-LR-scikit,TF-NN,TF-RF
+ID,CID,SMILES,Canonical SMILES,Measured,lazar-MP2D,lazar-MP2D-high-confidence,lazar-CDK,lazar-CDK-high-confidence,R-DL,R-RF,R-SVM,TF-LR-sgd,TF-LR-scikit,TF-NN,TF-RF
 1,9415,"C[C@H]1C(=O)O[C@@H]2CCN3[C@@H]2C(=CC3)COC(=O)[C@]([C@]1(C)O)(C)O","O=C1O[C@@H]2CCN3[C@@H]2C(=CC3)COC(=O)[C@]([C@]([C@H]1C)(C)O)(C)O",,0,T,0,F,1,1,0,1,1,1,1
 2,5281743,"C/C=C\1/C[C@H]([C@@](C(=O)OCC2=CCN3[C@H]2[C@@H](CC3)OC1=O)(CO)O)C","C/C=C\1/C[C@@H](C)[C@](O)(CO)C(=O)OCC2=CCN3[C@H]2[C@H](OC1=O)CC3",,1,T,0,F,1,1,0,1,1,1,1
 3,73614,"C[C@@H]([C@](C(C)C)(C(=O)OCC1=CCN2[C@H]1[C@@H](CC2)O)O)O","O[C@@H]1CCN2[C@@H]1C(=CC2)COC(=O)[C@]([C@@H](O)C)(C(C)C)O",,0,T,0,F,1,0,0,0,1,0,0
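
Note on the similarity measures referenced in the mutagenicity.md hunks above: the text describes a set-based similarity, $|A \cap B| / |A \cup B|$, for MolPrint2D fingerprints and a cosine similarity for CDK descriptor vectors. The following is a minimal Ruby sketch of both measures, intended only as an illustration. The function names `tanimoto` and `cosine`, the zero-handling conventions, and the toy inputs are assumptions made for this example; they are not taken from the `lazar` code base.

```ruby
require "set"

# Set-based similarity (hypothetical helper): the share of atom environments
# common to both fingerprints relative to all atom environments in either.
def tanimoto(a, b)
  a = a.to_set
  b = b.to_set
  union = (a | b).size
  return 0.0 if union.zero? # assumption: two empty fingerprints count as dissimilar
  (a & b).size.to_f / union
end

# Cosine similarity (hypothetical helper) between two numeric descriptor
# vectors of equal length.
def cosine(a, b)
  raise ArgumentError, "vectors differ in length" unless a.size == b.size
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |y| y * y })
  return 0.0 if norm.zero? # assumption: an all-zero vector has no defined direction
  dot / norm
end

# Toy inputs; the fingerprint tokens are placeholders, not real MolPrint2D output.
puts tanimoto(%w[a b c], %w[b c d])  # 2 shared of 4 total
puts cosine([1.0, 0.0], [0.0, 1.0])  # orthogonal vectors
```

The sketch also mirrors the argument made in the discussion hunk: a query compound with substructures unseen in training enlarges the union without enlarging the intersection, so the set-based similarity drops, whereas a fixed-length descriptor vector always yields some cosine value.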