author: Christoph Helma <helma@in-silico.ch>, 2021-02-18 21:59:37 +0100
committer: Christoph Helma <helma@in-silico.ch>, 2021-02-18 21:59:37 +0100
commit: 3af0c3d5c5b7f7d506a4582bbe3dca7d22bbefcc
tree: 66a0f989c01fdac9085e9d22961fae2de0b568f7 /mutagenicity.md
parent: 9901f99e546619121a5dc9f31e82865198e7b912
further cleanup, detailed pa-predictions separated, text modified until results
Diffstat (limited to 'mutagenicity.md')
-rw-r--r-- mutagenicity.md 201
1 files changed, 119 insertions, 82 deletions
diff --git a/mutagenicity.md b/mutagenicity.md
index aed1978..fc58a3d 100644
--- a/mutagenicity.md
+++ b/mutagenicity.md
@@ -44,11 +44,14 @@ Abstract
Random forest, support vector machine, logistic regression, neural networks and
k-nearest neighbor (`lazar`) algorithms were applied to a new *Salmonella*
-mutagenicity dataset with 8309 unique chemical structures. The best prediction
+mutagenicity dataset with {{cv.n_uniq}} unique chemical structures.
+<!--
+The best prediction
accuracies in 10-fold-crossvalidation were obtained with `lazar` models and
MolPrint2D descriptors, that gave accuracies
({{cv.lazar-high-confidence.acc_perc}}%)
similar to the interlaboratory variability of the Ames test.
+-->
**TODO**: PA results
@@ -156,17 +159,17 @@ Line Entry Specification*) strings of the compound structures.
Duplicated experimental data with the same outcome was merged into a
single value, because it is likely that it originated from the same
experiment. Contradictory results were kept as multiple measurements in
-the database. The combined training dataset contains 8309 unique
-structures.
+the database. The combined training dataset contains {{cv.n_uniq}} unique
+structures and {{cv.n}} individual measurements.
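
As a rough illustration of the merge rule described above (identical outcomes for the same structure are collapsed, contradictory outcomes are kept as separate measurements), a minimal Python sketch could look like this. The record layout, the function name `merge_measurements` and the toy structure/outcome pairs are assumptions made for illustration; they are not taken from the repository, and the outcome values are made up.

```python
from collections import defaultdict


def merge_measurements(records):
    """Group (structure, outcome) pairs by structure.

    Identical outcomes for one structure collapse into a single value;
    contradictory outcomes are kept as multiple measurements.
    Assumes the structure strings are already canonicalised.
    """
    outcomes = defaultdict(set)
    for structure, outcome in records:
        outcomes[structure].add(outcome)
    return {structure: sorted(values) for structure, values in outcomes.items()}


# Hypothetical toy records: (structure string, outcome), values are made up.
records = [
    ("c1ccccc1", 0),
    ("c1ccccc1", 0),  # duplicate with the same outcome: merged
    ("CCO", 0),
    ("CCO", 1),       # contradictory outcome: kept as a second measurement
]

merged = merge_measurements(records)
n_uniq = len(merged)                              # number of unique structures
n = sum(len(values) for values in merged.values())  # number of measurements
```

In this toy case `n_uniq` and `n` play the roles of the `{{cv.n_uniq}}` and `{{cv.n}}` placeholders; how the actual values are computed in the paper's scripts is not shown here.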
Source code for all data download, extraction and merge operations is publicly
available from the git repository <https://git.in-silico.ch/mutagenicity-paper>
under a GPL3 License. The new combined dataset can be found at
-<https://git.in-silico.ch/mutagenicity-paper/tree/data/mutagenicity.csv>.
+<https://git.in-silico.ch/mutagenicity-paper/tree/mutagenicity/mutagenicity.csv>.
### Pyrrolizidine alkaloid (PA) dataset
-The testing dataset consisted of 602 different PAs.
+The testing dataset consisted of {{pa.n}} different PAs.
The PA dataset was created from five independent, necine base substructure
searches in PubChem (https://pubchem.ncbi.nlm.nih.gov/) and compared to the PAs
@@ -176,7 +179,7 @@ these publications which were not found in the downloaded substances were
searched individually in PubChem and, if available, downloaded separately.
Non-PA substances, duplicates, and isomers were removed from the files, but
artificial PAs, even if unlikely to occur in nature, were kept. The resulting
-PA dataset comprised a total of 602 different PAs.
+PA dataset comprised a total of {{pa.n}} different PAs.
The PAs in the dataset were classified according to structural features. A
total of 9 different structural features were assigned to the necine base,
@@ -184,21 +187,21 @@ modifications of the necine base and to the necic acid:
For the necine base, the following structural features were chosen:
- - Retronecine-type (1,2-unsaturated necine base)
- - Otonecine-type (1,2-unsaturated necine base)
- - Platynecine-type (1,2-saturated necine base)
+ - Retronecine-type (1,2-unsaturated necine base, {{pa.groups.Retronecine.n}} compounds)
+ - Otonecine-type (1,2-unsaturated necine base, {{pa.groups.Otonecine.n}} compounds)
+ - Platynecine-type (1,2-saturated necine base, {{pa.groups.Platynecine.n}} compounds)
For the modifications of the necine base, the following structural features were chosen:
- - N-oxide-type
- - Tertiary-type (PAs which were neither from the N-oxide- nor DHP-type)
- - DHP-type (pyrrolic ester)
+ - N-oxide-type ({{pa.groups.N_oxide.n}} compounds)
+ - Tertiary-type (PAs which were neither from the N-oxide- nor DHP-type, {{pa.groups.Tertiary_PA.n}} compounds)
+ - Dehydropyrrolizidine-type (pyrrolic ester, {{pa.groups.Dehydropyrrolizidine.n}} compounds)
For the necic acid, the following structural features were chosen:
- - Monoester-type
- - Open-ring diester-type
- - Macrocyclic diester-type
+ - Monoester-type ({{pa.groups.Monoester.n}} compounds)
+ - Open-ring diester-type ({{pa.groups.Diester.n}} compounds)
+ - Macrocyclic diester-type ({{pa.groups.Macrocyclic_diester.n}} compounds)
The compilation of the PA dataset is described in detail in @Schoening2017.
@@ -214,31 +217,51 @@ basically the chemical concept of functional groups.
In contrast to predefined lists of fragments (e.g. FP3, FP4 or MACCs
fingerprints) or descriptors (e.g. CDK) they are generated dynamically from
-chemical structures. This has the advantage that they can capture substructures
-of toxicological relevance that are not included in other descriptors.
+chemical structures. This has the advantage that they can capture unknown substructures
+of toxicological relevance that are not included in other descriptors. In addition they
+allow the efficient calculation of
+chemical similarities (e.g. Tanimoto indices) with simple set operations.
+
+MolPrint2D fingerprints were calculated with the OpenBabel cheminformatics
+library (@OBoyle2011a). They can be obtained from the following locations:
+
+*Training data:*
+
+ - sparse representation (<https://git.in-silico.ch/mutagenicity-paper/tree/mutagenicity/mp2d/fingerprints.mp2d>)
+ - descriptor matrix (<https://git.in-silico.ch/mutagenicity-paper/tree/mutagenicity/mp2d/mutagenicity-fingerprints.csv.gz>)
+
+*Pyrrolizidine alkaloids:*
+
+ - sparse representation (<https://git.in-silico.ch/mutagenicity-paper/tree/pyrrolizidine-alkaloids/mp2d/fingerprints.mp2d>)
+ - descriptor matrix (<https://git.in-silico.ch/mutagenicity-paper/tree/pyrrolizidine-alkaloids/mp2d/pa-fingerprints.csv.gz>)
-Chemical similarities (e.g. Tanimoto indices) can be calculated very
-efficiently with MolPrint2D fingerprints. Using them as descriptors for global
+<!--
+Using them as descriptors for global
models leads however to huge, sparsely populated matrices that cannot be
handled with traditional machine learning algorithms. In our experiments none
of the R and Tensorflow algorithms was able to use them as descriptors.
-
-MolPrint2D fingerprints were calculated with the OpenBabel cheminformatics
-library (@OBoyle2011a).
+-->
#### Chemistry Development Kit (*CDK*) descriptors
Molecular 1D and 2D descriptors were calculated with the PaDEL-Descriptors
-program (<http://www.yapcwsoft.com> version 2.21, @Yap2011). PaDEL uses the
+program (<http://www.yapcwsoft.com> version 2.21, @Yap2011). PaDEL uses the
Chemistry Development Kit (*CDK*, <https://cdk.github.io/index.html>) library
for descriptor calculations.
-As the training dataset contained over 8309 instances, it was decided to
-delete instances with missing values during data pre-processing.
-Furthermore, substances with equivocal outcome were removed. The final
-training dataset contained 8080 instances with known mutagenic
-potential.
+As the training dataset contained {{cv.n_uniq}} instances, it was decided to
+delete instances with missing values during data pre-processing. Furthermore,
+substances with equivocal outcome were removed. The final training dataset
+contained {{cv.cdk.n_descriptors}} descriptors for {{cv.cdk.n_compounds}}
+compounds.
+CDK training data can be obtained from <https://git.in-silico.ch/mutagenicity-paper/tree/mutagenicity/cdk/mutagenicity-mod-2.new.csv>.
+
+The same procedure was applied for the pyrrolizidine dataset yielding
+ {{pa.cdk.n_descriptors}} descriptors for {{pa.cdk.n_compounds}}
+compounds. CDK features for pyrrolizidine alkaloids are available at <https://git.in-silico.ch/mutagenicity-paper/tree/pyrrolizidine-alkaloids/cdk/PA-Padel-2D_m2.csv>.
+
+<!--
During feature selection, descriptors with near zero variance were removed
using '*NearZeroVar*'-function (package 'caret'). If the percentage of the most
common value was more than 90% or when the frequency ratio of the most common
@@ -258,6 +281,7 @@ regression was performed using the '*glmnet*'-function (package '*glmnet*').
The reduced dataset was used for the generation of the pre-trained models.
CDK descriptors were used in global (RF, SVM, LR, NN) and local (`lazar`) models.
+-->
Algorithms
----------
@@ -357,21 +381,19 @@ individual neighbours.
#### Availability
-- `lazar` experiments for this manuscript:
- <https://git.in-silico.ch/mutagenicity-paper>
- (source code, GPL3)
-
-- `lazar` framework:
- <https://git.in-silico.ch/lazar>
- (source code, GPL3)
-
-- `lazar` GUI:
- <https://git.in-silico.ch/lazar-gui>
- (source code, GPL3)
-
-- Public web interface:
+ - Source code for this manuscript (GPL3):
+ <https://git.in-silico.ch/lazar/tree/?h=mutagenicity-paper>
+
+ - Crossvalidation experiments (GPL3):
+ <https://git.in-silico.ch/lazar/tree/models/?h=mutagenicity-paper>
+
+ - Pyrrolizidine alkaloid predictions (GPL3):
+ <https://git.in-silico.ch/lazar/tree/predictions/?h=mutagenicity-paper>
+
+ - Public web interface:
<https://lazar.in-silico.ch>
+<!--
### R Random Forest, Support Vector Machines, and Deep Learning
The RF, SVM, and DL models were generated using the R
@@ -414,10 +436,6 @@ validation data. This step was repeated 10 times.
![Flowchart of the generation and validation of the models generated in R-project](figures/image1.png){#fig:valid}
-<!--
-**TODO**: **Verena** I left out the *Applicability domain* section, since it ...
--->
-
#### Applicability domain
**TODO**: **Verena**: Which descriptors did you use to calculate the Jaccard index? The Jaccard index requires binary descriptors (e.g. MP2D); with PaDEL descriptors one could calculate e.g. a Euclidean or cosine distance.
@@ -433,9 +451,13 @@ potential of the PA dataset.
#### Availability
R scripts for these experiments can be found in https://git.in-silico.ch/mutagenicity-paper/tree/scripts/R.
+-->
### Tensorflow models
+**TODO**: **Philipp** Could you please complete the following paragraphs and describe the procedure for MP2D/CDK and CV/PA predictions, respectively?
+
+<!--
Data pre-processing was done by rank transformation using the
'*QuantileTransformer*' procedure. A sequential model has been used.
Four layers have been used: input layer, two hidden layers (with 12, 8
@@ -450,7 +472,12 @@ Python 3.6 and Keras.
**TODO**: **Philipp** I left out the old results with feature selection, is that ok? In that case this paragraph would also have to be deleted, right?
-**TODO**: **Philipp** Could you please complete the following paragraphs
+Alternatively, a DL model was established with Python-based Tensorflow
+program (<https://www.tensorflow.org/>) using the high-level API Keras
+(<https://www.tensorflow.org/guide/keras>) to build the models.
+
+Tensorflow models used the same CDK descriptors as the R models.
+-->
#### Random forests (*RF*)
@@ -458,15 +485,9 @@ Python 3.6 and Keras.
#### Logistic regression (scikit) (*LR-scikit*)
-**TODO**: **Philipp, Verena** DL oder NN?
-
#### Neural Nets (*NN*)
-Alternatively, a DL model was established with Python-based Tensorflow
-program (<https://www.tensorflow.org/>) using the high-level API Keras
-(<https://www.tensorflow.org/guide/keras>) to build the models.
-
-Tensorflow models used the same CDK descriptors as the R models.
+#### Support vector machines (*SVM*)
Validation
----------
@@ -475,7 +496,18 @@ Validation
#### Availability
-Jupyter notebooks for these experiments can be found in https://git.in-silico.ch/mutagenicity-paper/tree/scripts/tensorflow.
+Jupyter notebooks for these experiments can be found at the following locations:
+
+*Crossvalidation:*
+
+ - MolPrint2D fingerprints: <https://git.in-silico.ch/mutagenicity-paper/tree/crossvalidations/mp2d/tensorflow>
+ - CDK descriptors: <https://git.in-silico.ch/mutagenicity-paper/tree/crossvalidations/cdk/tensorflow>
+
+*Pyrrolizidine alkaloids:*
+
+ - MolPrint2D fingerprints: <https://git.in-silico.ch/mutagenicity-paper/tree/pyrrolizidine-alkaloids/mp2d/tensorflow>
+ - CDK descriptors: <https://git.in-silico.ch/mutagenicity-paper/tree/pyrrolizidine-alkaloids/cdk/tensorflow>
+ - CDK desc
Results
=======
@@ -483,11 +515,10 @@ Results
10-fold crossvalidations
------------------------
-Crossvalidation results are summarized in the following tables: @tbl:lazar
-shows `lazar` results with MolPrint2D and CDK descriptors, @tbl:R R results
-and @tbl:tensorflow Tensorflow results.
+Crossvalidation results are summarized in the following tables: @tbl:cv-mp2d
+shows results with MolPrint2D descriptors and @tbl:cv-cdk with CDK descriptors.
-| | lazar-HC | lazar-all | RF | LR-sgi | LR-scikit | NN | SVM |
+| | lazar-HC | lazar-all | RF | LR-sgd | LR-scikit | NN | SVM |
|:-|----------|-----------|----|--------|-----------|----|-----|
Accuracy | {{cv.lazar-mp2d-high-confidence.acc_perc}} | {{cv.lazar-mp2d-all.acc_perc}} | {{cv.tensorflow-rf-mp2d.acc_perc}} | {{cv.tensorflow-lr-mp2d.acc_perc}} | {{cv.tensorflow-lr2-mp2d.acc_perc}} | {{cv.tensorflow-nn-mp2d.acc_perc}} | {{cv.tensorflow-svm-mp2d.acc_perc}} |
True positive rate | {{cv.lazar-mp2d-high-confidence.tpr_perc}} | {{cv.lazar-mp2d-all.tpr_perc}} | {{cv.tensorflow-rf-mp2d.tpr_perc}} | {{cv.tensorflow-lr-mp2d.tpr_perc}} | {{cv.tensorflow-lr2-mp2d.tpr_perc}} | {{cv.tensorflow-nn-mp2d.tpr_perc}} | {{cv.tensorflow-svm-mp2d.tpr_perc}} |
@@ -496,10 +527,10 @@ Positive predictive value | {{cv.lazar-mp2d-high-confidence.ppv_perc}} | {{cv.la
Negative predictive value | {{cv.lazar-mp2d-high-confidence.npv_perc}} | {{cv.lazar-mp2d-all.npv_perc}} | {{cv.tensorflow-rf-mp2d.npv_perc}} | {{cv.tensorflow-lr-mp2d.npv_perc}} | {{cv.tensorflow-lr2-mp2d.npv_perc}} | {{cv.tensorflow-nn-mp2d.npv_perc}} | {{cv.tensorflow-svm-mp2d.npv_perc}} |
Nr. predictions | {{cv.lazar-mp2d-high-confidence.n}} | {{cv.lazar-mp2d-all.n}} | {{cv.tensorflow-rf-mp2d.n}} | {{cv.tensorflow-lr-mp2d.n}} | {{cv.tensorflow-lr2-mp2d.n}} | {{cv.tensorflow-nn-mp2d.n}} | {{cv.tensorflow-svm-mp2d.n}} |
-: Summary of crossvalidation results with MolPrint2D descriptors {#tbl:cv-mp2d}
+: Summary of crossvalidation results with MolPrint2D descriptors (lazar-HC: lazar with high confidence, lazar-all: all lazar predictions, RF: random forests, LR-sgd: logistic regression (stochastic gradient descent), LR-scikit: logistic regression (scikit), NN: neural networks, SVM: support vector machines) {#tbl:cv-mp2d}
-| | lazar-HC | lazar-all | RF | LR-sgi | LR-scikit | NN | SVM |
+| | lazar-HC | lazar-all | RF | LR-sgd | LR-scikit | NN | SVM |
|:-|----------|-----------|----|--------|-----------|----|-----|
Accuracy | {{cv.lazar-cdk-high-confidence.acc_perc}} | {{cv.lazar-cdk-all.acc_perc}} | {{cv.tensorflow-rf-cdk.acc_perc}} | {{cv.tensorflow-lr-cdk.acc_perc}} | {{cv.tensorflow-lr2-cdk.acc_perc}} | {{cv.tensorflow-nn-cdk.acc_perc}} | {{cv.tensorflow-svm-cdk.acc_perc}} |
True positive rate | {{cv.lazar-cdk-high-confidence.tpr_perc}} | {{cv.lazar-cdk-all.tpr_perc}} | {{cv.tensorflow-rf-cdk.tpr_perc}} | {{cv.tensorflow-lr-cdk.tpr_perc}} | {{cv.tensorflow-lr2-cdk.tpr_perc}} | {{cv.tensorflow-nn-cdk.tpr_perc}} | {{cv.tensorflow-svm-cdk.tpr_perc}} |
@@ -508,17 +539,24 @@ Positive predictive value | {{cv.lazar-cdk-high-confidence.ppv_perc}} | {{cv.laz
Negative predictive value | {{cv.lazar-cdk-high-confidence.npv_perc}} | {{cv.lazar-cdk-all.npv_perc}} | {{cv.tensorflow-rf-cdk.npv_perc}} | {{cv.tensorflow-lr-cdk.npv_perc}} | {{cv.tensorflow-lr2-cdk.npv_perc}} | {{cv.tensorflow-nn-cdk.npv_perc}} | {{cv.tensorflow-svm-cdk.npv_perc}} |
Nr. predictions | {{cv.lazar-cdk-high-confidence.n}} | {{cv.lazar-cdk-all.n}} | {{cv.tensorflow-rf-cdk.n}} | {{cv.tensorflow-lr-cdk.n}} | {{cv.tensorflow-lr2-cdk.n}} | {{cv.tensorflow-nn-cdk.n}} | {{cv.tensorflow-svm-cdk.n}} |
-: Summary of crossvalidation results with CDK descriptors {#tbl:cv-cdk}
+: Summary of crossvalidation results with CDK descriptors (lazar-HC: lazar with high confidence, lazar-all: all lazar predictions, RF: random forests, LR-sgd: logistic regression (stochastic gradient descent), LR-scikit: logistic regression (scikit), NN: neural networks, SVM: support vector machines) {#tbl:cv-cdk}
@fig:roc depicts the position of all crossvalidation results in receiver operating characteristic (ROC) space.
-![ROC plot of crossvalidation results.](figures/roc.png){#fig:roc}
+![ROC plot of crossvalidation results (lazar-HC: lazar with high confidence, lazar-all: all lazar predictions, RF: random forests, LR-sgd: logistic regression (stochastic gradient descent), LR-scikit: logistic regression (scikit), NN: neural networks, SVM: support vector machines).](figures/roc.png){#fig:roc}
Confusion matrices for all models are available from the git repository
-https://git.in-silico.ch/mutagenicity-paper/tree/10-fold-crossvalidations/confusion-matrices/,
+https://git.in-silico.ch/mutagenicity-paper/tree/crossvalidations/confusion-matrices/,
individual predictions can be found in
-https://git.in-silico.ch/mutagenicity-paper/tree/10-fold-crossvalidations/predictions/.
+https://git.in-silico.ch/mutagenicity-paper/tree/crossvalidations/predictions/.
+With the exception of lazar/CDK, all investigated algorithm/descriptor combinations
+give accuracies between 80 and 85%, which is equivalent to the experimental
+variability of the *Salmonella typhimurium* mutagenicity bioassay (80-85%,
+@Benigni1988). Sensitivities and specificities are well balanced in all of
+these models.
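
For readers who want to recompute the table rows from the published confusion matrices, the following Python sketch derives accuracy, true positive rate (sensitivity), true negative rate (specificity) and the two predictive values from the four confusion-matrix counts. The function name `classification_metrics` and the example counts are hypothetical; they are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return standard binary classification metrics as fractions (0..1).

    tp/fp/tn/fn are the counts of true positives, false positives,
    true negatives and false negatives. No guard against empty
    classes is included, so every denominator must be non-zero.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "tpr": tp / (tp + fn),  # sensitivity
        "tnr": tn / (tn + fp),  # specificity
        "ppv": tp / (tp + fp),  # positive predictive value
        "npv": tn / (tn + fn),  # negative predictive value
    }


# Hypothetical counts, chosen only to illustrate a balanced model.
metrics = classification_metrics(tp=40, fp=10, tn=40, fn=10)
```

With these made-up counts, sensitivity and specificity are equal, which is the kind of balance the paragraph above refers to.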
+
+<!--
The most accurate crossvalidation predictions have been obtained with standard
`lazar` models using MolPrint2D descriptors ({{cv.lazar-high-confidence.acc}}
for predictions with high confidence, {{cv.lazar-all.acc}} for all
@@ -527,27 +565,18 @@ accuracies ranging from {{cv.R-DL.acc}} (R deep learning) to {{cv.R-RF.acc}}
(R/Tensorflow random forests). Sensitivity and specificity is generally well
balanced with the exception of `lazar`-CDK (low sensitivity) and R deep
learning (low specificity) models.
+-->
Pyrrolizidine alkaloid mutagenicity predictions
-----------------------------------------------
-Mutagenicity predictions from all investigated models for 602 pyrrolizidine
-alkaloids (PAs) are shown in Table 4. A CSV table with all predictions can be
-downloaded from https://git.in-silico.ch/mutagenicity-paper/tree/tables/pa-table.csv
-
-**TODO** **Verena and Philipp** Could you please spot-check the table
+Mutagenicity predictions from all investigated models for {{pa.n}}
+pyrrolizidine alkaloids (PAs) can be downloaded from
+<https://git.in-silico.ch/mutagenicity-paper/tree/pyrrolizidine-alkaloids/pa-predictions.csv>.
+A visual representation of all PA predictions can be found at
+<https://git.in-silico.ch/mutagenicity-paper/tree/pyrrolizidine-alkaloids/pa-predictions.pdf>.
-\input{tables/pa-tab.tex}
-
-@tbl:pa-summary summarises the number of positive and negative mutagenicity predictions for all investigated models.
-
-For the visualisation of the position of pyrrolizidine alkaloids with respect to
-the training data set we have applied t-distributed stochastic neighbor
-embedding (t-SNE, @Maaten2008) for MolPrint2D and CDK descriptors. t-SNE
-maps each high-dimensional object (chemical) to a two-dimensional point,
-maintaining the high-dimensional distances of the objects. Similar objects are
-represented by nearby points and dissimilar objects are represented by distant
-points.
+@fig:dhp - @fig:tert display the proportion of positive mutagenicity predictions from all models for the different pyrrolizidine alkaloid groups.
![Summary of Dehydropyrrolizidine predictions](figures/Dehydropyrrolizidine.png){#fig:dhp}
@@ -568,6 +597,14 @@ points.
![Summary of Tertiary PA predictions](figures/Tertiary.PA.png){#fig:tert}
+For the visualisation of the position of pyrrolizidine alkaloids with respect to
+the training data set we have applied t-distributed stochastic neighbor
+embedding (t-SNE, @Maaten2008) for MolPrint2D and CDK descriptors. t-SNE
+maps each high-dimensional object (chemical) to a two-dimensional point,
+maintaining the high-dimensional distances of the objects. Similar objects are
+represented by nearby points and dissimilar objects are represented by distant
+points.
+
@fig:tsne-mp2d shows the t-SNE of pyrrolizidine alkaloids (PA) and the mutagenicity training data in MP2D space (Tanimoto/Jaccard similarity).
![t-SNE visualisation of mutagenicity training data and pyrrolizidine alkaloids (PA)](figures/tsne-mp2d.png){#fig:tsne-mp2d}
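
The Tanimoto/Jaccard similarity that defines the MP2D space can be computed with plain set operations when each fingerprint is treated as a set of fragment identifiers. The sketch below is a minimal Python illustration under that assumption; the function name `tanimoto` and the fragment labels are hypothetical and do not reproduce the actual MolPrint2D fragment syntax or the `lazar` implementation.

```python
def tanimoto(fingerprint_a, fingerprint_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets.

    Returns |A & B| / |A | B|, and 0.0 when both sets are empty
    (a convention chosen for this sketch).
    """
    a, b = set(fingerprint_a), set(fingerprint_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical fragment identifiers, for illustration only.
fp_a = {"frag1", "frag2", "frag3"}
fp_b = {"frag2", "frag3", "frag4"}

similarity = tanimoto(fp_a, fp_b)  # 2 shared fragments out of 4 distinct ones
distance = 1.0 - similarity        # Jaccard distance, usable as a t-SNE input metric
```

Because only set intersection and union are needed, no descriptor matrix has to be built to compare two compounds.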
@@ -583,11 +620,11 @@ Data
----
A new training dataset for *Salmonella* mutagenicity was created from three
-different sources (@Kazius2005, @Hansen2009, @EFSA2016). It contains 8309
+different sources (@Kazius2005, @Hansen2009, @EFSA2016). It contains {{cv.n_uniq}}
unique chemical structures, which is, to our knowledge, the largest
public mutagenicity dataset presently available. The new training data can be
downloaded from
-<https://git.in-silico.ch/mutagenicity-paper/tree/data/mutagenicity.csv>.
+<https://git.in-silico.ch/mutagenicity-paper/tree/mutagenicity/mutagenicity.csv>.
Model performance
-----------------