author: Christoph Helma <helma@in-silico.ch> 2021-06-30 12:50:00 +0200
committer: Christoph Helma <helma@in-silico.ch> 2022-08-18 12:51:45 +0200
commit: 66543ccf5724f55e39775a1650b7b36381ae5ea9
tree: f4b40e1774cdc368143b00e60939c12642bdc6f2
parent: 1f956a4963f62c90475ac8e1f713b989b5a99b36
final revision (HEAD, master)
-rw-r--r--  bibliography.bib   2
-rw-r--r--  data.yaml          1
-rw-r--r--  mutagenicity.md    73
-rw-r--r--  mutagenicity.pdf   bin 3301198 -> 3301187 bytes
-rwxr-xr-x  scripts/data.rb    9
5 files changed, 49 insertions, 36 deletions
diff --git a/bibliography.bib b/bibliography.bib
index d69d4a6..ed98a64 100644
--- a/bibliography.bib
+++ b/bibliography.bib
@@ -134,7 +134,7 @@ abstract = {Pyrrolizidine alkaloids (PAs) are among the most potent natural toxi
}
@misc{ECHA2017,
- author ={European Chemicals Agency (ECHA)},
+ author ={{European Chemicals Agency (ECHA)}},
title = {Guidance on Information Requirements and Chemical Safety Assessment, Chapter R.7a: Endpoint specific guidance},
year = 2017,
note ={\url{https://echa.europa.eu/documents/10162/13632/information_requirements_r6_en.pdf}},
diff --git a/data.yaml b/data.yaml
index 7eed171..55153be 100644
--- a/data.yaml
+++ b/data.yaml
@@ -238,6 +238,7 @@
:tnr_perc: 82
:ppv_perc: 82
:npv_perc: 82
+ :n_mult: 19
:n: 8309
:n_uniq: 8290
:cdk:
diff --git a/mutagenicity.md b/mutagenicity.md
index eb0ce3c..98e25a5 100644
--- a/mutagenicity.md
+++ b/mutagenicity.md
@@ -62,11 +62,11 @@ Introduction
============
The assessment of mutagenicity is an important part in the safety assessment of
-chemical structures, because mutations may lead to cancer and germ
-cells damage. The *Salmonella typhimurium* bacterial reverse mutation
-test (Ames test) is capable to identify substances that cause mutations (e.g.,
-base-pair substitutions, frameshifts, insertions, deletions) and is generally
-used as the first step in genotoxicity and carcinogenicity assessments.
+chemical structures, because mutations may lead to cancer and germ cells
+damage. The bacterial reverse mutation test (Ames test) is capable to identify
+substances that cause mutations (e.g., base-pair substitutions, frameshifts,
+insertions, deletions) and is generally used as the first step in genotoxicity
+and carcinogenicity assessments.
Computer based (*in silico*) mutagenicity predictions can be used in the early
screening of novel compounds (e.g. drug candidates), but they are also gaining
@@ -75,7 +75,7 @@ REACH (@ECHA2017) or the assessment of impurities in pharmaceuticals (ICH M7
guideline, Harmonisation of Technical Requirements for Pharmaceuticals for
Human Use @ICH2017).
-Currently, *Salmonella* mutagenicity is the toxicological endpoint with the
+Currently, mutagenicity is the toxicological endpoint with the
largest amount of public data for almost 10000 structures, whereas datasets for
other endpoints contain typically only a few hundred compounds. The Ames test
itself is relatively reproducible with an interlaboratory variability of 80-85%
@@ -89,7 +89,7 @@ of overfitting experimental errors.
Within this study we attempted
- - to generate a new public mutagenicity training dataset, by combining the most comprehensive public datasets
+ - to generate a new public mutagenicity training dataset focusing on *Salmonella typhimurium*, by combining the most comprehensive public datasets
- to compare the performance of MolPrint2D (*MP2D*) fingerprints with Chemistry Development Kit (*CDK*) descriptors for mutagenicity predictions
- to compare the performance of global QSAR models (random forests (*RF*), support vector machines (*SVM*), logistic regression (*LR*), neural nets (*NN*)) with local models (`lazar`)
@@ -109,20 +109,23 @@ In mammals, PAs are mainly metabolized in the liver. There are three principal m
- Detoxification by hydrolysis of the ester bond on positions C7 and C9 by non-specific esterases to release necine base and necic acid. 
-- N-oxidation of the necine base to form a PA N-oxides, which can be either conjugated by phase II enzymes and then excreted or converted back into the corresponding parent PA (@Wang2005). This detoxification pathway is not possible for otonecine-type PAs, as they are N-methylated (see @fig:pa-schema).
+- N-oxidation of the necine base to form PA N-oxides, which can be either conjugated by phase II enzymes and then excreted or converted back into the corresponding parent PA (@Wang2005). This detoxification pathway is not possible for otonecine-type PAs, as they are N-methylated (see @fig:pa-schema).
- Metabolic activation or toxification by oxidation (for retronecine-type PAs) or oxidative N-demethylation (for otonecine-type Pas) by cytochromes P450 isoforms CYP2B and 3A (@Lin1998, @Ruan2014).
-The latter reactions result in the formation of dehydropyrrolizidine (DHP) that is highly reactive and causes damage by building adducts with protein, lipids and DNA (@Chen2010). On the other hand, open diesters and macrocyclic PAs have a reduced detoxification due to steric hinderance of the respective esterases (@Ruan2014)
+The latter reactions result in the formation of dehydropyrrolizidine (DHP) that
+is highly reactive and causes damage by building adducts with protein, lipids
+and DNA (@Chen2010). On the other hand, open diesters and macrocyclic PAs have
+a reduced detoxification due to steric hinderance of the respective esterases
+(@Ruan2014).
-Therefore the
-mutagenic probability of PAs is highly dependent on structure of necine
-base and necic acid (@Hadi2021; @Allemang2018, @Louisse2019). However, due to
-limited availability of pure substances, only a limited number of PAs have been
-investigated with regards to their structure-specific mutagenicity and
-experimentally in an Ames test. To overcome this bottleneck, the prediction of
-structure-specific mutagenic probabilities of PAs with different machine learning
-models could provide further insights in the mechanisms.
+Therefore, the mutagenic probability of PAs is highly dependent on the
+structure of necine base and necic acid (@Hadi2021; @Allemang2018,
+@Louisse2019). However, due to limited availability of pure substances, only a
+small number of PAs have been investigated experimentally in an Ames test. To
+overcome this bottleneck, the application of different machine learning models
+to predict mutagenic probabilities based on structures and properties
+could provide further insights into the mutagenicity mechanisms of PAs.
Materials and Methods
=====================
@@ -142,16 +145,16 @@ training dataset was compiled from the following sources:
- EFSA Dataset (695 compounds @EFSA2016): <https://data.europa.eu/euodp/data/storage/f/2017-0719T142131/GENOTOX%20data%20and%20dictionary.xls>
Mutagenicity classifications from Kazius and Hansen datasets were used without
-further processing. According to these publications compounds were classified
-as mutagenic, if at least one positive result has been obtained in *Salmonella
-typhimurium* strains TA98, TA100, TA1535, TA1537, TA97, TA102 and 1538 either
+further processing. According to these publications, compounds were classified
+as mutagenic if at least one positive result has been obtained in *Salmonella
+typhimurium* strains TA97, TA98, TA100, TA102, TA1535, TA1537 and TA1538 either
with or without metabolic activation by S9. *E. coli* results were not
considered in these databases. To achieve consistency with these datasets, EFSA
compounds were classified as mutagenic, if at least one positive result was
-found for TA98 or T100 Salmonella strains either with or without metabolic
-activation. The complete dataset contains chemicals for very diverse
-application areas (e.g. pharmaceuticals, pesticides, industrial chemicals,
-environmental contaminants).
+found for the same *Salmonella* strains either with or without metabolic
+activation and as non-mutagenic if no positive result was found. The complete
+dataset contains chemicals from very diverse application areas (e.g.
+pharmaceuticals, pesticides, industrial chemicals, environmental contaminants).
Dataset merges were based on unique SMILES (*Simplified Molecular Input Line
Entry Specification*, @Weininger1989) strings of the compound structures.
@@ -159,7 +162,7 @@ Duplicated experimental data with the same outcome was merged into a single
value, because it is likely that it originated from the same experiment.
Contradictory results were kept as multiple measurements in the database. The
combined training dataset contains {{cv.n_uniq}} unique structures and {{cv.n}}
-individual measurements.
+individual measurements. Contradictory results were found for {{cv.n_mult}} substances.
Source code for all data download, extraction and merge operations is publicly
available from the git repository <https://git.in-silico.ch/mutagenicity-paper>
@@ -170,7 +173,7 @@ under a GPL3 License. The new combined dataset can be found at
The pyrrolizidine alkaloid dataset was created from five independent, necine
base substructure searches in PubChem (https://pubchem.ncbi.nlm.nih.gov/) and
-compared to the PAs listed in the EFSA publication @EFSA2011 and the book by
+compared to the PAs listed in @EFSA2011 and the book by
@Mattocks1986, to ensure, that all major PAs were included. PAs
mentioned in these publications, which were not found in the downloaded
substances were searched individually in PubChem and, if available, downloaded
@@ -182,7 +185,7 @@ Further details about the compilation of the PA dataset are described in @Schoen
The PAs in the dataset were classified according to structural features. A
total of 9 different structural features were assigned to the necine base,
-modifications of the necine base and to the necic acid (@fig:pa-schema):
+to modifications of the necine base and to the necic acid (@fig:pa-schema):
![Structural features of pyrrolizidine alkaloids](figures/PA-Schema.png){#fig:pa-schema}
@@ -222,8 +225,8 @@ descriptors. In addition, they allow the efficient calculation of chemical
similarities (e.g. Tanimoto indices) with simple set operations.
MolPrint2D fingerprints were calculated with the OpenBabel cheminformatics
-library (@OBoyle2011a) for the complete training dataset with {{cv.n}}
-instances. They can be obtained from the following locations:
+library (@OBoyle2011a) for the complete training dataset with {{cv.n_uniq}}
+unique structures. They can be obtained from the following locations:
*Training data:*
@@ -244,7 +247,7 @@ for descriptor calculations.
As the training dataset contained {{cv.n}} instances, it was decided to
delete all instances where CDK descriptor calculations failed during pre-processing. Furthermore,
-all substances with contradictory experimental mutagenicity data were removed. The final training dataset
+{{cv.n_mult}} substances with contradictory experimental results were removed. The final training dataset
contained {{cv.cdk.n_descriptors}} descriptors for {{cv.cdk.n_compounds}}
compounds.
@@ -272,7 +275,7 @@ following basic workflow: For a given chemical structure `lazar`:
compound.
This procedure resembles an automated version of read across predictions
-in toxicology, in machine learning terms it would be classified as a
+in toxicology. In machine learning terms it would be classified as a
k-nearest-neighbour algorithm.
Apart from this basic workflow, `lazar` is completely modular and allows
@@ -399,7 +402,7 @@ used the scikit-learn default values.
#### Logistic regression (SGD) (*LR-sgd*)
-For the logistic regression we used an ensemble of five trained models.
+For the logistic regression we used a combination of five trained models.
For each model we used a batch size of 64 and trained for 50 epochs. As
an optimizer ADAM was chosen. For the other parameters we used the
tensorflow default values.
@@ -411,7 +414,7 @@ default values.
#### Neural Nets (*NN*)
-For the neural network we used an ensemble of five trained models. For
+For the neural network we used a combination of five trained models. For
each model we used a batch size of 64 and trained for 50 epochs. As an
optimizer ADAM was chosen. The neural network had 4 hidden layers with
64 nodes each and a ReLu activation function. For the other parameters
@@ -876,10 +879,10 @@ however a substantially lower number of mutagenicity predictions, despite
similar crossvalidation results and we were unable to identify the reasons for
this discrepancy within this investigation.
-Our data show that large difference exist with regard to genotoxic probabilities
+Our data show that large difference exist with regard to mutagenic probabilities
between different pyrrolizidine subgroups. To adjust risk assessment of
pyrrolizidine contamination, our data supports a tiered risk assessment based
-on *in silico* and experimental data on the relative potency of individual
+on *in silico* predictions and experimental data of individual
pyrrolizidine alkaloids.
References
diff --git a/mutagenicity.pdf b/mutagenicity.pdf
index 32464af..03238c4 100644
--- a/mutagenicity.pdf
+++ b/mutagenicity.pdf
Binary files differ
diff --git a/scripts/data.rb b/scripts/data.rb
index 72e6b28..d24e46b 100755
--- a/scripts/data.rb
+++ b/scripts/data.rb
@@ -6,6 +6,15 @@ data = {}
data.merge!(YAML.load_file(File.join(dir,"summary.yaml")))
end
+mut = {}
+File.readlines("mutagenicity/mutagenicity.csv").each do |line|
+ smi, m = line.chomp.split(",")
+ mut[smi] ||= []
+ mut[smi] << m
+end
+
+data[:cv][:n_mult] = mut.select{|s,m| m.size > 1}.size
+
data[:cv][:n] = `cut -f1 -d ',' mutagenicity/mutagenicity.csv | wc -l`.chomp.to_i - 1
data[:cv][:n_uniq] = `cut -f1 -d ',' mutagenicity/mutagenicity.csv | sort -u | wc -l`.chomp.to_i - 1
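
The `n_mult` value added in `scripts/data.rb` counts SMILES strings that occur on more than one row of `mutagenicity/mutagenicity.csv`. Because the paper states that duplicates with the same outcome were merged beforehand, this should equal the number of substances with contradictory results. The following is a minimal sketch of a stricter variant that compares the outcomes directly. It is not part of this commit: the helper name `contradictory_structures`, the header row and the outcome labels in the sample are assumptions made for illustration.

```ruby
# Hypothetical helper, not part of scripts/data.rb.
# Takes CSV lines of the assumed form "SMILES,outcome" and returns a hash
# mapping each SMILES string with more than one distinct outcome to the
# list of all outcomes recorded for it.
def contradictory_structures(lines, header: true)
  # Skip the header row if the input has one (assumption about the file layout).
  rows = header ? lines.drop(1) : lines
  outcomes = Hash.new { |hash, smiles| hash[smiles] = [] }
  rows.each do |line|
    # Split on the first comma only, as in the original script's simple parsing.
    smiles, outcome = line.chomp.split(",", 2)
    next if smiles.nil? || smiles.empty?
    outcomes[smiles] << outcome
  end
  # Keep only structures whose measurements disagree.
  outcomes.select { |_smiles, values| values.uniq.size > 1 }
end

# Made-up sample data; the labels are placeholders, not values from the dataset.
sample = [
  "SMILES,mutagenicity",
  "CCO,non-mutagenic",
  "c1ccccc1N,mutagenic",
  "c1ccccc1N,non-mutagenic",
  "CC(=O)O,non-mutagenic",
  "CC(=O)O,non-mutagenic",
]
conflicts = contradictory_structures(sample)
puts conflicts.size
```

Unlike the `m.size > 1` test in the commit, this variant would not count a structure whose repeated rows all agree, so the two counts coincide only if same-outcome duplicates were indeed merged upstream.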