final revision (HEAD, master)

author: Christoph Helma <helma@in-silico.ch> 2021-06-30 12:50:00 +0200
committer: Christoph Helma <helma@in-silico.ch> 2022-08-18 12:51:45 +0200
commit: 66543ccf5724f55e39775a1650b7b36381ae5ea9 (patch)
tree: f4b40e1774cdc368143b00e60939c12642bdc6f2 /mutagenicity.md
parent: 1f956a4963f62c90475ac8e1f713b989b5a99b36 (diff)
1 files changed, 38 insertions, 35 deletions
diff --git a/mutagenicity.md b/mutagenicity.md
index eb0ce3c..98e25a5 100644
--- a/mutagenicity.md
+++ b/mutagenicity.md
@@ -62,11 +62,11 @@ Introduction
 ============
 
 The assessment of mutagenicity is an important part in the safety assessment of
-chemical structures, because mutations may lead to cancer and germ
-cells damage.  The *Salmonella typhimurium* bacterial reverse mutation
-test (Ames test) is capable to identify substances that cause mutations (e.g.,
-base-pair substitutions, frameshifts, insertions, deletions) and is generally
-used as the first step in genotoxicity and carcinogenicity assessments.
+chemical structures, because mutations may lead to cancer and germ cells
+damage.  The bacterial reverse mutation test (Ames test) is capable to identify
+substances that cause mutations (e.g., base-pair substitutions, frameshifts,
+insertions, deletions) and is generally used as the first step in genotoxicity
+and carcinogenicity assessments.
 
 Computer based (*in silico*) mutagenicity predictions can be used in the early
 screening of novel compounds (e.g. drug candidates), but they are also gaining
@@ -75,7 +75,7 @@ REACH (@ECHA2017) or the assessment of impurities in pharmaceuticals (ICH M7
 guideline, Harmonisation of Technical Requirements for Pharmaceuticals for
 Human Use @ICH2017).
 
-Currently, *Salmonella* mutagenicity is the toxicological endpoint with the
+Currently, mutagenicity is the toxicological endpoint with the
 largest amount of public data for almost 10000 structures, whereas datasets for
 other endpoints contain typically only a few hundred compounds. The Ames test
 itself is relatively reproducible with an interlaboratory variability of 80-85%
@@ -89,7 +89,7 @@ of overfitting experimental errors.
 
 Within this study we attempted
 
-  - to generate a new public mutagenicity training dataset, by combining the most comprehensive public datasets
+  - to generate a new public mutagenicity training dataset focusing on *Salmonella typhimurium*, by combining the most comprehensive public datasets
   - to compare the performance of MolPrint2D (*MP2D*) fingerprints with Chemistry Development Kit (*CDK*) descriptors for mutagenicity predictions
   - to compare the performance of global QSAR models (random forests (*RF*), support vector machines (*SVM*), logistic regression (*LR*), neural nets (*NN*)) with local models (`lazar`)
 
@@ -109,20 +109,23 @@ In mammals, PAs are mainly metabolized in the liver. There are three principal m
 
 - Detoxification by hydrolysis of the ester bond on positions C7 and C9 by non-specific esterases to release necine base and necic acid. 
 
-- N-oxidation of the necine base to form a PA N-oxides, which can be either conjugated by phase II enzymes and then excreted or converted back into the corresponding parent PA (@Wang2005). This detoxification pathway is not possible for otonecine-type PAs, as they are N-methylated (see @fig:pa-schema).
+- N-oxidation of the necine base to form PA N-oxides, which can be either conjugated by phase II enzymes and then excreted or converted back into the corresponding parent PA (@Wang2005). This detoxification pathway is not possible for otonecine-type PAs, as they are N-methylated (see @fig:pa-schema).
 
 - Metabolic activation or toxification by oxidation (for retronecine-type PAs) or oxidative N-demethylation (for otonecine-type Pas) by cytochromes P450 isoforms CYP2B and 3A (@Lin1998,  @Ruan2014).
 
-The latter reactions result in the formation of dehydropyrrolizidine (DHP) that is highly reactive and causes damage by building adducts with protein, lipids and DNA (@Chen2010). On the other hand, open diesters and macrocyclic PAs have a reduced detoxification due to steric hinderance of the respective esterases (@Ruan2014)
+The latter reactions result in the formation of dehydropyrrolizidine (DHP) that
+is highly reactive and causes damage by building adducts with protein, lipids
+and DNA (@Chen2010). On the other hand, open diesters and macrocyclic PAs have
+a reduced detoxification due to steric hinderance of the respective esterases
+(@Ruan2014).
 
-Therefore the 
-mutagenic probability of PAs is highly dependent on structure of necine
-base and necic acid (@Hadi2021; @Allemang2018, @Louisse2019). However, due to
-limited availability of pure substances, only a limited number of PAs have been
-investigated with regards to their structure-specific mutagenicity and
-experimentally in an Ames test. To overcome this bottleneck, the prediction of
-structure-specific mutagenic probabilities of PAs with different machine learning
-models could provide further insights in the mechanisms.
+Therefore, the mutagenic probability of PAs is highly dependent on the
+structure of necine base and necic acid (@Hadi2021; @Allemang2018,
+@Louisse2019). However, due to limited availability of pure substances, only a
+small number of PAs have been investigated experimentally in an Ames test. To
+overcome this bottleneck, the application of different machine learning models
+to predict mutagenic probabilities based on structures and properties
+could provide further insights into the mutagenicity mechanisms of PAs.
 
 Materials and Methods
 =====================
@@ -142,16 +145,16 @@ training dataset was compiled from the following sources:
 -   EFSA Dataset (695 compounds @EFSA2016): <https://data.europa.eu/euodp/data/storage/f/2017-0719T142131/GENOTOX%20data%20and%20dictionary.xls>
 
 Mutagenicity classifications from Kazius and Hansen datasets were used without
-further processing. According to these publications compounds were classified
-as mutagenic, if at least one positive result has been obtained in *Salmonella
-typhimurium* strains TA98, TA100, TA1535, TA1537, TA97, TA102 and 1538 either
+further processing. According to these publications, compounds were classified
+as mutagenic if at least one positive result has been obtained in *Salmonella
+typhimurium* strains TA97, TA98, TA100, TA102, TA1535, TA1537 and TA1538 either
 with or without metabolic activation by S9. *E. coli* results were not
 considered in these databases. To achieve consistency with these datasets, EFSA
 compounds were classified as mutagenic, if at least one positive result was
-found for TA98 or T100 Salmonella strains either with or without metabolic
-activation. The complete dataset contains chemicals for very diverse
-application areas (e.g. pharmaceuticals, pesticides, industrial chemicals,
-environmental contaminants).
+found for the same *Salmonella* strains either with or without metabolic
+activation and as non-mutagenic if no positive result was found. The complete
+dataset contains chemicals from very diverse application areas (e.g.
+pharmaceuticals, pesticides, industrial chemicals, environmental contaminants).
 
 Dataset merges were based on unique SMILES (*Simplified Molecular Input Line
 Entry Specification*, @Weininger1989) strings of the compound structures.
@@ -159,7 +162,7 @@ Duplicated experimental data with the same outcome was merged into a single
 value, because it is likely that it originated from the same experiment.
 Contradictory results were kept as multiple measurements in the database. The
 combined training dataset contains {{cv.n_uniq}} unique structures and {{cv.n}}
-individual measurements.
+individual measurements. Contradictory results were found for {{cv.n_mult}} substances.
 
 Source code for all data download, extraction and merge operations is publicly
 available from the git repository <https://git.in-silico.ch/mutagenicity-paper>
@@ -170,7 +173,7 @@ under a GPL3 License. The new combined dataset can be found at
 
 The pyrrolizidine alkaloid dataset was created from five independent, necine
 base substructure searches in PubChem (https://pubchem.ncbi.nlm.nih.gov/) and
-compared to the PAs listed in the EFSA publication @EFSA2011 and the book by
+compared to the PAs listed in @EFSA2011 and the book by
 @Mattocks1986, to ensure, that all major PAs were included. PAs
 mentioned in these publications, which were not found in the downloaded
 substances were searched individually in PubChem and, if available, downloaded
@@ -182,7 +185,7 @@ Further details about the compilation of the PA dataset are described in @Schoen
 
 The PAs in the dataset were classified according to structural features. A
 total of 9 different structural features were assigned to the necine base,
-modifications of the necine base and to the necic acid (@fig:pa-schema):
+to modifications of the necine base and to the necic acid (@fig:pa-schema):
 
 ![Structural features of pyrrolizidine alkaloids](figures/PA-Schema.png){#fig:pa-schema}
 
@@ -222,8 +225,8 @@ descriptors. In addition, they allow the efficient calculation of chemical
 similarities (e.g. Tanimoto indices) with simple set operations.
 
 MolPrint2D fingerprints were calculated with the OpenBabel cheminformatics
-library (@OBoyle2011a) for the complete training dataset with {{cv.n}}
-instances. They can be obtained from the following locations:
+library (@OBoyle2011a) for the complete training dataset with {{cv.n_uniq}}
+unique structures. They can be obtained from the following locations:
 
 *Training data:*
 
@@ -244,7 +247,7 @@ for descriptor calculations.
 
 As the training dataset contained {{cv.n}} instances, it was decided to
 delete all instances where CDK descriptor calculations failed during pre-processing. Furthermore,
-all substances with contradictory experimental mutagenicity data were removed. The final training dataset
+{{cv.n_mult}} substances with contradictory experimental results were removed. The final training dataset
 contained {{cv.cdk.n_descriptors}} descriptors for {{cv.cdk.n_compounds}}
 compounds.
 
@@ -272,7 +275,7 @@ following basic workflow: For a given chemical structure `lazar`:
     compound.
 
 This procedure resembles an automated version of read across predictions
-in toxicology, in machine learning terms it would be classified as a
+in toxicology. In machine learning terms it would be classified as a
 k-nearest-neighbour algorithm.
 
 Apart from this basic workflow, `lazar` is completely modular and allows
@@ -399,7 +402,7 @@ used the scikit-learn default values.
 
 #### Logistic regression (SGD) (*LR-sgd*)
 
-For the logistic regression we used an ensemble of five trained models. 
+For the logistic regression we used a combination of five trained models. 
 For each model we used a batch size of 64 and trained for 50 epochs. As 
 an optimizer ADAM was chosen. For the other parameters we used the 
 tensorflow default values.
@@ -411,7 +414,7 @@ default values.
 
 #### Neural Nets (*NN*)
 
-For the neural network we used an ensemble of five trained models. For 
+For the neural network we used a combination of five trained models. For 
 each model we used a batch size of 64 and trained for 50 epochs. As an 
 optimizer ADAM was chosen. The neural network had 4 hidden layers with 
 64 nodes each and a ReLu activation function. For the other parameters 
@@ -876,10 +879,10 @@ however a substantially lower number of mutagenicity predictions, despite
 similar crossvalidation results and we were unable to identify the reasons for
 this discrepancy within this investigation.
 
-Our data show that large difference exist with regard to genotoxic probabilities
+Our data show that large difference exist with regard to mutagenic probabilities
 between different pyrrolizidine subgroups. To adjust risk assessment of
 pyrrolizidine contamination, our data supports a tiered risk assessment based
-on *in silico* and experimental data on the relative potency of individual
+on *in silico* predictions and experimental data of individual
 pyrrolizidine alkaloids.
 
 References