From 66543ccf5724f55e39775a1650b7b36381ae5ea9 Mon Sep 17 00:00:00 2001
From: Christoph Helma
Date: Wed, 30 Jun 2021 12:50:00 +0200
Subject: final revision

---
 bibliography.bib |   2 +-
 data.yaml        |   1 +
 mutagenicity.md  |  73 +++++++++++++++++++++++++++++--------------------------
 mutagenicity.pdf | Bin 3301198 -> 3301187 bytes
 scripts/data.rb  |   9 +++++++
 5 files changed, 49 insertions(+), 36 deletions(-)

diff --git a/bibliography.bib b/bibliography.bib
index d69d4a6..ed98a64 100644
--- a/bibliography.bib
+++ b/bibliography.bib
@@ -134,7 +134,7 @@ abstract = {Pyrrolizidine alkaloids (PAs) are among the most potent natural toxi
 }
 
 @misc{ECHA2017,
-  author ={European Chemicals Agency (ECHA)},
+  author ={{European Chemicals Agency (ECHA)}},
   title = {Guidance on Information Requirements and Chemical Safety Assessment, Chapter R.7a: Endpoint specific guidance},
   year = 2017,
   note ={\url{https://echa.europa.eu/documents/10162/13632/information_requirements_r6_en.pdf}},
diff --git a/data.yaml b/data.yaml
index 7eed171..55153be 100644
--- a/data.yaml
+++ b/data.yaml
@@ -238,6 +238,7 @@
   :tnr_perc: 82
   :ppv_perc: 82
   :npv_perc: 82
+  :n_mult: 19
   :n: 8309
   :n_uniq: 8290
   :cdk:
diff --git a/mutagenicity.md b/mutagenicity.md
index eb0ce3c..98e25a5 100644
--- a/mutagenicity.md
+++ b/mutagenicity.md
@@ -62,11 +62,11 @@ Introduction
 ============
 
 The assessment of mutagenicity is an important part in the safety assessment of
-chemical structures, because mutations may lead to cancer and germ
-cells damage. The *Salmonella typhimurium* bacterial reverse mutation
-test (Ames test) is capable to identify substances that cause mutations (e.g.,
-base-pair substitutions, frameshifts, insertions, deletions) and is generally
-used as the first step in genotoxicity and carcinogenicity assessments.
+chemical structures, because mutations may lead to cancer and germ cells
+damage. The bacterial reverse mutation test (Ames test) is capable to identify
+substances that cause mutations (e.g., base-pair substitutions, frameshifts,
+insertions, deletions) and is generally used as the first step in genotoxicity
+and carcinogenicity assessments.
 
 Computer based (*in silico*) mutagenicity predictions can be used in the early
 screening of novel compounds (e.g. drug candidates), but they are also gaining
@@ -75,7 +75,7 @@ REACH (@ECHA2017) or the assessment of impurities in pharmaceuticals (ICH M7
 guideline, Harmonisation of Technical Requirements for Pharmaceuticals for
 Human Use @ICH2017).
 
-Currently, *Salmonella* mutagenicity is the toxicological endpoint with the
+Currently, mutagenicity is the toxicological endpoint with the
 largest amount of public data for almost 10000 structures, whereas datasets for
 other endpoints contain typically only a few hundred compounds. The Ames test
 itself is relatively reproducible with an interlaboratory variability of 80-85%
@@ -89,7 +89,7 @@ of overfitting experimental errors.
 
 Within this study we attempted
 
-  - to generate a new public mutagenicity training dataset, by combining the most comprehensive public datasets
+  - to generate a new public mutagenicity training dataset focusing on *Salmonella typhimurium*, by combining the most comprehensive public datasets
   - to compare the performance of MolPrint2D (*MP2D*) fingerprints with Chemistry Development Kit (*CDK*) descriptors for mutagenicity predictions
   - to compare the performance of global QSAR models (random forests (*RF*), support vector machines (*SVM*), logistic regression (*LR*), neural nets (*NN*)) with local models (`lazar`)
 
@@ -109,20 +109,23 @@ In mammals, PAs are mainly metabolized in the liver. There are three principal m
 
 - Detoxification by hydrolysis of the ester bond on positions C7 and C9 by non-specific esterases to release necine base and necic acid.
 
-- N-oxidation of the necine base to form a PA N-oxides, which can be either conjugated by phase II enzymes and then excreted or converted back into the corresponding parent PA (@Wang2005). This detoxification pathway is not possible for otonecine-type PAs, as they are N-methylated (see @fig:pa-schema).
+- N-oxidation of the necine base to form PA N-oxides, which can be either conjugated by phase II enzymes and then excreted or converted back into the corresponding parent PA (@Wang2005). This detoxification pathway is not possible for otonecine-type PAs, as they are N-methylated (see @fig:pa-schema).
 
 - Metabolic activation or toxification by oxidation (for retronecine-type PAs) or oxidative N-demethylation (for otonecine-type Pas) by cytochromes P450 isoforms CYP2B and 3A (@Lin1998, @Ruan2014).
 
-The latter reactions result in the formation of dehydropyrrolizidine (DHP) that is highly reactive and causes damage by building adducts with protein, lipids and DNA (@Chen2010). On the other hand, open diesters and macrocyclic PAs have a reduced detoxification due to steric hinderance of the respective esterases (@Ruan2014)
+The latter reactions result in the formation of dehydropyrrolizidine (DHP) that
+is highly reactive and causes damage by building adducts with protein, lipids
+and DNA (@Chen2010). On the other hand, open diesters and macrocyclic PAs have
+a reduced detoxification due to steric hinderance of the respective esterases
+(@Ruan2014).
 
-Therefore the
-mutagenic probability of PAs is highly dependent on structure of necine
-base and necic acid (@Hadi2021; @Allemang2018, @Louisse2019). However, due to
-limited availability of pure substances, only a limited number of PAs have been
-investigated with regards to their structure-specific mutagenicity and
-experimentally in an Ames test. To overcome this bottleneck, the prediction of
-structure-specific mutagenic probabilities of PAs with different machine learning
-models could provide further insights in the mechanisms.
+Therefore, the mutagenic probability of PAs is highly dependent on the
+structure of necine base and necic acid (@Hadi2021; @Allemang2018,
+@Louisse2019). However, due to limited availability of pure substances, only a
+small number of PAs have been investigated experimentally in an Ames test. To
+overcome this bottleneck, the application of different machine learning models
+to predict mutagenic probabilities based on structures and properties
+could provide further insights into the mutagenicity mechanisms of PAs.
 
 Materials and Methods
 =====================
@@ -142,16 +145,16 @@ training dataset was compiled from the following sources:
   - EFSA Dataset (695 compounds @EFSA2016):
 
 Mutagenicity classifications from Kazius and Hansen datasets were used without
-further processing. According to these publications compounds were classified
-as mutagenic, if at least one positive result has been obtained in *Salmonella
-typhimurium* strains TA98, TA100, TA1535, TA1537, TA97, TA102 and 1538 either
+further processing. According to these publications, compounds were classified
+as mutagenic if at least one positive result has been obtained in *Salmonella
+typhimurium* strains TA97, TA98, TA100, TA102, TA1535, TA1537 and TA1538 either
 with or without metabolic activation by S9. *E. coli* results were not
 considered in these databases. To achieve consistency with these datasets, EFSA
 compounds were classified as mutagenic, if at least one positive result was
-found for TA98 or T100 Salmonella strains either with or without metabolic
-activation. The complete dataset contains chemicals for very diverse
-application areas (e.g. pharmaceuticals, pesticides, industrial chemicals,
-environmental contaminants).
+found for the same *Salmonella* strains either with or without metabolic
+activation and as non-mutagenic if no positive result was found. The complete
+dataset contains chemicals from very diverse application areas (e.g.
+pharmaceuticals, pesticides, industrial chemicals, environmental contaminants).
 
 Dataset merges were based on unique SMILES (*Simplified Molecular Input Line
 Entry Specification*, @Weininger1989) strings of the compound structures.
@@ -159,7 +162,7 @@ Duplicated experimental data with the same outcome was merged into a single
 value, because it is likely that it originated from the same experiment.
 Contradictory results were kept as multiple measurements in the database. The
 combined training dataset contains {{cv.n_uniq}} unique structures and {{cv.n}}
-individual measurements.
+individual measurements. Contradictory results were found for {{cv.n_mult}} substances.
 
 Source code for all data download, extraction and merge operations is publicly
 available from the git repository
@@ -170,7 +173,7 @@ under a GPL3 License. The new combined dataset can be found at
 
 The pyrrolizidine alkaloid dataset was created from five independent, necine
 base substructure searches in PubChem (https://pubchem.ncbi.nlm.nih.gov/) and
-compared to the PAs listed in the EFSA publication @EFSA2011 and the book by
+compared to the PAs listed in @EFSA2011 and the book by
 @Mattocks1986, to ensure, that all major PAs were included. PAs mentioned in
 these publications, which were not found in the downloaded substances were
 searched individually in PubChem and, if available, downloaded
@@ -182,7 +185,7 @@ Further details about the compilation of the PA dataset are described in @Schoen
 
 The PAs in the dataset were classified according to structural features.
 A total of 9 different structural features were assigned to the necine base,
-modifications of the necine base and to the necic acid (@fig:pa-schema):
+to modifications of the necine base and to the necic acid (@fig:pa-schema):
 
 ![Structural features of pyrrolizidine alkaloids](figures/PA-Schema.png){#fig:pa-schema}
 
@@ -222,8 +225,8 @@ descriptors. In addition, they allow the efficient calculation of chemical
 similarities (e.g. Tanimoto indices) with simple set operations.
 
 MolPrint2D fingerprints were calculated with the OpenBabel cheminformatics
-library (@OBoyle2011a) for the complete training dataset with {{cv.n}}
-instances. They can be obtained from the following locations:
+library (@OBoyle2011a) for the complete training dataset with {{cv.n_uniq}}
+unique structures. They can be obtained from the following locations:
 
 *Training data:*
 
@@ -244,7 +247,7 @@ for descriptor calculations.
 
 As the training dataset contained {{cv.n}} instances, it was decided to delete
 all instances where CDK descriptor calculations failed during pre-processing. Furthermore,
-all substances with contradictory experimental mutagenicity data were removed. The final training dataset
+{{cv.n_mult}} substances with contradictory experimental results were removed. The final training dataset
 contained {{cv.cdk.n_descriptors}} descriptors for {{cv.cdk.n_compounds}}
 compounds.
 
@@ -272,7 +275,7 @@ following basic workflow: For a given chemical structure `lazar`:
 compound.
 
 This procedure resembles an automated version of read across predictions
-in toxicology, in machine learning terms it would be classified as a
+in toxicology. In machine learning terms it would be classified as a
 k-nearest-neighbour algorithm.
 
 Apart from this basic workflow, `lazar` is completely modular and allows
@@ -399,7 +402,7 @@ used the scikit-learn default values.
 
 #### Logistic regression (SGD) (*LR-sgd*)
 
-For the logistic regression we used an ensemble of five trained models.
+For the logistic regression we used a combination of five trained models.
 For each model we used a batch size of 64 and trained for 50 epochs. As
 an optimizer ADAM was chosen. For the other parameters we used the
 tensorflow default values.
@@ -411,7 +414,7 @@ default values.
 
 #### Neural Nets (*NN*)
 
-For the neural network we used an ensemble of five trained models. For
+For the neural network we used a combination of five trained models. For
 each model we used a batch size of 64 and trained for 50 epochs. As an
 optimizer ADAM was chosen. The neural network had 4 hidden layers with
 64 nodes each and a ReLu activation function. For the other parameters
@@ -876,10 +879,10 @@ however a substantially lower number of mutagenicity predictions, despite
 similar crossvalidation results and we were unable to identify the reasons for
 this discrepancy within this investigation.
 
-Our data show that large difference exist with regard to genotoxic probabilities
+Our data show that large difference exist with regard to mutagenic probabilities
 between different pyrrolizidine subgroups. To adjust risk assessment of
 pyrrolizidine contamination, our data supports a tiered risk assessment based
-on *in silico* and experimental data on the relative potency of individual
+on *in silico* predictions and experimental data of individual
 pyrrolizidine alkaloids.
 
 References
diff --git a/mutagenicity.pdf b/mutagenicity.pdf
index 32464af..03238c4 100644
Binary files a/mutagenicity.pdf and b/mutagenicity.pdf differ
diff --git a/scripts/data.rb b/scripts/data.rb
index 72e6b28..d24e46b 100755
--- a/scripts/data.rb
+++ b/scripts/data.rb
@@ -6,6 +6,15 @@ data = {}
   data.merge!(YAML.load_file(File.join(dir,"summary.yaml")))
 end
 
+mut = {}
+File.readlines("mutagenicity/mutagenicity.csv").each do |line|
+  smi, m = line.chomp.split(",")
+  mut[smi] ||= []
+  mut[smi] << m
+end
+
+data[:cv][:n_mult] = mut.select{|s,m| m.size > 1}.size
+
 data[:cv][:n] = `cut -f1 -d ',' mutagenicity/mutagenicity.csv | wc -l`.chomp.to_i - 1
 data[:cv][:n_uniq] = `cut -f1 -d ',' mutagenicity/mutagenicity.csv | sort -u | wc -l`.chomp.to_i - 1
 
-- 
cgit v1.2.3
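
Note on the `scripts/data.rb` hunk: the added lines count SMILES strings that occur on more than one row of `mutagenicity/mutagenicity.csv` and store that count as `n_mult`. The following is a minimal Ruby sketch of the same bookkeeping on an in-memory table, offered only as an illustration. The helper name `measurement_counts`, the sample rows and the extra `n_contradictory` figure are assumptions made for this example and are not part of the commit.

```ruby
# Illustrative sketch only: `measurement_counts` is a hypothetical helper
# and does not exist in scripts/data.rb.
# Input: an array of "SMILES,label" lines whose first line is a header.
def measurement_counts(lines)
  by_smiles = Hash.new { |hash, key| hash[key] = [] }

  lines.drop(1).each do |line| # drop(1) skips the header row
    smi, label = line.chomp.split(",", 2)
    next if smi.nil? || smi.empty? # ignore blank lines
    by_smiles[smi] << label
  end

  {
    # total number of measurement rows
    n: by_smiles.values.sum(&:size),
    # number of distinct SMILES strings
    n_uniq: by_smiles.size,
    # SMILES with more than one row (what the patch stores as n_mult)
    n_mult: by_smiles.count { |_smi, labels| labels.size > 1 },
    # SMILES whose rows actually disagree (extra check, assumed here)
    n_contradictory: by_smiles.count { |_smi, labels| labels.uniq.size > 1 }
  }
end

# Made-up sample rows, not taken from the real dataset.
sample = <<~TABLE
  SMILES,Mutagenicity
  CCO,non-mutagenic
  CCO,non-mutagenic
  Nc1ccccc1,mutagenic
  Nc1ccccc1,non-mutagenic
  CC(=O)O,non-mutagenic
TABLE

counts = measurement_counts(sample.lines)
counts.each { |name, value| puts "#{name}=#{value}" }
```

In this sketch `n_mult` and `n_contradictory` differ whenever a structure has repeated rows with the same outcome. The two figures coincide for the script in the patch only if same-outcome duplicates were already merged before the CSV was written, which is what `mutagenicity.md` describes.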