initial cleanupsubmission

author: Christoph Helma <helma@in-silico.ch> 2018-01-26 15:03:42 +0100
committer: Christoph Helma <helma@in-silico.ch> 2018-01-26 15:03:42 +0100
commit: 391042ada12bd0f9be2649b47e8746071354955a (patch)
tree: 57efaa82d4d9768ccd9b38efed89441bf5b9dad6
parent: d32fea79a1b6f1673510f1666bb471e6deb37eff (diff)
1 files changed, 0 insertions, 521 deletions
diff --git a/NestecNestlereportdraft2.txt b/NestecNestlereportdraft2.txt
deleted file mode 100644
index 75c71f3..0000000
--- a/NestecNestlereportdraft2.txt
+++ /dev/null
@@ -1,521 +0,0 @@
-First report: Improving the performance, transparency and application of in silico models applied to risk assessment
-
-in silico toxicology gmbh
-Table of contents
-
-
-Table of contents
-D1. Report documenting the work done
-Rebuilding the original models
-Internal descriptor calculation
-lazar refactoring
-Exploratory analysis of the Swiss Federal Office of Public Health LOAEL database
-Discussions and preprocessing
-Missing and invalid SMILES
-Duplicates in the datasets
-“0” values in measured data
-Comparison of new Swiss datasets with the old LOAEL dataset
-Endpoint selection
-CheS-Mapper analysis of LOAEL datasets
-Graphical user interface
-Docker virtualization
-D4. Report documenting the performance of the new model compared to the old one
-LOAEL validation results
-MOUSE and RAT Carcinogenicity (TD50) validation results
-D5. New models implemented in the internal virtual machine including the batch mode
-New Graphical User Interface (GUI)
-1. Modes
-2. Select the endpoints to be predicted
-3. Start the prediction
-LAZAR Docker Service Environment
-Docker images vs. containers
-References
-________________
-D1. Report documenting the work done
-Rebuilding the original models
-The lazar framework and its underlying computational chemistry libraries have changed substantially since the delivery of the last virtual machine. For this reason it was necessary to recreate the original models with the current lazar version (spring 2015). The initial focus was the LOAEL model.
-It was impossible to reproduce the original results due to a bug in the R code of the original version, which affected the contribution of duplicated structures in the training set and caused slightly too optimistic validation results in the presence of duplicates. Still, the new model performs comparably to the old one (Table 1: Model a vs. Model b) with the same descriptor values.
-
-
-Internal descriptor calculation
-To replace the external descriptor calculation service (Ambit) with internal descriptor calculation, the computational chemistry libraries OpenBabel, CDK and JOELib were added to the lazar framework. This required substantial modifications of the framework (e.g. for the internal generation and caching of 3D structures), but ultimately led to a larger number of successfully calculated descriptors (especially 3D descriptors, see Table 1, Model c).
-
-
-Model
-	Dataset
-	Model
-	Features
-	r²
	Number of unpredicted compounds (of 439)
-	a)
-	LOAEL-mol
-	original
-	original
-	0.507
-	N/A
-	b)
-	
-
-	new
-	original
-	0.5
-	20
-	c)
-	
-
-	new
-	new
-	0.506
-	12
	Table 1: Leave-one-out validation for the LOAEL model
-
-
-lazar refactoring
-While working with the LOAEL models and other datasets we realized that the current lazar version has several scalability and performance problems, especially for large datasets. In order to address these issues we had to revise previous design decisions (e.g. RDF de/serialization, separate webservices for each OpenTox object) and refactor the lazar framework. 
-This was a major effort, because it required a complete restructuring of the codebase and a rewrite of large parts of the code. Ultimately we were able to:
-* speed up prediction (and validation) substantially (> 1000 times faster)
-* remove system bottlenecks for large datasets
-* improve modularization for experiments with similarity and prediction algorithms
-* simplify the installation process
-* reduce and simplify the codebase to improve maintenance
-Despite the major rewrite, lazar still follows the same basic read-across principle of searching for similar compounds and using their experimental data to build a local model. At the moment, the following algorithm modifications have been made in comparison to the original lazar models:
-* Neighbor search is based on MolPrint2D (Bender 2004) instead of Fminer fragments
-* Regression predictions are obtained from the weighted average of neighbor activities instead of radial SVM models
-These choices can be seen as the simplest algorithms that provide reasonable performance in terms of speed, predictivity and number of predictable compounds. They will serve as a baseline for the investigation of more complex algorithms (e.g. more sophisticated local models).
-
-
-Exploratory analysis of the Swiss Federal Office of Public Health LOAEL database
-Discussions and preprocessing
-
-
-The spreadsheet of the Swiss data has four tables: Codes, rat_chron, mouse_chron and multigen.
-* “Codes” contains descriptions of column names of the following three tables
-* “rat_chron” shows rat study data including identifier and measured values
-* “mouse_chron” shows mouse study data including identifier and measured values
-* “multigen” shows multi-generation rat study data including identifier and measured values
-
-
-The identifier columns common to all study tables are “CASNR”, “CAS name” and “SMILES”.
-All study tables provide a “function” and chemical “class” for the studies. 
-
-
-Missing and invalid SMILES
-Unfortunately, no identifier is complete across all compounds; therefore we focused on SMILES. Missing SMILES were generated from other identifiers when available.
-
-
-study type/ table
-	rat_chron
-	mouse_chron
-	multigen
-	missing SMILES
-	35
-	27
-	31
-	invalid SMILES
-	9
-	6
-	9
-	corrected SMILES
-	44
-	33
-	40
-	Detailed tables:
-https://docs.google.com/spreadsheets/d/14P8F-3iX5gr5FbN7oSeuwabUOr_xdDhhr5KwiUX6LXY/edit?usp=sharing
-
-
-Duplicates in the datasets
-The Swiss data contain duplicate compounds within each study type and also across study types.
-
-
-study type/ table
-	rat_chron
-	mouse_chron
-	multigen
-	all
-	# studies/ compounds
-	578
-	488
-	517
-	1583
-	unique structures
-	428
-	409
-	402
-	439
-	corrected SMILES
-	44
-	33
-	40
-	117
-	unique added SMILES
-	38
-	31
-	35
-	39
-	
-
-This table shows that mostly the same compounds are present across the study types. Over all tables, 39 compounds have invalid or missing SMILES, but all of them could be corrected or created.
-
-
-“0” values in measured data
-Studies with undefined (“0”) or empty entries for “dose value” (endpoint) were removed from the tables.
-
-
-The resulting datasets are still separated by their study type (rat, mouse and multigen). Each table includes the chemical structure as SMILES and all provided “food concentration” and “dose” values. This structure is well suited for further analysis and modelling. Initial modelling tests with lazar were successful.
-
-
-
-
-Comparison of new Swiss datasets with the old LOAEL dataset
-Endpoint selection
-The measured value "LOAEL parental as dose (mg/kg bw per day)" is present in all new study types and was selected as the main endpoint for the data comparison.
-
-
-The following table shows the overlap between datasets. It lists the number of compounds that are common to each pair of datasets. The diagonal shows the total number of compounds and, in parentheses, the number of unique structures.
-
-
-
-
-	LOAEL old
-	LOAEL rat
-	LOAEL mouse
-	LOAEL multigen
-	LOAEL old
-	562 (439)
-	162
-	144
-	140
-	LOAEL rat
-	
-
-	493 (381)
-	322
-	321
-	LOAEL mouse
-	
-
-	
-
-	393 (339)
-	292
-	LOAEL multigen
-	
-
-	
-
-	
-
-	398 (340)
-	
-
-Number of unique structures that are only present in one dataset:
-LOAEL old
-	LOAEL rat
-	LOAEL mouse
-	LOAEL multigen
-	269
-	13
-	4
-	8
-	
-
-Unique structures that are not in LOAEL old:
-
-
-	LOAEL rat
-	LOAEL mouse
-	LOAEL multigen
-	Not in LOAEL old
-	219
-	195
-	200
-	The new LOAEL tables have the majority of structures in common.
-
-
-Many compounds are included in more than one dataset. It is still possible to build models with the data, although the applicability domains will be very similar.
-Since the old LOAEL data are based on rat data, it is probably beneficial to merge them with the Swiss rat data.
-
-
-CheS-Mapper analysis of LOAEL datasets
-
-
-CheS-Mapper (Chemical Space Mapping and Visualization in 3D, http://ches-mapper.org/, Gütlein 2012) can be used to analyze the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. CheS-Mapper embeds a dataset into 3D space, such that compounds with similar feature values are close to each other. 
-We explored the structural and physico-chemical diversity of the different LOAEL datasets in an interactive session together with Elena LoPiparo. The main conclusion was that the old and new LOAEL datasets cover a similar chemical space, and we recommend merging them into a single dataset for LOAEL model development.
-The following two screenshots visualise the comparison. The datasets are embedded into 3D space based on structural fragments from three SMARTS lists (OpenBabel FP3, OpenBabel FP4 and OpenBabel MACCS).
-  
-
-Blue dots represent the Swiss datasets, red dots represent the old dataset.
-
-
-  
-
-Blue dots represent the old dataset, red is mouse, green is multigen and yellow is rat. 
-
-
-Graphical user interface
-To adapt to the new lazar framework, a new graphical interface had to be written. The new version includes batch predictions; a detailed description can be found in section D5.
-Docker virtualization
-To meet the requirements of the IT department, new versions are delivered as Docker images. Our Docker images are now completely self-contained and do not need any external services. Section D5 contains instructions for the installation of IST Docker images and some background information.
-
-
-D4. Report documenting the performance of the new model compared to the old one
-
-
-LOAEL validation results
-
-
-The validation results are based on three independent 10-fold crossvalidations.
-
-
-
-
-model
-	r²
-	Root Mean Square Error
-	Mean Absolute Error
-	number unpredicted 
-	based on
-	LOAEL-mmol (Nestle docker 2015)
-	0.29/0.292/0.294
-	0.905/0.89/0.895
-	0.687/0.688/0.694
-	227/224/225 (of 567)
-	3 x 10 fold cross validations
-	LOAEL-mmol (Nestle VM 2012)
-	0.506
-	0.762
-	0.594
-	N/A
-	LOO cross validation
-	
-
-This table indicates that the predictivity of the latest LOAEL models is still lower than the old LOAEL model and the LOAEL models developed at the beginning of the project. Predictivities can be increased substantially by raising similarity thresholds for neighbors, but this comes at the cost of a larger number of unpredicted compounds. We are presently using a very simple (but fast) regression algorithm (weighted neighbor average) and hope to increase the model accuracy with slightly more complex local regression algorithms (e.g. local linear regression). 
-
-
-MOUSE and RAT Carcinogenicity (TD50) validation results
-
-
-
-
-The validation results (2015) are based on three independent 10-fold crossvalidations.
-
-
-
-
-model
-	r²
-	Root Mean Square Error
-	Mean Absolute Error
-	number unpredicted 
-	based on
-	Rat (Nestle docker 2015)
-	0.28/0.299/0.23
-	1.188/1.156/1.25
-	0.909/0.898/0.928
-	37/38/33 (of 511)
-	3 x 10 fold cross validations
-	Mouse (Nestle docker 2015)
-	0.207/0.232/0.226
-	1.066/1.04/1.045
-	0.779/0.774/0.776
-	31/30/31 (of 402)
-	3 x 10 fold cross validations
-	
-
-The old mouse and rat carcinogenicity models were validated against a specific test set (which usually leads to overfitting and poorer performance for other predictions). Independent leave-one-out crossvalidation experiments with these models showed that their actual r² is lower than 0.1. This indicates that the new models are already much more accurate than the old ones, and we hope to increase the predictivity further with better similarity and regression algorithms.
-
-
-D5. New models implemented in the internal virtual machine including the batch mode
-
-
-IST delivered two versions of the virtual machine (as Docker images) to Nestle's IT department. The following two sections contain detailed descriptions of the graphical user interface and of the installation and maintenance of IST Docker images.
-New Graphical User Interface (GUI)
-
-
-LAZAR comes with a simple GUI (Graphical User Interface) for inputting data and viewing prediction results. It contains two main pages:
-
-
-Start page:
-
-
-1. Modes:
-1.1         Single mode - Input a single compound that should be predicted.
-1.2         Batch mode - Upload a file with several compounds.
-        
-2. Select a model for the endpoints to be predicted
   2.1 Inspect details and validation results of a model
-
-
-  3.         Start the prediction
-
-
-Result page:
-
-
-1. Single mode:
-   1. Overview of the prediction results.
-   2. View a list of the neighbor compounds for each selected model.
-
-
-1. Batch mode:        
-   1. Overview of the prediction results.
-   2. Download the prediction results in a CSV file.
-1. Modes  
-
-
-1.1
-Draw a compound with the JSME Modular Editor from Peter Ertl
-
-
-1.1
-Or insert a SMILES string
-
-
-1.2
-Select and upload a file for batch predictions. This requires a CSV file (select “Export to CSV” in Excel or other spreadsheet programs) with the identifier type (SMILES or InChI) given as the header of the first column, followed by the compounds in that column. All other columns in the file are ignored.
-
-
-Example input:
-
-
-SMILES
-CCCCCCCCOC(=O)C1=CC=C(C(=O)OCCCCCCCC)C=C1
-O=C1NC(=O)NC=C1
-O=C2C1=NC3=C(C=C(C)C(C)=C3)N(C[C@H](O)[C@H](O)[C@H](O)CO)C1=NC(N2)=O
-O=C1C2=C(C=CC=C2)C(=O)C3=C1C=CC=C3
-CCC1=C(Br)C(Br)=C(Br)C(Br)=C1Br
-C1CCCCC1C2CCCCC2
-C1=CC(C)=CC=C1SSC2=CC=C(C)C=C2
-CCCCCCCCCCCCCO
-
-
-2. Select the endpoints to be predicted
-
-
-  
-
-Choose one or more endpoint models.
-
-
-2.1
-
-
-  
-
-Detailed model information can be displayed with the Details | Validation button. It contains details about the training dataset, the modelling algorithm and validation results. Models are validated with three independent 10-fold crossvalidations.
-
-
-3. Start the prediction
-
-
-  
-
-The Predict button starts the prediction process. Calculations may take some time; you will then be directed to the Result page.
-
-
-  
-
-Result page:
-
-
-The Result page appearance depends on the selected prediction mode.
-With “1.1 Single mode” an overview of prediction results will be displayed in a single row:
-
-
-  
-
-The first cell depicts the input structure, the remaining entries display prediction results.
-Predictions are shown in molar and weight units. Measured activities will be displayed, if the training dataset contains the query structure. The confidence value indicates the reliability of the prediction.
-IMPORTANT: At the time of writing this report confidence values are not working properly for regression models! Please inspect the neighbors (see below) to estimate the reliability of predictions.
-
-
-
-
-The following table gives an overview of similar (neighbor) compounds that have been used for the prediction:
-
-
-  
-
-
-
-The table displays the chemical structure, measured activities and similarity of all neighbors. If the training dataset contains duplicated structures with different measured activities, they will be treated as separate neighbors and influence the prediction results accordingly.
-By default, neighbors are sorted by descending similarity; other sorting criteria can be selected by clicking the arrows in the table header.
-Select one of the tabs at the top to switch to another endpoint.
-
-
-With “1.2 Batch mode” the result page is slightly different:
-
-
-  
-
-
-
-Results are displayed in a table with prediction results for all submitted compounds. Each row displays the same information as the overview in “1.1 Single mode”.
-
-
-Additionally, the table can be downloaded as a CSV file (choose “Import CSV” to import it into Excel or other spreadsheet programs).
-
-
-  
-
-
-
-The resulting file contains the same information as the GUI table.
-
-
-LAZAR Docker Service Environment
-
-
-The LAZAR service comes with its own environment that includes everything needed to run the service without changes to your operating system (OS).
-To achieve this we use Docker (www.docker.com), a platform on which you can install and manage a virtual OS on your host machine. Docker is the only program that needs to be installed on the host OS.
-The main benefits of running lazar in a Docker image are independence from the host OS and the ability to version the different development stages of our service.
-
-
-Docker manages different stages of an OS and all included programs and services as images. These images can be shared with the external Docker Hub service (https://hub.docker.com) via SSH access to keep them private.
-
-
-We started with Debian 7 as the OS for the LAZAR service followed by an installation of all necessary programs and tools required for the service. The LAZAR service provided is actually separated into the main LAZAR code and the LAZAR-GUI interface. After installation we took a snapshot of the current state of the virtual OS in order to save all installed programs within this Docker image.
-
-
-docker pull gebele/nestec:v7
-  
-
-
-
-To use this image you have to download the image via the docker platform installed on your host OS and start it by passing some extra flags.
-
-
-docker run -p 8088:8088 -itd gebele/nestec:v7 bash -c "/etc/init.d/supervisor start && /etc/init.d/lazar start"
-
-
--p maps the container's internal port 8088 to any of your host ports (in this case also 8088).
--itd tells Docker to run the container in detached mode (-d) with an interactive console (-i, -t).
-gebele/nestec:v7 is the name of the image followed by a tag (in this case v7, which we use for versioning).
--c tells the bash shell started inside the container to run the command that follows.
-
-
-This creates a running Docker container with the LAZAR service up and running inside. Point your browser to http://localhost:8088/ to see the LAZAR-GUI.
-
-
-  
-
-
-
-Docker images vs. containers
-
-
-To understand the difference between an image and a container, think of images as versions of the service and of containers as the service program you actually work with.
-
-
-Every run command creates a new container from an image. This behavior can be used to run several instances of the LAZAR service in parallel. For example, if you want to compare different development stages, or give several users their own instance of the same version, all you have to do is give each container its own port forwarding to the host by adjusting the -p flag to use different host ports.
-
-
-A running container can be stopped with the docker stop CONTAINER_ID command, which is commonly used to stop an old version before you update and run a new one. If for any reason you want to re-run an older LAZAR version, you can start the container again with docker start CONTAINER_ID; the LAZAR service is started automatically and works as you left it the last time.
-
-
-________________
-References
-Andreas Bender, Hamse Y. Mussa, and Robert C. Glen. Molecular Similarity Searching Using Atom Environments, Information-Based Feature Selection, and a Naive Bayesian Classifier. J. Chem. Inf. Comput. Sci. 2004, 44, 170-178.
-
-
-Gütlein M, Karwath A, Kramer S. CheS-Mapper - Chemical space mapping and visualization in 3D. J Cheminform. 2012;4:7. doi: 10.1186/1758-2946-4-7.
-\ No newline at end of file