
First report: Improving the performance, transparency and application of in silico models applied to risk assessment

in silico toxicology gmbh
Table of contents
D1. Report documenting the work done
Rebuilding the original models
Internal descriptor calculation
lazar refactoring
Exploratory analysis of the Swiss Federal Office of Public Health LOAEL database
Discussions and preprocessing
Missing and invalid SMILES
Duplicates in the datasets
“0” values in measured data
Comparison of new swiss datasets with the old LOAEL dataset
Endpoint selection
CheS-Mapper analysis of LOAEL datasets
Graphical user interface
Docker virtualization
D4. Report documenting the performance of the new model compared to the old one
LOAEL validation results
MOUSE and RAT Carcinogenicity (TD50) validation results
D5. New models implemented in the internal virtual machine including the batch mode
New Graphical User Interface (GUI)
1. Modes
2. Select the endpoints to be predicted
3. Start the prediction
LAZAR Docker Service Environment
Docker images vs. containers
References
________________
D1. Report documenting the work done
Rebuilding the original models
The lazar framework and its underlying computational chemistry libraries have changed substantially since the delivery of the last virtual machine. It was therefore necessary to recreate the original models with the current lazar version (spring 2015). The initial focus was the LOAEL model.
The original results could not be reproduced exactly because of a bug in the R code of the original version. The bug affected the contribution of duplicated structures in the training set and led to slightly too optimistic validation results in the presence of duplicates. Nevertheless, with the same descriptor values the new model performs comparably to the old one (Table 1: Model a vs. Model b).


Internal descriptor calculation
In order to replace the external descriptor calculation service (Ambit) with internal descriptor calculation, the computational chemistry libraries OpenBabel, CDK and JOELib were added to the lazar framework. This required substantial modifications of the framework (e.g. for the internal generation and caching of 3D structures), but ultimately led to a larger number of successfully calculated descriptors (especially 3D descriptors, see Table 1, Model c).


Model	Dataset	Model version	Features	r²	Number of unpredicted compounds (of 439)
a)	LOAEL-mol	original	original	0.507	N/A
b)	LOAEL-mol	new	original	0.5	20
c)	LOAEL-mol	new	new	0.506	12
Table 1: Leave-one-out validation for the LOAEL model


lazar refactoring
While working with the LOAEL models and other datasets we realized that the lazar version in use at the time had several scalability and performance problems, especially for large datasets. To address these issues we had to revise previous design decisions (e.g. RDF de/serialization, separate webservices for each OpenTox object) and refactor the lazar framework.
This was a major effort, because it required a complete restructuring of the codebase and a rewrite of large parts of the code. Ultimately we were able to
* speed up prediction (and validation) substantially (> 1000 times faster)
* remove system bottlenecks for large datasets
* improve modularization for experiments with similarity and prediction algorithms
* simplify the installation process
* reduce and simplify the codebase to improve maintainability
Despite the major rewrite, lazar still follows the same basic read-across principle of searching for similar compounds and using their experimental data to build a local model. At the moment the new models differ from the original lazar models in the following algorithms:
* Neighbor search is based on MolPrint2D (Bender 2004) instead of Fminer fragments
* Regression predictions are obtained from the weighted average of neighbor activities instead of radial SVM models
These are the simplest algorithms that provide reasonable performance in terms of speed, predictivity and number of predictable compounds. They will serve as a baseline for the investigation of more complex algorithms (e.g. more sophisticated local models).
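
The weighted neighbor average can be illustrated with a minimal Python sketch. It is not the lazar implementation: we assume here that fingerprints are plain sets of feature labels, that similarity is the Tanimoto coefficient, that the similarities themselves serve as weights, and that the threshold value is a placeholder.

```python
from typing import Iterable, Optional, Set, Tuple


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def weighted_average_prediction(
    query: Set[str],
    training: Iterable[Tuple[Set[str], float]],
    min_similarity: float = 0.3,  # hypothetical threshold, not lazar's value
) -> Optional[float]:
    """Predict an activity as the similarity-weighted mean of neighbor activities.

    Returns None when no training compound reaches min_similarity,
    i.e. the query compound counts as "unpredicted".
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for fingerprint, activity in training:
        sim = tanimoto(query, fingerprint)
        if sim >= min_similarity:
            weighted_sum += sim * activity
            weight_total += sim
    if weight_total == 0.0:
        return None
    return weighted_sum / weight_total


# Toy data: fingerprints are sets of made-up atom-environment labels.
training_set = [
    ({"C;1", "O;1", "N;2"}, 2.0),
    ({"C;1", "O;1"}, 1.0),
    ({"S;1", "Cl;1"}, 5.0),
]
prediction = weighted_average_prediction({"C;1", "O;1", "N;2"}, training_set)
no_neighbors = weighted_average_prediction({"X;9"}, training_set)
```

In this toy example the first two training compounds are neighbors of the query and the third is not. A query without any neighbor above the threshold yields no prediction, which is how raising the threshold increases the number of unpredicted compounds.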


Exploratory analysis of the Swiss Federal Office of Public Health LOAEL database
Discussions and preprocessing


The spreadsheet with the Swiss data has four tables: Codes, rat_chron, mouse_chron and multigen.
* “Codes” contains descriptions of column names of the following three tables
* “rat_chron” shows rat study data including identifier and measured values
* “mouse_chron” shows mouse study data including identifier and measured values
* “multigen” shows multi-generation rat study data including identifier and measured values


The identifier columns common to all study tables are “CASNR”, “CAS name” and “SMILES”.
All study tables also provide a “function” and a chemical “class” for the studies.


Missing and invalid SMILES
Unfortunately no identifier is complete across all compounds, therefore we focused on SMILES. Missing SMILES were generated from other identifiers when available.


study type/table	rat_chron	mouse_chron	multigen
missing SMILES	35	27	31
invalid SMILES	9	6	9
corrected SMILES	44	33	40
Detailed tables:
https://docs.google.com/spreadsheets/d/14P8F-3iX5gr5FbN7oSeuwabUOr_xdDhhr5KwiUX6LXY/edit?usp=sharing


Duplicates in the datasets
The Swiss data contain duplicate compounds within each study type and also across study types.


study type/table	rat_chron	mouse_chron	multigen	all
# studies/compounds	578	488	517	1583
unique structures	428	409	402	439
corrected SMILES	44	33	40	117
unique added SMILES	38	31	35	39

This table shows that the study types contain mostly the same compounds. Across all tables, 39 unique compounds had missing or invalid SMILES; all of them could be corrected or generated from other identifiers.
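
The difference between the number of studies and the number of unique structures can be sketched in Python as follows. This is only an illustration: we assume that the SMILES strings have already been canonicalized by a chemistry toolkit (otherwise string comparison would miss duplicates), and the rows and values are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_structure(rows: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Collect all measured values per structure key (e.g. canonical SMILES)."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for smiles, value in rows:
        groups[smiles].append(value)
    return dict(groups)


# Toy rows (structure, measured value); the values are invented.
rat_rows = [("CCO", 10.0), ("CCO", 25.0), ("c1ccccc1", 3.0), ("CCN", 7.5)]

groups = group_by_structure(rat_rows)
n_studies = len(rat_rows)   # number of studies (table rows)
n_unique = len(groups)      # number of unique structures
# Structures that occur in more than one study, with all their values.
duplicated = {s: v for s, v in groups.items() if len(v) > 1}
```

Keeping all values per structure, instead of dropping repeated rows, preserves the information that duplicated structures can carry different measured activities.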


“0” values in measured data
Studies with undefined (“0”) or empty entries for the “dose value” (endpoint) were removed from the tables.


The resulting datasets are still separated by study type (rat, mouse and multigen). Each table includes the chemical structure as SMILES and all provided “food concentration” and “dose” values. This structure is well suited for further analysis and modelling. Initial modelling tests with lazar were successful.
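
The removal of “0” and empty entries described above could look like the following Python sketch. The column name "dose" and the sample rows are placeholders, and treating non-numeric entries as undefined is an assumption of this sketch rather than a documented preprocessing rule.

```python
import csv
import io
from typing import Dict, List


def drop_undefined_doses(rows: List[Dict[str, str]], column: str) -> List[Dict[str, str]]:
    """Keep only rows whose dose entry is non-empty and not zero."""
    kept = []
    for row in rows:
        raw = (row.get(column) or "").strip()
        if not raw:
            continue  # empty entry
        try:
            value = float(raw)
        except ValueError:
            continue  # non-numeric entry, treated as undefined in this sketch
        if value == 0.0:
            continue  # "0" marks an undefined value
        kept.append(row)
    return kept


# Toy table; the column name "dose" and the values are placeholders.
sample = io.StringIO("SMILES,dose\nCCO,12.5\nCCN,0\nCCC,\nc1ccccc1,3\n")
rows = list(csv.DictReader(sample))
clean = drop_undefined_doses(rows, "dose")
```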


Comparison of new swiss datasets with the old LOAEL dataset
Endpoint selection
The measured value "LOAEL parental as dose (mg/kg bw per day)" is present in all new study types and was selected as the main endpoint for the data comparison.


The following table shows the overlap between the datasets. Each off-diagonal cell lists the number of compounds that two datasets have in common. The diagonal shows the total number of compounds and, in parentheses, the number of unique structures.


	LOAEL old	LOAEL rat	LOAEL mouse	LOAEL multigen
LOAEL old	562 (439)	162	144	140
LOAEL rat		493 (381)	322	321
LOAEL mouse			393 (339)	292
LOAEL multigen				398 (340)

Number of unique structures that are present in only one dataset:
LOAEL old	LOAEL rat	LOAEL mouse	LOAEL multigen
269	13	4	8

Unique structures that are not in LOAEL old:


	LOAEL rat	LOAEL mouse	LOAEL multigen
Not in LOAEL old	219	195	200
The new LOAEL tables have the majority of structures in common.
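
The three kinds of counts reported in these tables (pairwise overlap, structures present in only one dataset, structures not in the old dataset) are plain set operations. A minimal Python sketch with toy structure keys, assuming each dataset has already been reduced to a set of unique canonical structures:

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def overlap_counts(datasets: Dict[str, Set[str]]) -> Dict[Tuple[str, str], int]:
    """Number of shared structures for every pair of datasets."""
    return {
        (a, b): len(datasets[a] & datasets[b])
        for a, b in combinations(datasets, 2)
    }


def only_in(name: str, datasets: Dict[str, Set[str]]) -> Set[str]:
    """Structures that occur in the named dataset and in no other one."""
    others = set().union(*(s for n, s in datasets.items() if n != name))
    return datasets[name] - others


# Toy structure sets; the letters stand in for canonical structure keys.
toy = {
    "old": {"A", "B", "C", "D"},
    "rat": {"B", "C", "E"},
    "mouse": {"C", "E", "F"},
}
pairs = overlap_counts(toy)
old_only = only_in("old", toy)
rat_not_in_old = toy["rat"] - toy["old"]
```

The same relation can be checked against the tables above: for each new dataset, the number of unique structures minus the overlap with LOAEL old equals the "Not in LOAEL old" count (e.g. 381 - 162 = 219 for rat).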


Many compounds are included in more than one dataset. It is nevertheless possible to build separate models from the data, although their applicability domains will be very similar.
Since the old LOAEL data is based on rat studies, it is probably beneficial to merge it with the Swiss rat data.


CheS-Mapper analysis of LOAEL datasets


CheS-Mapper (Chemical Space Mapping and Visualization in 3D, http://ches-mapper.org/, Gütlein 2012) can be used to analyze the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. CheS-Mapper embeds a dataset into 3D space, such that compounds with similar feature values are close to each other. 
We explored the structural and physico-chemical diversity of the different LOAEL datasets in an interactive session together with Elena LoPiparo. The main conclusion was that the old and new LOAEL datasets cover a similar chemical space, and we recommend merging them into a single dataset for LOAEL model development.
The following two screenshots visualise the comparison. The datasets are embedded into 3D space based on structural fragments from three SMARTS lists (OpenBabel FP3, OpenBabel FP4 and OpenBabel MACCS).
  

Blue dots represent the Swiss datasets, red dots represent the old dataset.


Blue dots represent the old dataset, red is mouse, green is multigen and yellow is rat. 


Graphical user interface
In order to adapt to the new lazar framework, a new graphical interface had to be written. The new version includes batch predictions; a detailed description can be found in section D5.
Docker virtualization
In order to meet the requirements of the IT department, new versions are delivered as docker images. Our docker images are now completely self-contained and do not need any external services. Section D5 contains instructions for the installation of IST docker images and some background information.


D4. Report documenting the performance of the new model compared to the old one


LOAEL validation results


The validation results are based on three independent 10-fold cross-validations.


model	r²	Root Mean Square Error	Mean Absolute Error	number unpredicted	based on
LOAEL-mmol (Nestle docker 2015)	0.29/0.292/0.294	0.905/0.89/0.895	0.687/0.688/0.694	227/224/225 (of 567)	3 x 10-fold cross-validations
LOAEL-mmol (Nestle VM 2012)	0.506	0.762	0.594	N/A	LOO cross-validation

This table indicates that the predictivity of the latest LOAEL models is still lower than that of the old LOAEL model and of the LOAEL models developed at the beginning of the project. Predictivity can be increased substantially by raising the similarity threshold for neighbors, but this comes at the cost of a larger number of unpredicted compounds. We are presently using a very simple (but fast) regression algorithm (weighted neighbor average) and hope to increase the model accuracy with slightly more complex local regression algorithms (e.g. local linear regression).
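
For reference, the statistics in the table can be computed as in the following Python sketch. It is an illustration under assumptions: r² is taken here as the squared Pearson correlation between measured and predicted values (the report does not state which definition was used), and unpredicted compounds are represented by None and excluded from the error statistics.

```python
import math
from typing import Dict, Optional, Sequence


def evaluate(measured: Sequence[float],
             predicted: Sequence[Optional[float]]) -> Dict[str, float]:
    """Return r², RMSE, MAE and the number of unpredicted compounds."""
    pairs = [(m, p) for m, p in zip(measured, predicted) if p is not None]
    n = len(pairs)
    if n == 0:
        raise ValueError("no predicted compounds to evaluate")
    errors = [p - m for m, p in pairs]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mean_m = sum(m for m, _ in pairs) / n
    mean_p = sum(p for _, p in pairs) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in pairs)
    var_m = sum((m - mean_m) ** 2 for m, _ in pairs)
    var_p = sum((p - mean_p) ** 2 for _, p in pairs)
    # Squared Pearson correlation; undefined if either series is constant.
    r2 = (cov * cov) / (var_m * var_p) if var_m > 0 and var_p > 0 else float("nan")
    return {"r2": r2, "rmse": rmse, "mae": mae, "unpredicted": len(measured) - n}


# Toy values; the last compound has no prediction.
result = evaluate([1.0, 2.0, 3.0, 4.0, 5.0], [1.5, 2.0, 2.5, 4.0, None])
```

Because unpredicted compounds do not enter r², RMSE or MAE, a stricter similarity threshold can improve these statistics while covering fewer compounds, which is why the number of unpredicted compounds is reported alongside them.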


MOUSE and RAT Carcinogenicity (TD50) validation results


The validation results (2015) are based on three independent 10-fold cross-validations.


model	r²	Root Mean Square Error	Mean Absolute Error	number unpredicted	based on
Rat (Nestle docker 2015)	0.28/0.299/0.23	1.188/1.156/1.25	0.909/0.898/0.928	37/38/33 (of 511)	3 x 10-fold cross-validations
Mouse (Nestle docker 2015)	0.207/0.232/0.226	1.066/1.04/1.045	0.779/0.774/0.776	31/30/31 (of 402)	3 x 10-fold cross-validations

The old mouse and rat carcinogenicity models were validated against a specific test set (which usually leads to overfitting and poorer performance for other predictions). Independent leave-one-out cross-validation experiments with these models showed that their actual r² is lower than 0.1. This indicates that the new models are already much more accurate than the old ones, and we hope to increase the predictivity further with better similarity and regression algorithms.
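
The two validation schemes mentioned in this section differ only in how the data is split. The following Python sketch shows both; the shuffling, the seeds and the fold assignment are assumptions for illustration and do not describe how lazar assigns folds internally.

```python
import random
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]


def k_fold_splits(n_items: int, k: int, seed: int) -> Iterator[Split]:
    """Yield (train_indices, test_indices) for one shuffled k-fold run."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def leave_one_out_splits(n_items: int) -> Iterator[Split]:
    """Yield (train_indices, test_indices) with a single test item each."""
    for i in range(n_items):
        yield [j for j in range(n_items) if j != i], [i]


# Three independent 10-fold runs differ only in the shuffling seed.
runs = [list(k_fold_splits(n_items=25, k=10, seed=s)) for s in (1, 2, 3)]
loo = list(leave_one_out_splits(4))
```

In both schemes every compound is predicted exactly once per run by a model that was not trained on it, in contrast to validation against one fixed test set.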


D5. New models implemented in the internal virtual machine including the batch mode


IST delivered two versions of the virtual machine (as docker images) to Nestle's IT department. The following two sections contain detailed descriptions of the graphical user interface and of the installation and maintenance of IST docker images.
New Graphical User Interface (GUI)


LAZAR comes with a simple GUI (Graphical User Interface) for inputting data and viewing prediction results. It contains two main pages:


Start page:


1. Modes:
   1.1 Single mode - Input a single compound that should be predicted.
   1.2 Batch mode - Upload a file with several compounds.
2. Select a model for the endpoints to be predicted
   2.1 Inspect details and validation results of a model
3. Start the prediction


Result page:


1. Single mode:
   1. Overview of the prediction results.
   2. View a list of the neighbor compounds for each selected model.
2. Batch mode:
   1. Overview of the prediction results.
   2. Download the prediction results in a CSV file.
1. Modes  


1.1
Draw a compound with the JSME Molecule Editor by Peter Ertl


1.1
Or insert a SMILES string


1.2
Select and upload a file for batch predictions. Batch mode requires a CSV file (select “Export to CSV” in Excel or other spreadsheet programs) in which the first header cell defines the identifier type (SMILES or InChI) and the first column contains the compounds. All other columns of the file are ignored.


Example input:


SMILES
CCCCCCCCOC(=O)C1=CC=C(C(=O)OCCCCCCCC)C=C1
O=C1NC(=O)NC=C1
O=C2C1=NC3=C(C=C(C)C(C)=C3)N(C[C@H](O)[C@H](O)[C@H](O)CO)C1=NC(N2)=O
O=C1C2=C(C=CC=C2)C(=O)C3=C1C=CC=C3
CCC1=C(Br)C(Br)=C(Br)C(Br)=C1Br
C1CCCCC1C2CCCCC2
C1=CC(C)=CC=C1SSC2=CC=C(C)C=C2
CCCCCCCCCCCCCO
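
The input format described above can be checked before uploading with a small Python sketch like the following. It only mirrors the rules stated in this section (identifier type in the first header cell, compounds in the first column, other columns ignored); the function name is hypothetical and details such as case sensitivity of the header are assumptions.

```python
import csv
import io
from typing import List, Tuple


def read_batch_input(text: str) -> Tuple[str, List[str]]:
    """Return (identifier type, compounds) from batch-mode CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if not header or header[0].strip() not in ("SMILES", "InChI"):
        raise ValueError("first header must be SMILES or InChI")
    id_type = header[0].strip()
    # Only the first column is used; empty lines are skipped.
    compounds = [row[0].strip() for row in reader if row and row[0].strip()]
    return id_type, compounds


# Toy file with an extra column that is ignored and one empty line.
example = (
    "SMILES,ignored\n"
    "O=C1NC(=O)NC=C1,a\n"
    "CCCCCCCCCCCCCO,\n"
    "\n"
    "C1CCCCC1C2CCCCC2,b\n"
)
id_type, compounds = read_batch_input(example)
```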


2. Select the endpoints to be predicted


Choose one or more endpoint models.


2.1


Detailed model information can be displayed with the Details | Validation button. It covers the training dataset, the modelling algorithm and the validation results. Models are validated with three independent 10-fold cross-validations.


3. Start the prediction


The Predict button starts the prediction process. Calculations may take some time; you will then be directed to the Result page.


Result page:


The Result page appearance depends on the selected prediction mode.
With “1.1 Single mode” an overview of prediction results will be displayed in a single row:


The first cell depicts the input structure, the remaining entries display prediction results.
Predictions are shown in molar and weight units. Measured activities are displayed if the training dataset contains the query structure. The confidence value indicates the reliability of the prediction.
IMPORTANT: At the time of writing this report, confidence values are not working properly for regression models! Please inspect the neighbors (see below) to estimate the reliability of predictions.
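
The relation between the molar and weight units is a multiplication by the molecular weight: a value in mmol/kg bw/day times the molecular weight in g/mol gives mg/kg bw/day. A minimal Python sketch, assuming values on a linear (not logarithmic) scale; the function names are hypothetical:

```python
def mmol_to_mg(value_mmol: float, molecular_weight: float) -> float:
    """Convert mmol/kg bw/day to mg/kg bw/day (molecular weight in g/mol)."""
    return value_mmol * molecular_weight


def mg_to_mmol(value_mg: float, molecular_weight: float) -> float:
    """Convert mg/kg bw/day to mmol/kg bw/day (molecular weight in g/mol)."""
    if molecular_weight <= 0:
        raise ValueError("molecular weight must be positive")
    return value_mg / molecular_weight


# Example: ethanol (CCO) has a molecular weight of about 46.07 g/mol.
dose_mg = mmol_to_mg(0.5, 46.07)
dose_mmol = mg_to_mmol(dose_mg, 46.07)
```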


The following table gives an overview of similar (neighbor) compounds that have been used for the prediction:


The table displays the chemical structure, the measured activities and the similarity of all neighbors. If the training dataset contains duplicated structures with different measured activities, they are treated as separate neighbors and influence the prediction results accordingly.
By default neighbors are sorted by descending similarity; other sorting criteria can be selected with a click on the arrows in the table header.
Select one of the tabs at the top to switch to another endpoint.


With “1.2 Batch mode” the result page is slightly different:


Results are displayed in a table with prediction results for all submitted compounds. Each row displays the same information as the overview in “1.1 Single mode”.


Additionally, the table can be downloaded as a CSV file (choose “Import CSV” to import it into Excel or other spreadsheet programs).


The resulting file contains the same information as the GUI table.


LAZAR Docker Service Environment


The LAZAR service comes with its own environment that includes everything needed to run the service without changes to your operating system (OS).
To achieve this we use Docker (www.docker.com), a platform for installing and managing a virtual OS on your host machine. Docker is the only program that needs to be installed on the host OS.
The main benefits of running lazar in a docker image are independence from the host OS and the possibility of versioning the different development stages of our service.


Docker manages the different stages of an OS, together with all included programs and services, as images. These images can be shared through the external Docker Hub service (https://hub.docker.com) via SSH access to keep them private.


We started with Debian 7 as the OS for the LAZAR service followed by an installation of all necessary programs and tools required for the service. The LAZAR service provided is actually separated into the main LAZAR code and the LAZAR-GUI interface. After installation we took a snapshot of the current state of the virtual OS in order to save all installed programs within this Docker image.


To use this image, first download it with the Docker installation on your host OS:


docker pull gebele/nestec:v7


Then start it, passing some extra flags:


docker run -p 8088:8088 -itd gebele/nestec:v7 bash -c "/etc/init.d/supervisor start && /etc/init.d/lazar start"


-p maps port 8088 inside the container to a port on your host (in this case also 8088); the format is HOST_PORT:CONTAINER_PORT.
-itd combines three flags: -i keeps the console input open, -t attaches a terminal, and -d runs the container in detached mode (in the background).
gebele/nestec:v7 is the name of the image followed by a tag (in this case v7, which we use for versioning).
bash -c runs the quoted command inside the container; here it starts supervisor and the LAZAR service.


This creates and starts a Docker container in which the LAZAR service is running. Point your browser to http://localhost:8088/ to see the LAZAR-GUI.


Docker images vs. containers


To understand the difference between an image and a container, think of images as versions of the service and of containers as the running instances of the service that you actually work with.


Every run command creates a new container from an image. This behavior can be used to run several instances of the LAZAR service in parallel. For example, if you want to compare different development stages, or give several users their own instance of the same version, all you have to do is give each container its own port forwarding by adjusting the -p flag with a different host port.
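
As an illustration of running several instances side by side, the following Python sketch builds one docker run command per host port, reusing the image name, container port and startup command given above. The host ports 8089 and 8090 are arbitrary examples, and the sketch only prints the commands instead of executing them.

```python
import shlex
from typing import List

IMAGE = "gebele/nestec:v7"
CONTAINER_PORT = 8088  # port of the LAZAR service inside the container
STARTUP = "/etc/init.d/supervisor start && /etc/init.d/lazar start"


def docker_run_command(host_port: int, image: str = IMAGE) -> List[str]:
    """Build the docker run argument list for one LAZAR instance."""
    return [
        "docker", "run",
        "-p", f"{host_port}:{CONTAINER_PORT}",  # HOST_PORT:CONTAINER_PORT
        "-itd", image,
        "bash", "-c", STARTUP,
    ]


# One container per host port; 8089 and 8090 are example values.
commands = [docker_run_command(port) for port in (8088, 8089, 8090)]
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argument list can be passed to subprocess.run(cmd, check=True) on a host where Docker is installed. The instances are then reachable at http://localhost:8088/, http://localhost:8089/ and http://localhost:8090/.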


A running container can be stopped with the docker stop CONTAINER_ID command, which is commonly used to stop an old version before you update and run a new one. If for any reason you want to re-run an older LAZAR version, you can start its container again with docker start CONTAINER_ID; the LAZAR service is then started automatically and works as you left it the last time.


________________
References
Bender A, Mussa HY, Glen RC. Molecular Similarity Searching Using Atom Environments, Information-Based Feature Selection, and a Naive Bayesian Classifier. J Chem Inf Comput Sci. 2004;44:170-178.


Gütlein M, Karwath A, Kramer S. CheS-Mapper - Chemical space mapping and visualization in 3D. J Cheminform. 2012;4:7. doi: 10.1186/1758-2946-4-7.