---
layout: post
title: "Lazar Models and how to trigger them"
description: "I have implemented several underlying statistical learners within Lazar. There are kernel models for classification and regression. There are also facilities for physico-chemical descriptor calculation."
category: Usage
tags: [Lazar, Algorithm, Tutorials]
---
{% include JB/setup %}

**I have implemented several underlying statistical learners within Lazar. There are kernel models for classification and regression. There are also facilities for physico-chemical descriptor calculation.**

There are two flavors of Lazar models:

  1. based on subgraphs
  2. based on physico-chemical descriptors

Type 1 is the default and has the fewest requirements: it works with only a dataset URI supplied. Type 2 requires the user to supply a dataset with pre-computed physico-chemical descriptor values (a feature dataset).

The next section discusses how to create feature datasets; after that, the whole procedure is summarized.


# Creating a Feature Dataset


Lazar models can be equipped with a [feature dataset](/algorithm/2012/05/02/calculating-physico-chemical-descriptors-with-opentox-algorithm). This feature dataset must already exist at the time of Lazar model creation. Currently, only physico-chemical descriptors can be supplied as feature datasets. In the feature dataset, each feature must be annotated so that its DC.description field ends with

    [<pc_type>, <lib>]


e.g.

    Largest Chain [constitutional, cdk]

would be an appropriate DC.description entry. If no such annotation exists in the feature dataset, the user must supply the parameters _pc_type_ and _lib_ to the Lazar algorithm (see below). This ensures that the model can derive descriptor values for unknown query compounds at prediction time.

If no feature dataset is supplied, subgraph descriptors are used in the model.


# Creating a Model


A Lazar model is created by POSTing a dataset URI (and optionally a feature dataset URI) to the Lazar algorithm webservice. Models have parameters, and parameters have values (e.g. _params[:foo]=bar_ is obtained by passing _-d "foo=bar"_ via curl). The flowchart below shows the decision process for model building:

<img src="http://www.maunz.de/wordpress/wp-content/uploads/2011/05/Workflow_Algorithms6.png" width="600px">

Note: This chart displays only a subset of options. It is not necessary to pass any options apart from _dataset_uri_, the URI of the training dataset. The standard models are _weighted_majority_vote_ for classification and _local_svm_regression_ for regression.
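In the simplest case, model creation therefore needs only the training dataset URI. The sketch below composes such a call with curl; the host and dataset id are hypothetical placeholders, and the command is echoed as a dry run so it can be inspected before sending it to a real service:

```shell
# Minimal model creation: only dataset_uri is required.
# HOST and the dataset id are hypothetical placeholders.
HOST="http://localhost:8080"
DATASET_URI="${HOST}/dataset/1"   # training dataset (placeholder)

# Dry run: remove 'echo' to actually POST to the service.
# The service would respond with the URI of the new Lazar model.
echo curl -X POST -d "dataset_uri=${DATASET_URI}" "${HOST}/algorithm/lazar"
```

Because no feature dataset or prediction algorithm is given, the defaults apply: subgraph descriptors with _weighted_majority_vote_ (classification) or _local_svm_regression_ (regression).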

# Summary of models, see also [this post](/algorithm/2012/05/02/data-mining-and-machine-learning-algorithms-in-lazar)

These models are available (textual form of the leaf nodes of the flowchart):

* Weighted Majority Vote for classification <br />
  (prediction_algorithm = _weighted_majority_vote_).
* SVM for classification <br />
  (prediction_algorithm = _local_svm_classification_) <br />
  and regression <br />
  (prediction_algorithm = _local_svm_regression_).


# Parameter Summary


Mandatory parameters are in bold. Default values are listed below.

* **dataset_uri**:

  * URI of the training dataset.

* prediction_feature:

  * URI of the prediction feature in the training dataset.

* prediction_algorithm:

  * One of weighted_majority_vote, local_svm_classification, local_svm_regression.


The following are mutually exclusive:

* feature_generation_uri: Used for subgraphs

  * host/algorithm/fminer/bbrc, host/algorithm/fminer/last.

* feature_dataset_uri: Used for physico-chemical descriptors

  * URI of the physico-chemical descriptor feature dataset.

Further parameters:

* pc_type: The physico-chemical type(s) when using a feature_dataset_uri.

  * [See this post](/algorithm/2012/05/02/calculating-physico-chemical-descriptors-with-opentox-algorithm). Short summary: supply a comma-separated list, <br />e.g. "pc_type=constitutional,electronic"

* lib: The library or libraries when using a feature_dataset_uri.

  * [See this post](/algorithm/2012/05/02/calculating-physico-chemical-descriptors-with-opentox-algorithm). Short summary: supply a comma-separated list, e.g. "lib=cdk,openbabel"

* nr_hits: Whether subgraphs should be weighted with their occurrence counts in the instances (frequency).

  * One of true, false.

* min_sim: The minimum similarity threshold for neighbors.

  * Numeric value in \[0,1\].

* min_train_performance: The minimum training performance for local_svm_classification (Accuracy) and local_svm_regression (R-squared).

  * Numeric value in \[0,1\].
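Putting the descriptor-related parameters together, a call for a physico-chemical model might look as follows. All URIs are hypothetical placeholders, pc_type and lib are only needed if the feature dataset lacks the DC.description annotation described above, and the command is echoed as a dry run:

```shell
# Regression model over a pre-computed descriptor dataset (hypothetical URIs).
HOST="http://localhost:8080"
DATASET_URI="${HOST}/dataset/1"           # training dataset (placeholder)
FEATURE_DATASET_URI="${HOST}/dataset/2"   # pre-computed descriptors (placeholder)

# Dry run: remove 'echo' to actually POST.
# pc_type and lib can be omitted if the feature dataset is annotated.
echo curl -X POST \
  -d "dataset_uri=${DATASET_URI}" \
  -d "feature_dataset_uri=${FEATURE_DATASET_URI}" \
  -d "pc_type=constitutional,electronic" \
  -d "lib=cdk" \
  -d "prediction_algorithm=local_svm_regression" \
  "${HOST}/algorithm/lazar"
```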

Here are the default values of some parameters:

* prediction_algorithm=

  * weighted_majority_vote (classification)

  * local_svm_regression (regression)

* feature_dataset_uri= not set

* pc_type=not set (autodetected from feature dataset, if applicable)

* lib=not set (autodetected from feature dataset, if applicable)

* nr_hits=

  * false (classification using weighted_majority_vote)

  * true (all others)

* min_sim=

  * "0.3" (nominal features, no feature dataset used)

  * "0.4" (physico-chemical descriptors, feature dataset used)

* min_train_performance=0.1
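Any of the defaults above can be overridden explicitly. As a final sketch, here is a subgraph-based classification model with non-default options; the host and dataset URIs are hypothetical, and the command is again echoed as a dry run:

```shell
# Subgraph-based classification with non-default options (hypothetical URIs).
HOST="http://localhost:8080"

# Dry run: remove 'echo' to actually POST.
echo curl -X POST \
  -d "dataset_uri=${HOST}/dataset/1" \
  -d "feature_generation_uri=${HOST}/algorithm/fminer/bbrc" \
  -d "prediction_algorithm=local_svm_classification" \
  -d "nr_hits=true" \
  -d "min_sim=0.3" \
  "${HOST}/algorithm/lazar"
```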

# References

* [A post](/algorithm/2012/05/02/data-mining-and-machine-learning-algorithms-in-lazar) details the data mining and machine learning components involved.
* The [README](https://github.com/opentox/algorithm/tree/development) details all settings.
* From a higher perspective: a complete [tutorial](/algorithm/2012/05/01/services-tutorial---lazar-feature-generation-feature-selection-validation) streamlines the process.