lazar
=====

Ruby libraries for the lazar framework

Dependencies
------------

  `lazar` depends on several external programs and libraries. All required libraries are installed by the `gem install lazar` command.
  If any dependency fails to install, check that the required development packages are installed with your operating system's package manager (e.g. `apt`, `rpm`, `pacman`, ...).
  You will need a working Java runtime to use the descriptor calculation algorithms from the CDK and JOELib libraries.

Installation
------------

  `gem install lazar`

  Please be patient: compiling the external libraries can take a long time. If the installation fails, you can try to install manually:

  ```
  git clone https://github.com/opentox/lazar.git
  cd lazar
  ruby ext/lazar/extconf.rb
  bundle install
  ```

  The output should give you more verbose information that can help in debugging (e.g. to identify missing libraries).

Tutorial
--------

Execute the following commands either from an interactive Ruby shell or a Ruby script:

### Create and use `lazar` models for small molecules

#### Create a training dataset

  Create a CSV file with two columns. The first line (the header) should contain the identifier type, either SMILES or InChI, in the first column and the endpoint name in the second column. Each following line should contain the SMILES or InChI of a training compound in the first column and its toxic activity (qualitative or quantitative) in the second column. Use -log10 transformed values for regression datasets. Add metadata to a JSON file with the same basename containing the fields "species", "endpoint", "source" and "unit" (regression only). You can find example training data on [GitHub](https://github.com/opentox/lazar-public-data).
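
  As a minimal sketch of the file layout described above, the following script writes a two-column CSV file and a JSON metadata file with the same basename, using only Ruby's standard library. The basename, column name, compounds, concentrations and metadata values are made-up placeholders, and the choice of mol/L as input unit is an assumption; replace them with your own data.

```ruby
require "csv"
require "json"

# Hypothetical example data: SMILES => concentration (assumed mol/L, placeholder values)
measurements = {
  "CCO"      => 0.25,
  "c1ccccc1" => 0.001,
  "CC(=O)O"  => 0.05,
}

basename = "example_training_data" # hypothetical file name

# Header line: identifier type (first column) and endpoint name (second column)
CSV.open("#{basename}.csv", "w") do |csv|
  csv << ["SMILES", "LC50_log10"]
  measurements.each do |smiles, concentration|
    # -log10 transformation for regression datasets
    csv << [smiles, (-Math.log10(concentration)).round(3)]
  end
end

# Metadata file with the same basename as the CSV file
metadata = {
  "species"  => "Example species",
  "endpoint" => "Example endpoint",
  "source"   => "https://example.org/placeholder",
  "unit"     => "-log10(mol/L)", # regression only
}
File.write("#{basename}.json", JSON.pretty_generate(metadata))
```

  For a classification dataset the second column would hold the class labels instead, and the "unit" field would be omitted.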

#### Create and validate a `lazar` model with default algorithms and parameters

  `validated_model = Model::Validation.create_from_csv_file "EPAFHM_log10.csv"`

  This command will create a `lazar` model and validate it with three independent 10-fold crossvalidations.
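
  To illustrate what "three independent 10-fold crossvalidations" means, the plain Ruby sketch below shuffles a list of placeholder compounds three times and splits each shuffle into ten folds, so that every compound is used exactly once as test data per repetition. This is only an illustration of the general procedure, not `lazar`'s implementation; the method name `kfold_splits` and the seeds are hypothetical.

```ruby
# Split items into k folds after shuffling with a seeded random generator.
# Returns an array of [training_set, test_set] pairs, one pair per fold.
# (Hypothetical helper for illustration, not part of lazar.)
def kfold_splits(items, k: 10, seed: 1)
  shuffled = items.shuffle(random: Random.new(seed))
  folds = Array.new(k) { [] }
  shuffled.each_with_index { |item, i| folds[i % k] << item }
  folds.each_index.map do |i|
    test = folds[i]
    # All folds except fold i form the training set
    training = folds.each_with_index.reject { |_, j| j == i }.flat_map(&:first)
    [training, test]
  end
end

compounds = (1..25).map { |i| "compound_#{i}" } # placeholder identifiers

# Three independent repetitions, each with its own shuffle
repetitions = (1..3).map { |seed| kfold_splits(compounds, k: 10, seed: seed) }

repetitions.each_with_index do |splits, r|
  sizes = splits.map { |_training, test| test.size }
  puts "repetition #{r + 1}: #{splits.size} folds, test set sizes #{sizes.inspect}"
end
```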

#### Inspect crossvalidation results

  `validated_model.crossvalidations`

#### Predict a new compound

  Create a compound

  `compound = Compound.from_smiles "NC(=O)OCCC"`

  Predict Fathead Minnow Acute Toxicity

  `validated_model.predict compound`

#### Experiment with other algorithms

  You can pass algorithm specifications as parameters to the `Model::Validation.create_from_csv_file` and `Model::Lazar.create` commands. Algorithms for descriptors, similarity calculations, feature selection and local models are specified in the `algorithms` parameter. Unspecified algorithms and parameters are substituted by default values. The example below selects

  - MP2D fingerprint descriptors
  - Tanimoto similarity with a threshold of 0.1
  - no feature selection
  - weighted majority vote predictions

  ```
algorithms = {
  :descriptors => { # descriptor algorithm
    :method => "fingerprint", # fingerprint descriptors
    :type => "MP2D" # fingerprint type, e.g. FP4, MACCS
  },
  :similarity => { # similarity algorithm
    :method => "Algorithm::Similarity.tanimoto",
    :min => 0.1 # similarity threshold for neighbors
  },
  :feature_selection => nil, # no feature selection
  :prediction => { # local modelling algorithm
    :method => "Algorithm::Classification.weighted_majority_vote",
  },
}

training_dataset = Dataset.from_csv_file "hamster_carcinogenicity.csv"
model = Model::Lazar.create training_dataset: training_dataset, algorithms: algorithms
  ```
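
  For intuition about the two algorithms selected above, here is a self-contained sketch of Tanimoto similarity on fingerprints and of a similarity-weighted majority vote over neighbors. It assumes fingerprints are plain arrays of feature identifiers and that each neighbor votes with its similarity as weight; the method names, the toy fingerprints and these details are assumptions for illustration and may differ from what `lazar` does internally.

```ruby
# Tanimoto similarity of two fingerprints given as arrays of feature identifiers:
# number of shared features divided by number of distinct features in either.
def tanimoto(a, b)
  union = (a | b).size
  return 0.0 if union.zero?
  (a & b).size.to_f / union
end

# Similarity-weighted majority vote over all neighbors above the threshold.
# training: array of { fingerprint: [...], activity: "..." } (hypothetical layout)
def weighted_majority_vote(query, training, min: 0.1)
  votes = Hash.new(0.0)
  training.each do |example|
    sim = tanimoto(query, example[:fingerprint])
    votes[example[:activity]] += sim if sim >= min
  end
  return nil if votes.empty? # no sufficiently similar neighbors
  votes.max_by { |_activity, weight| weight }.first
end

# Toy training data with made-up fingerprint features
training = [
  { fingerprint: %w[a b c d], activity: "active" },
  { fingerprint: %w[a b e],   activity: "active" },
  { fingerprint: %w[x y z],   activity: "inactive" },
  { fingerprint: %w[c x y],   activity: "inactive" },
]

puts tanimoto(%w[a b c], %w[a b c d])            # 3 shared / 4 distinct = 0.75
puts weighted_majority_vote(%w[a b c], training) # active
```

  In the toy data the third training example shares no features with the query, so it falls below the 0.1 threshold and does not vote.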

  The next example creates a regression model with

  - calculated descriptors from OpenBabel libraries
  - weighted cosine similarity and a threshold of 0.5
  - descriptors that are correlated with the endpoint
  - local partial least squares models from the R caret package

  ```
algorithms = {
  :descriptors => { # descriptor algorithm
    :method => "calculate_properties",
    :features => PhysChem.openbabel_descriptors,
  },
  :similarity => { # similarity algorithm
    :method => "Algorithm::Similarity.weighted_cosine",
    :min => 0.5
  },
  :feature_selection => { # feature selection algorithm
    :method => "Algorithm::FeatureSelection.correlation_filter",
  },
  :prediction => { # local modelling algorithm
    :method => "Algorithm::Caret.pls",
  },
}
training_dataset = Dataset.from_csv_file "EPAFHM_log10.csv"
model = Model::Lazar.create(training_dataset: training_dataset, algorithms: algorithms)
  ```
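
  The idea behind selecting "descriptors that are correlated with the endpoint" can be sketched in plain Ruby as follows. The sketch assumes a Pearson correlation and a hypothetical cutoff `min_r`; the statistic and selection criterion used by `Algorithm::FeatureSelection.correlation_filter` may differ, so consult the API documentation for the actual behaviour. Descriptor names and values are made up.

```ruby
# Pearson correlation coefficient of two equally long numeric arrays
def pearson(x, y)
  n = x.size.to_f
  mean_x = x.sum / n
  mean_y = y.sum / n
  cov = x.zip(y).sum { |a, b| (a - mean_x) * (b - mean_y) }
  var_x = x.sum { |a| (a - mean_x)**2 }
  var_y = y.sum { |b| (b - mean_y)**2 }
  return 0.0 if var_x.zero? || var_y.zero? # constant column: treat as uncorrelated
  cov / Math.sqrt(var_x * var_y)
end

# Keep the names of descriptors whose absolute correlation with the endpoint
# reaches min_r (hypothetical threshold, for illustration only)
def correlation_filter(descriptors, endpoint, min_r: 0.5)
  descriptors.select { |_name, values| pearson(values, endpoint).abs >= min_r }.keys
end

endpoint = [1.0, 2.0, 3.0, 4.0, 5.0] # placeholder -log10 activities
descriptors = {
  "logP"     => [0.9, 2.1, 2.9, 4.2, 5.1], # rises with the endpoint
  "constant" => [1.0, 1.0, 1.0, 1.0, 1.0], # carries no information
  "noise"    => [2.0, 1.0, 3.0, 1.0, 2.0], # unrelated to the endpoint
}

p correlation_filter(descriptors, endpoint)
```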

Please consult the [API documentation](http://rdoc.info/gems/lazar) and [source code](https://github.com/opentox/lazar) for up-to-date information about implemented algorithms:

- Descriptor algorithms
  - [Compounds](http://www.rubydoc.info/gems/lazar/OpenTox/Compound)
  - [Nanoparticles](http://www.rubydoc.info/gems/lazar/OpenTox/Nanoparticle)
- [Similarity algorithms](http://www.rubydoc.info/gems/lazar/OpenTox/Algorithm/Similarity)
- [Feature selection algorithms](http://www.rubydoc.info/gems/lazar/OpenTox/Algorithm/FeatureSelection)
- Local models
  - [Classification](http://www.rubydoc.info/gems/lazar/OpenTox/Algorithm/Classification)
  - [Regression](http://www.rubydoc.info/gems/lazar/OpenTox/Algorithm/Regression)
  - [R caret](http://www.rubydoc.info/gems/lazar/OpenTox/Algorithm/Caret)


You can find more working examples in the `lazar` `model-*.rb` and `validation-*.rb` [tests](https://github.com/opentox/lazar/tree/master/test).

### Create and use `lazar` nanoparticle models

#### Create and validate a `nano-lazar` model from eNanoMapper with default algorithms and parameters

  `validated_model = Model::Validation.create_from_enanomapper`

  This command will mirror the eNanoMapper database in the local database, create a `nano-lazar` model and validate it with five independent 10-fold crossvalidations.

#### Inspect crossvalidation results

  `validated_model.crossvalidations`

#### Predict nanoparticle toxicities

  Choose a random nanoparticle from the "Protein Corona" dataset:

  ```
  training_dataset = Dataset.where(:name => "Protein Corona Fingerprinting Predicts the Cellular Interaction of Gold and Silver Nanoparticles").first
  nanoparticle = training_dataset.substances.shuffle.first
  ```

  Predict the "Net Cell Association" endpoint

  `validated_model.predict nanoparticle`

#### Experiment with other datasets, endpoints and algorithms

  You can pass `training_dataset`, `prediction_feature` and `algorithms` parameters to the `Model::Validation.create_from_enanomapper` command. The procedure and options are the same as for compounds. The following commands create and validate a `nano-lazar` model with

  - measured P-CHEM properties as descriptors
  - descriptors selected with correlation filter
  - weighted cosine similarity with a threshold of 0.5
  - Caret random forests

```
algorithms = {
  :descriptors => {
    :method => "properties",
    :categories => ["P-CHEM"],
  },
  :similarity => {
    :method => "Algorithm::Similarity.weighted_cosine",
    :min => 0.5
  },
  :feature_selection => {
    :method => "Algorithm::FeatureSelection.correlation_filter",
  },
  :prediction => {
    :method => "Algorithm::Caret.rf",
  },
}
validation_model = Model::Validation.from_enanomapper algorithms: algorithms
```


  Detailed documentation and validation results for nanoparticle models can be found in this [publication](https://github.com/enanomapper/nano-lazar-paper/blob/master/nano-lazar.pdf).

Documentation
-------------
* [API documentation](http://rdoc.info/gems/lazar)

Copyright
---------
Copyright (c) 2009-2017 Christoph Helma, Martin Guetlein, Micha Rautenberg, Andreas Maunz, David Vorgrimmler, Denis Gebele. See LICENSE for details.