Filename: `dsspeed.pdf`
Description: A benchmark comparison of different dataset implementations.
Author: Andreas Maunz `<andreas@maunz.de>`
Date: 10/2012
Experiments were run on branch `development`, using a VirtualBox VM (2 CPUs, 2 GB RAM), Debian 6.0.5, 64 bit.
# Dataset Creation
Storing a dataset in the 4store backend.
## Generating and Storing Triples
Implementation that queries the `/compound` service for compound URIs:
```bash
date
task=`curl -X POST \
  -F "file=@/home/am/opentox-ruby/opentox-test/test/data/kazius.csv;type=text/csv" \
  http://localhost:8083/dataset 2>/dev/null`
get_result $task
date
```
Timings for uploading the Kazius dataset (>4000 compounds; repeated three times, median reported):

```
Sat Nov 3 11:10:04 CET 2012
http://localhost:8083/dataset/6a92fbf1-9c46-4c72-a487-365589c1210d
Sat Nov 3 11:10:41 CET 2012
```
Uploading takes 37s. This time is consumed by the workflow as follows:
- Compound Triples: 33.236s (89.8 %)
- Value Triples: 1.052s (2.8 %)
- Other Triples: <1s (<2.7 %)
- 4store upload: <3s (<8.1 %)
Based on these results, I suggest avoiding queries to the compound service.
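For illustration, a hedged sketch of what avoiding the round trip could look like, assuming compounds are identified by InChI and that URIs follow a `<service>/compound/<escaped InChI>` layout (this layout and the default host are assumptions, not taken from the source):

```ruby
require 'uri'

# Hypothetical sketch: mint a compound URI locally instead of POSTing
# each structure to the /compound service. The URI layout and the
# default service host are assumptions for illustration only.
def local_compound_uri(inchi, service = "http://localhost:8082")
  "#{service}/compound/#{URI.encode_www_form_component(inchi)}"
end

# local_compound_uri("InChI=1S/CH4/h1H4")
# => "http://localhost:8082/compound/InChI%3D1S%2FCH4%2Fh1H4"
```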
# Dataset Read-In
Populating an `OpenTox::Dataset` object in memory by reading from the 4store backend.
## Request per row
Implementation with one query for data entries **per compound**:

```ruby
@compounds.each_with_index do |compound,i|
  # one query per compound: bind this compound's data entries via its index
  query = RDF::Query.new do
    pattern [:data_entry, RDF::OLO.index, i]
    pattern [:data_entry, RDF::OT.values, :values]
    pattern [:values, RDF::OT.feature, :feature]
    pattern [:feature, RDF::OLO.index, :feature_idx]
    pattern [:values, RDF::OT.value, :value]
  end
  # sort by feature index; convert values of numeric features to Float
  values = query.execute(@rdf).sort_by{|s| s.feature_idx}.collect do |s|
    (numeric_features[s.feature_idx] and s.value.to_s != "") ? \
      s.value.to_s.to_f : s.value.to_s
  end
  # empty strings become nil entries
  @data_entries << values.collect{|v| v == "" ? nil : v}
end
```
Timings for reading a BBRC feature dataset (85 compounds, 53 features; repeated three times, median reported):

```
                 user     system      total        real
ds reading   6.640000   0.090000   6.730000  (  7.429505)
```
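The timing tables in this document have the shape of Ruby's `Benchmark` output. As a minimal sketch of how such numbers could be produced (the actual harness is not shown in the source; `read_dataset` is a hypothetical stand-in for the read-in code under test):

```ruby
require 'benchmark'

# Sketch of a timing harness; read_dataset is a placeholder for the
# dataset read-in under test, not a method from the source.
Benchmark.bm(10) do |bm|
  bm.report("ds reading") { read_dataset }
end
```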
## Single Table
Now follow some optimized versions that retrieve all entries at once. A few variables have been renamed for clarity in the query:
```ruby
query = RDF::Query.new do
  # compound index: now a free variable
  pattern [:data_entry, RDF::OLO.index, :cidx]
  pattern [:data_entry, RDF::OT.values, :vals]
  pattern [:vals, RDF::OT.feature, :f]
  pattern [:f, RDF::OLO.index, :fidx]
  pattern [:vals, RDF::OT.value, :val]
end
```
Also, `RDF::Query::Solutions#order_by` is used instead of the generic `Enumerable#sort_by`, which may have advantages (not tested separately).
### 'Row Slicing' Version
Results are sorted by compound, then by feature. The flat result array is sliced into rows:

```ruby
@data_entries = query.execute(@rdf).order_by(:cidx, :fidx).collect { |entry|
  entry.val.to_s.blank? ? nil : \
    (numeric_features[entry.fidx] ? entry.val.to_s.to_f : entry.val.to_s)
}.each_slice(@features.size).to_a
```
Timings:
```
                 user     system      total        real
ds reading   3.850000   0.090000   3.940000  (  4.643435)
```
### 'Fill Table' Version
A modification of 'Row Slicing' that avoids lookup operations where possible and pre-allocates `@data_entries`:

```ruby
clim = (@compounds.size-1)
cidx = 0
fidx = 0
num  = numeric_features[fidx]
@data_entries = \
  (Array.new(@compounds.size*@features.size)).each_slice(@features.size).to_a
# order by feature index so as to compute the numeric status less frequently
query.execute(@rdf).order_by(:fidx, :cidx).each { |entry|
  val = entry.val.to_s
  unless val.blank?
    @data_entries[cidx][fidx] = (num ? val.to_f : val)
  end
  if (cidx < clim)
    cidx += 1
  else
    # column complete: move on to the next feature
    cidx = 0
    fidx += 1
    num = numeric_features[fidx]
  end
}
```
Timings:
```
                 user     system      total        real
ds reading   3.820000   0.040000   3.860000  (  4.540800)
```
### 'SPARQL' Version
Modification of 'Fill Table' that loads data entries via SPARQL instead of an RDF query:

```ruby
sparql = "SELECT ?value FROM <#{uri}> WHERE {
  ?data_entry <#{RDF::OLO.index}> ?cidx ;
              <#{RDF::OT.values}> ?v .
  ?v <#{RDF::OT.feature}> ?f ;
     <#{RDF::OT.value}> ?value .
  ?f <#{RDF::OLO.index}> ?fidx .
} ORDER BY ?fidx ?cidx"
```
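The source does not show how the result rows are consumed. A minimal sketch, assuming a hypothetical helper `sparql_select(query)` that returns one value string per binding in the query's `ORDER BY ?fidx ?cidx` (column-major) order, so the fill loop mirrors 'Fill Table':

```ruby
# Minimal sketch, not the actual implementation: fill the
# pre-allocated table column by column from the ordered results.
clim = @compounds.size - 1
cidx = 0
fidx = 0
num  = numeric_features[fidx]
@data_entries = Array.new(@compounds.size) { Array.new(@features.size) }
sparql_select(sparql).each { |val| # sparql_select is a hypothetical helper
  @data_entries[cidx][fidx] = (num ? val.to_f : val) unless val.empty?
  if cidx < clim
    cidx += 1
  else
    cidx = 0
    fidx += 1
    num = numeric_features[fidx]
  end
}
```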
Timings:
```
                 user     system      total        real
ds reading   1.690000   0.050000   1.740000  (  2.362236)
```
## Dataset Tests
Test runtimes changed as follows:
Test                 old        'Row Slicing'   'SPARQL'
-----------------    --------   --------------  ---------
dataset.rb           6.998s     7.406s          6.341s
dataset_large.rb     64.230s    25.231s         25.071s

Table: Runtimes
### Conclusions
In view of these results, I implemented the 'SPARQL' version.
### Note
A further modification that avoids querying compounds separately made runtimes much worse again.
The idea was to get the compound together with each data entry:
```
#<RDF::Query::Solution:0x24f41cc(
  {
    :compound=>#<RDF::URI:0x2638c68(http://loca [...]
    :cidx=>#<RDF::Literal::Integer:0x2639190("3 [...]
    :data_entry=>#<RDF::Node:0x2639618(_:b1324f [...]
    :vals=>#<RDF::Node:0x17699d0(_:b32bf4000000 [...]
    :f=>#<RDF::URI:0x1638ed0(http://localhost:8 [...]
    :fidx=>#<RDF::Literal::Integer:0x271c170("0 [...]
    :val=>#<RDF::Literal::Integer:0x176879c("0" [...]
  }
)>
```
One would add compounds to `@compounds` only during the first pass through the first feature column.
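A hedged sketch of that abandoned idea, assuming the query above is extended with a pattern binding `:compound` for each `:data_entry`; since the compound count is then unknown in advance, rows are grown instead of pre-allocated:

```ruby
# Sketch only: collect compounds while the first feature column
# (fidx == 0) streams by, and fill rows using the bound indices.
@compounds    = []
@data_entries = []
query.execute(@rdf).order_by(:fidx, :cidx).each { |entry|
  fidx = entry.fidx.to_i
  cidx = entry.cidx.to_i
  if fidx == 0 # first column: register each compound exactly once
    @compounds << entry.compound.to_s
    @data_entries << []
  end
  val = entry.val.to_s
  unless val.blank?
    @data_entries[cidx][fidx] = (numeric_features[fidx] ? val.to_f : val)
  end
}
```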