1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
|
Filename: `dsspeed.pdf`
Description: A benchmark comparison of different dataset implementations.
Author: Andreas Maunz `<andreas@maunz.de>`
Date: 10/2012
# Request per row
(Old) implementation with one query for data entries **per compound**.
@compounds.each_with_index do |compound,i|
query = RDF::Query.new do
pattern [:data_entry, RDF::OLO.index, i]
pattern [:data_entry, RDF::OT.values, :values]
pattern [:values, RDF::OT.feature, :feature]
pattern [:feature, RDF::OLO.index, :feature_idx]
pattern [:values, RDF::OT.value, :value]
end
values = query.execute(@rdf).sort_by{|s| s.feature_idx}.collect do |s|
(numeric_features[s.feature_idx] and s.value.to_s != "") ? \
s.value.to_s.to_f : s.value.to_s
end
@data_entries << values.collect{|v| v == "" ? nil : v}
end
Timings for reading a BBRC feature dataset (85 compounds, 53 features. Repeated three times, median reported):
user system total real
ds reading 6.640000 0.090000 6.730000 ( 7.429505)
# Single Table
Now some optimized versions that retrieve entries all at once. A few variables have been renamed for clarity in the query:
query = RDF::Query.new do
# compound index: now a free variable
pattern [:data_entry, RDF::OLO.index, :cidx]
pattern [:data_entry, RDF::OT.values, :vals]
pattern [:vals, RDF::OT.feature, :f]
pattern [:f, RDF::OLO.index, :fidx]
pattern [:vals, RDF::OT.value, :val]
end
Also `RDF::Query::Solutions#order_by` is used instead of the generic `Enumerable#sort_by`, which may have advantages (not tested seperately).
## 'Row Slicing' Version
Results are sorted by compound, then by feature. The long array is sliced into rows.
@data_entries = query.execute(@rdf).order_by(:cidx, :fidx).collect { |entry|
entry.val.to_s.blank? ? nil : \
(numeric_features[entry.fidx] ? entry.val.to_s.to_f : entry.val.to_s)
}.each_slice(@features.size).to_a
Timings:
user system total real
ds reading 3.850000 0.090000 3.940000 ( 4.643435)
## 'Fill Table' Version
Modification of 'Row Slicing' that avoids lookup operations where possible. Also pre-allocates `@data_entries`.
clim=(@compounds.size-1)
cidx=0
fidx=0
num=numeric_features[fidx]
@data_entries = \
(Array.new(@compounds.size*@features.size)).each_slice(@features.size).to_a
# order by feature index as to compute numeric status less frequently
query.execute(@rdf).order_by(:fidx, :cidx).each { |entry|
val = entry.val.to_s
unless val.blank?
@data_entries[cidx][fidx] = (num ? val.to_f : val)
end
if (cidx < clim)
cidx+=1
else
cidx=0
fidx+=1
num=numeric_features[fidx]
end
}
Timings:
user system total real
ds reading 3.820000 0.040000 3.860000 ( 4.540800)
## 'SPARQL' Version
Modification of 'Fill Table' that loads data entrues via SPARQL, not RDF query.
sparql = "SELECT ?value FROM <#{uri}> WHERE {
?data_entry <#{RDF::OLO.index}> ?cidx ;
<#{RDF::OT.values}> ?v .
?v <#{RDF::OT.feature}> ?f;
<#{RDF::OT.value}> ?value .
?f <#{RDF::OLO.index}> ?fidx.
} ORDER BY ?fidx ?cidx"
Timings:
user system total real
ds reading 1.690000 0.050000 1.740000 ( 2.362236)
# Dataset Tests
Test runtimes changed as follows:
Test old 'Row Slicing' 'SPARQL'
---------------- ------- ------------- --------
dataset.rb 6.998s 7.406s 6.341s
dataset_large.rb 64.230s 25.231s 25.071
Table: Runtimes
## Conclusions
In view of the results I implemented the 'SPARQL' version.
## Note
A further modification that avoids querying compounds separately made runtimes much worse again.
The idea was to get the compound together with each data entry:
#<RDF::Query::Solution:0x24f41cc(
{
:compound=>#<RDF::URI:0x2638c68(http://loca [...]
:cidx=>#<RDF::Literal::Integer:0x2639190("3 [...]
:data_entry=>#<RDF::Node:0x2639618(_:b1324f [...]
:vals=>#<RDF::Node:0x17699d0(_:b32bf4000000 [...]
:f=>#<RDF::URI:0x1638ed0(http://localhost:8 [...]
:fidx=>#<RDF::Literal::Integer:0x271c170("0 [...]
:val=>#<RDF::Literal::Integer:0x176879c("0" [...]
}
)>
One would add compounds to `@compounds` only for the first run through column no '1'.
|