doc/dsspeed.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143

Filename: `dsspeed.pdf`  
Description: A benchmark comparison of different dataset implementations.  
Author: Andreas Maunz `<andreas@maunz.de>`  
Date: 10/2012

# Request per row

(Old) implementation with one query for data entries **per compound**.

    @compounds.each_with_index do |compound,i|
      query = RDF::Query.new do
        pattern [:data_entry, RDF::OLO.index, i]
        pattern [:data_entry, RDF::OT.values, :values]
        pattern [:values, RDF::OT.feature, :feature]
        pattern [:feature, RDF::OLO.index, :feature_idx]
        pattern [:values, RDF::OT.value, :value]
      end
      values = query.execute(@rdf).sort_by{|s| s.feature_idx}.collect do |s|
        (numeric_features[s.feature_idx] and s.value.to_s != "") ? \
         s.value.to_s.to_f : s.value.to_s
      end
      @data_entries << values.collect{|v| v == "" ? nil : v}
    end

Timings for reading a BBRC feature dataset (85 compounds, 53 features. Repeated three times, median reported):

                     user     system      total        real
    ds reading   6.640000   0.090000   6.730000 (  7.429505)


# Single Table

Now some optimized versions that retrieve entries all at once. A few variables have been renamed for clarity in the query:

    query = RDF::Query.new do
      # compound index: now a free variable
      pattern [:data_entry, RDF::OLO.index, :cidx] 
      pattern [:data_entry, RDF::OT.values, :vals]
      pattern [:vals, RDF::OT.feature, :f]
      pattern [:f, RDF::OLO.index, :fidx]
      pattern [:vals, RDF::OT.value, :val]
    end

Also `RDF::Query::Solutions#order_by` is used instead of the generic `Enumerable#sort_by`, which may have advantages (not tested seperately).

## 'Row Slicing' Version

Results are sorted by compound, then by feature. The long array is sliced into rows.

    @data_entries = query.execute(@rdf).order_by(:cidx, :fidx).collect { |entry| 
      entry.val.to_s.blank? ? nil : \
      (numeric_features[entry.fidx] ? entry.val.to_s.to_f : entry.val.to_s)
    }.each_slice(@features.size).to_a

Timings:

                     user     system      total        real
    ds reading   3.850000   0.090000   3.940000 (  4.643435)

## 'Fill Table' Version

Modification of 'Row Slicing' that avoids lookup operations where possible. Also pre-allocates `@data_entries`.

    clim=(@compounds.size-1)
    cidx=0
    fidx=0
    num=numeric_features[fidx]
    @data_entries = \
    (Array.new(@compounds.size*@features.size)).each_slice(@features.size).to_a
    # order by feature index as to compute numeric status less frequently
    query.execute(@rdf).order_by(:fidx, :cidx).each { |entry| 
      val = entry.val.to_s
      unless val.blank?
        @data_entries[cidx][fidx] = (num ? val.to_f : val)
      end
      if (cidx < clim)
        cidx+=1
      else
        cidx=0
        fidx+=1
        num=numeric_features[fidx]
      end
    }

Timings:

                     user     system      total        real
    ds reading   3.820000   0.040000   3.860000 (  4.540800)

## 'SPARQL' Version

Modification of 'Fill Table' that loads data entries via SPARQL, not RDF query.

    sparql = "SELECT ?value FROM <#{uri}> WHERE {
      ?data_entry <#{RDF::OLO.index}> ?cidx ;
                  <#{RDF::OT.values}> ?v .
      ?v          <#{RDF::OT.feature}> ?f;
                  <#{RDF::OT.value}> ?value .
      ?f          <#{RDF::OLO.index}> ?fidx.
      } ORDER BY ?fidx ?cidx" 

Timings:

                 user     system      total        real
ds reading   1.690000   0.050000   1.740000 (  2.362236)


# Dataset Tests

Test runtimes changed as follows:

Test             old     'Row Slicing' 'SPARQL'
---------------- ------- ------------- -------- 
dataset.rb       6.998s  7.406s        6.341s
dataset_large.rb 64.230s 25.231s       25.071

Table: Runtimes


## Conclusions

In view of the results I implemented the 'SPARQL' version.


## Note

A further modification that avoids querying compounds separately made runtimes much worse again.
The idea was to get the compound together with each data entry:

    #<RDF::Query::Solution:0x24f41cc(
      {
        :compound=>#<RDF::URI:0x2638c68(http://loca [...]
        :cidx=>#<RDF::Literal::Integer:0x2639190("3 [...]
        :data_entry=>#<RDF::Node:0x2639618(_:b1324f [...]
        :vals=>#<RDF::Node:0x17699d0(_:b32bf4000000 [...]
        :f=>#<RDF::URI:0x1638ed0(http://localhost:8 [...]
        :fidx=>#<RDF::Literal::Integer:0x271c170("0 [...]
        :val=>#<RDF::Literal::Integer:0x176879c("0" [...]
      }
    )>

One would add compounds to `@compounds` only for the first run through column no '1'.