author     Andreas Maunz <andreas@maunz.de>  2012-11-03 11:28:43 +0100
committer  Andreas Maunz <andreas@maunz.de>  2012-11-03 11:28:43 +0100
commit     3cbb1ef43dc03e89b780ac67a74cc30773d8cacf (patch)
tree       2ab2843f795e86dbc37e6f4978f0491f1617c166 /doc
parent     461a20d9071ad80f334d7a3d8c334e2e085821d2 (diff)
Added CSV upload analysis
Diffstat (limited to 'doc')
-rw-r--r--  doc/dsspeed.md  |  53
1 file changed, 44 insertions(+), 9 deletions(-)
diff --git a/doc/dsspeed.md b/doc/dsspeed.md
index 62ab1c9..498181c 100644
--- a/doc/dsspeed.md
+++ b/doc/dsspeed.md
@@ -3,9 +3,44 @@ Description: A benchmark comparison of different dataset implementations.
Author: Andreas Maunz `<andreas@maunz.de>`
Date: 10/2012
-# Request per row
+# Dataset Creation
-(Old) implementation with one query for data entries **per compound**.
+Storing a dataset in the 4store backend.
+
+## Data analysis
+
+Generation of triples, implemented by querying the `/compound` service for compound URIs.
+
+ date
+    task=`curl -X POST -F "file=@/home/am/opentox-ruby/opentox-test/test/data/kazius.csv;type=text/csv" \
+      http://localhost:8083/dataset 2>/dev/null`
+ get_result $task
+ date
+
+Timings for uploading the Kazius dataset (>4000 compounds; repeated three times, median reported):
+
+ Sat Nov 3 11:10:04 CET 2012
+ http://localhost:8083/dataset/6a92fbf1-9c46-4c72-a487-365589c1210d
+ Sat Nov 3 11:10:41 CET 2012
+
+Uploading takes 37s. This time is consumed by the workflow as follows:
+
+- Compound Triples: 33.236s (89.8 %; see the sketch below)
+- Value Triples: 1.052s (2.8 %)
+- Other Triples: <1s (<2.7 %)
+- 4store upload: <3s (<8.1 %)
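+
+The compound-triples step resolves each structure through the compound service, i.e. one HTTP round-trip per compound. A minimal Ruby sketch of that pattern (the endpoint and the `rest-client` usage are assumptions, not the actual implementation):
+
+    require 'rest-client'
+
+    # One POST per structure; the service answers with the compound URI.
+    # With >4000 compounds these round-trips dominate the upload time.
+    # (hypothetical endpoint and helper names)
+    compound_uris = smiles_list.collect do |smiles|
+      RestClient.post("http://localhost:8082/compound", smiles,
+                      content_type: "chemical/x-daylight-smiles").to_s
+    end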
+
+Based on these results, I suggest avoiding queries to the compound service.
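+
+One way of doing so would be computing the canonical identifier locally and constructing the compound URI from it. A minimal sketch, assuming OpenBabel's Ruby bindings and a `<service>/compound/<InChI>` URI scheme (both assumptions, not the actual implementation):
+
+    require 'openbabel'
+    require 'cgi'
+
+    # Compute the InChI locally and build the compound URI from it,
+    # saving one service round-trip per compound
+    obconversion = OpenBabel::OBConversion.new
+    obconversion.set_in_and_out_formats "smi", "inchi"
+    obmol = OpenBabel::OBMol.new
+    obconversion.read_string obmol, smiles
+    inchi = obconversion.write_string(obmol).chomp
+    compound_uri = "http://localhost:8082/compound/#{CGI.escape(inchi)}"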
+
+
+
+# Dataset Read-In
+
+Populating an `OpenTox::Dataset` object in memory by reading from the 4store backend.
+
+## Request per row
+
+Implementation with one query for data entries **per compound**.
    @compounds.each_with_index do |compound,i|
      query = RDF::Query.new do
@@ -28,7 +63,7 @@ Timings for reading a BBRC feature dataset (85 compounds, 53 features. Repeated
    ds reading 6.640000 0.090000 6.730000 ( 7.429505)
-# Single Table
+## Single Table
Now some optimized versions that retrieve entries all at once. A few variables have been renamed for clarity in the query:
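
The single-table retrieval presumably looks along these lines (an illustrative sketch, not the elided file content; the `OT` vocabulary terms and variable names are assumptions based on the query fragments visible elsewhere in the file):

    query = RDF::Query.new do
      pattern [:data_entry, RDF::OT.compound, :compound]
      pattern [:data_entry, RDF::OT.values,   :values]
      pattern [:values,     RDF::OT.feature,  :feature]
      pattern [:values,     RDF::OT.value,    :value]
    end
    solutions = query.execute(@rdf).order_by(:compound, :feature)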
@@ -43,7 +78,7 @@ Now some optimized versions that retrieve entries all at once. A few variables h
Also `RDF::Query::Solutions#order_by` is used instead of the generic `Enumerable#sort_by`, which may have advantages (not tested separately).
-## 'Row Slicing' Version
+### 'Row Slicing' Version
Results are sorted by compound, then by feature. The long array is sliced into rows.
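
A minimal sketch of the slicing step (names are assumptions):

    # With solutions ordered by compound, then feature, the flat value list
    # can be cut into rows of @features.size entries each
    @data_entries = solutions.collect { |s| s[:value].object }
                             .each_slice(@features.size).to_a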
@@ -57,7 +92,7 @@ Timings:
    user system total real
    ds reading 3.850000 0.090000 3.940000 ( 4.643435)
-## 'Fill Table' Version
+### 'Fill Table' Version
Modification of 'Row Slicing' that avoids lookup operations where possible. Also pre-allocates `@data_entries`.
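
A sketch of the idea (the index hashes are assumptions):

    # Pre-allocate the full table, then fill cells via integer indices
    # instead of per-entry position lookups
    @data_entries = Array.new(@compounds.size) { Array.new(@features.size) }
    compound_idx = {}
    @compounds.each_with_index { |c, i| compound_idx[c.uri] = i }
    feature_idx = {}
    @features.each_with_index { |f, i| feature_idx[f.uri] = i }
    solutions.each do |s|
      row = compound_idx[s[:compound].to_s]
      col = feature_idx[s[:feature].to_s]
      @data_entries[row][col] = s[:value].object
    end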
@@ -87,7 +122,7 @@ Timings:
    user system total real
    ds reading 3.820000 0.040000 3.860000 ( 4.540800)
-## 'SPARQL' Version
+### 'SPARQL' Version
Modification of 'Fill Table' that loads data entries via SPARQL instead of an RDF query.
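
The corresponding SPARQL might look as follows (a sketch; the exact query string and the way it is sent to 4store are assumptions):

    sparql = "SELECT ?compound ?feature ?value FROM <#{uri}> WHERE {
        ?data_entry <#{RDF::OT.compound}> ?compound ;
                    <#{RDF::OT.values}> ?values .
        ?values <#{RDF::OT.feature}> ?feature ;
                <#{RDF::OT.value}> ?value .
      } ORDER BY ?compound ?feature"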
@@ -105,7 +140,7 @@ Timings:
    ds reading 1.690000 0.050000 1.740000 ( 2.362236)
-# Dataset Tests
+## Dataset Tests
Test runtimes changed as follows:
@@ -117,12 +152,12 @@ dataset_large.rb 64.230s 25.231s 25.071
Table: Runtimes
-## Conclusions
+### Conclusions
In view of these results, I implemented the 'SPARQL' version.
-## Note
+### Note
A further modification, which avoided querying compounds separately, made runtimes much worse again.
The idea was to get the compound together with each data entry:
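
A sketch of that idea (illustrative only; names are assumptions):

    # Bind the compound in the same query as the data entry and derive the
    # compound list from the solutions, instead of querying it separately
    @compounds = solutions.collect { |s| s[:compound].to_s }.uniq
                          .collect { |uri| OpenTox::Compound.new uri }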