author Andreas Maunz <andreas@maunz.de> 2012-11-19 13:33:59 +0100
committer Andreas Maunz <andreas@maunz.de> 2012-11-19 13:33:59 +0100
commit 1da5f79c8d88cb3390b4e53c3c7b9adfca205c68
tree 486bba3485209f2e296d5bce45bb953d423b0629
parent 034d72f95bfc3f599fb8cf9cd2d8fab801ff590d
Added feature creation foo
-rw-r--r-- _posts/2012-05-02-support-calculation-in-fminer.md | 2
-rw-r--r-- _posts/2012-05-07-release-with-git-flow.md | 2
-rw-r--r-- _posts/2012-07-30-how-to-run-4store-without-internet-connection.md | 2
-rw-r--r-- _posts/2012-11-19-clever-feature-creation-keeps-your-database-small.md | 58
4 files changed, 61 insertions, 3 deletions
diff --git a/_posts/2012-05-02-support-calculation-in-fminer.md b/_posts/2012-05-02-support-calculation-in-fminer.md
index cdea00d..b987eb4 100644
--- a/_posts/2012-05-02-support-calculation-in-fminer.md
+++ b/_posts/2012-05-02-support-calculation-in-fminer.md
@@ -3,7 +3,7 @@ layout: post
title: "Support Calculation in Fminer"
description: "Fminer algorithms BBRC and LAST-PM can now be used for automatic support calculation. This post shows how to use it."
category: algorithm
-tags: [Fminer, Feature generation, BBRC, LAST-PM, Tutorials]
+tags: [Fminer, Feature Generation, BBRC, LAST-PM, Tutorials]
---
{% include JB/setup %}
diff --git a/_posts/2012-05-07-release-with-git-flow.md b/_posts/2012-05-07-release-with-git-flow.md
index 1dbfaa0..4e3022d 100644
--- a/_posts/2012-05-07-release-with-git-flow.md
+++ b/_posts/2012-05-07-release-with-git-flow.md
@@ -3,7 +3,7 @@ layout: post
title: "Release Development with Git Flow"
description: "Release Development with Git Flow"
category: development
-tags: [git, deployment]
+tags: [GIT, deployment]
---
{% include JB/setup %}
diff --git a/_posts/2012-07-30-how-to-run-4store-without-internet-connection.md b/_posts/2012-07-30-how-to-run-4store-without-internet-connection.md
index 3e28c89..614c033 100644
--- a/_posts/2012-07-30-how-to-run-4store-without-internet-connection.md
+++ b/_posts/2012-07-30-how-to-run-4store-without-internet-connection.md
@@ -3,7 +3,7 @@ layout: post
title: "How to run 4store without internet connection"
description: ""
category: setup
-tags: [4store, local installation]
+tags: [4store, installation]
---
{% include JB/setup %}
diff --git a/_posts/2012-11-19-clever-feature-creation-keeps-your-database-small.md b/_posts/2012-11-19-clever-feature-creation-keeps-your-database-small.md
new file mode 100644
index 0000000..849e636
--- /dev/null
+++ b/_posts/2012-11-19-clever-feature-creation-keeps-your-database-small.md
@@ -0,0 +1,58 @@
+---
+layout: post
+title: "Clever Feature Creation Keeps Your Database Small"
+description: ""
+category: development
+tags: [development, dataset, Feature Generation, Fminer]
+---
+{% include JB/setup %}
+
+**Avoiding feature generation from scratch can save a lot of memory and potentially speed up the whole framework**
+
+# The Problem
+
+Consider an `OpenTox::Dataset` object in its fully loaded (populated) state, containing metadata, features, parameters, compounds, and data entries. The 'heavy' parts, which hold most of the data, are the data entries and compounds. For this reason, our implementation loads them directly via SPARQL queries. The 'light' parts are metadata, parameters, and features, as they do not contain tabulated data. The light parts can even be filtered quickly through RDF queries.
+
+However, even SPARQL queries take longer when data is stored redundantly at the 4Store service. This redundancy is often unnecessary, since most of the data is usually already stored: datasets, or parts of datasets, are often uploaded many times to the same service. The catch is that creating features by
+
+    f = OpenTox::Feature.new
+    # ... do something with f ...
+    f.put
+
+programmatically creates a new feature every time. The 4Store service knows nothing about the semantics, so it is up to the programmer to tell the service that
+
+    f1 = OpenTox::Feature.new
+    f1.title = "Hamster Carcinogenicity"
+    f1.put
+
+and
+
+    f2 = OpenTox::Feature.new
+    f2.title = "Hamster Carcinogenicity"
+    f2.put
+
+are actually the same feature. The programmer should *do this* instead:
+
+    f1 = OpenTox::Feature.new
+    f1.title = "Hamster Carcinogenicity"
+    f2 = f1 # f2 refers to the same object as f1
+    f1.put
+
+This results in a completely non-redundant setting, because compounds describing the same structure are already stored non-redundantly. Two `OpenTox::Dataset`s describing the same data are then merely pointers to the same data, and each feature is present only once at the 4Store backend.
+
+
+# A Solution
+
+The `opentox-client` library contains a method `OpenTox::Feature.find_by_title` that creates features non-redundantly. It accepts two arguments: a string with the feature title and a metadata hash. It searches the 4Store service for features with an identical title, parses the resulting list, and compares the metadata. If one of the features has identical metadata, its URI is returned. Otherwise, a new feature is created and its URI is returned.
+
+The method is used like this:
+
+    # Find (or create) a feature by title; `description` is assumed to be defined earlier
+    title = "Foo"
+    metadata = {
+      RDF.type => [RDF::OT.Feature, RDF::OT.NumericFeature],
+      RDF::DC.description => description
+    }
+    feature = OpenTox::Feature.find_by_title(title, metadata)
+
+In practice, not many features are created per dataset, so the method works quite efficiently. It has very nice effects: for example, when two datasets with identical features are uploaded one after the other, the 4Store backend now knows that the features are equal and can potentially eliminate duplicate data.