author Andreas Maunz <andreas@maunz.de> 2012-02-09 15:40:00 +0100
committer Andreas Maunz <andreas@maunz.de> 2012-02-09 15:40:00 +0100
commit df96ba4183b341393ac00ee5e444c99411d8123d
tree 144ed6918f33046d503d6e66afaee1ec96f4d63a
parent 0fa509eeab52c336552a38db1a3f7195f840a1f2
parent 49eb76b0f2c2037e4a1e664752271b7e4a955f72
Merge branch 'pc_new_1' into development
-rw-r--r-- README.md 79
-rw-r--r-- application.rb 3
-rw-r--r-- feature_selection.rb 85
m--------- last-utils 0
-rw-r--r-- lazar.rb 194
m--------- libfminer 0
6 files changed, 228 insertions, 133 deletions
diff --git a/README.md b/README.md
index 8383cb6..344f747 100644
--- a/README.md
+++ b/README.md
@@ -9,44 +9,58 @@ OpenTox Algorithm
REST operations
---------------
- Get a list of all algorithms GET / - URIs of algorithms 200
- Get a representation of the GET /fminer/ - fminer representation 200,404
+ Get a list of all algorithms GET / - URIs of algorithms 200
+ Get a representation of the GET /fminer/ - fminer representation 200,404
fminer algorithms
- Get a representation of the GET /fminer/bbrc - bbrc representation 200,404
+ Get a representation of the GET /fminer/bbrc - bbrc representation 200,404
bbrc algorithm
- Get a representation of the GET /fminer/last - last representation 200,404
+ Get a representation of the GET /fminer/last - last representation 200,404
last algorithm
- Get a representation of the GET /lazar - lazar representation 200,404
+ Get a representation of the GET /lazar - lazar representation 200,404
lazar algorithm
- Create bbrc features POST /fminer/bbrc dataset_uri, URI for feature dataset 200,400,404,500
- feature_uri,
- [min_frequency=5 per-mil],
- [feature_type=trees],
- [backbone=true],
- [min_chisq_significance=0.95],
- [nr_hits=false]
- Create last features POST /fminer/last dataset_uri, URI for feature dataset 200,400,404,500
- feature_uri,
- [min_frequency=8 %],
- [feature_type=trees],
- [nr_hits=false]
- Create lazar model POST /lazar dataset_uri, URI for lazar model 200,400,404,500
- prediction_feature,
- feature_generation_uri
- prediction_algorithm
- [local_svm_kernel=weighted_tanimoto]
- [min_sim=0.3]
- [nr_hits=false]
- [conf_stdev=false]
+ Get a representation of the GET /feature_selection - feature selection representation 200,404
+ feature selection algorithms
+ Get a representation of the GET /feature_selection/rfe - rfe representation 200,404
+ rfe algorithm
+
+
+ Create bbrc features POST /fminer/bbrc dataset_uri, URI for feature dataset 200,400,404,500
+ feature_uri,
+ [min_frequency=5 per-mil],
+ [feature_type=trees],
+ [backbone=true],
+ [min_chisq_significance=0.95],
+ [nr_hits=false]
+ Create last features POST /fminer/last dataset_uri, URI for feature dataset 200,400,404,500
+ feature_uri,
+ [min_frequency=8 %],
+ [feature_type=trees],
+ [nr_hits=false]
+ Create lazar model POST /lazar dataset_uri, URI for lazar model 200,400,404,500
+ [prediction_feature],
+ [feature_generation_uri],
+ [prediction_algorithm],
+ [feature_dataset_uri],
+ [pc_type=null],
+ [nr_hits=false (class. using wt. maj. vote), true (else)],
+ [min_sim=0.3 (nominal), 0.4 (numeric features)],
+ [min_train_performance=0.1]
+
+ Create selected features POST /feature_selection/rfe dataset_uri, URI for dataset 200,400,404,500
+ prediction_feature_uri,
+ feature_dataset_uri,
+ [del_missing=false]
+
Synopsis
--------
-- prediction\_algorithm: One of "weighted\_majority\_vote" (default for classification), "local\_svm\_classification", "local\_svm\_regression (default for regression)", "local\_mlr\_prop". "weighted\_majority\_vote" is not applicable for regression. "local\_mlr\_prop" is not applicable for classification.
-- local\_svm\_kernel: One of "weighted\_tanimoto", "propositionalized". local\_svm\_kernel is not appplicable when prediction\_algorithm="weighted\_majority\_vote".
-- min_sim: The minimum similarity threshold for neighbors. Numeric value in [0,1].
-- nr_hits: Whether for instantiated models (local\_svm\_kernel = "propositionalized" for prediction_algorithm="local\_svm\_classification" or "local\_svm\_regression", or for prediction_algorithm="local\_mlr\_prop") nominal features should be instantiated with their occurrence counts in the instances. For non-instantiated models (local\_svm\_kernel = "weighted\_tanimoto" for prediction_algorithm="local\_svm\_classification" or "local\_svm\_regression", or for prediction_algorithm="weighted\_majority\_vote") the neighbor-to-neighbor and neighbor-to-query similarity also integrates these counts, when the parameter is set. One of "true", "false".
-- conf_stdev: Whether confidence integrates distribution of neighbor activity values. When "true", the exp(-1.0*(standard deviation of neighbor activities)) is multiplied on the similarity. One of "true", "false".
+- prediction\_algorithm: One of "weighted\_majority\_vote" (default for classification), "local\_svm\_classification", "local\_svm\_regression" (default for regression). "weighted\_majority\_vote" is not applicable for regression.
+- pc_type: Mandatory when feature\_dataset\_uri points to a dataset of numeric features. One of [geometrical, topological, electronic, constitutional, hybrid, cpsa].
+- nr_hits: Whether nominal features should be instantiated with their occurrence counts in the instances. One of "true", "false".
+- min_sim: The minimum similarity threshold for neighbors. Numeric value in [0,1].
+- min_train_performance: The minimum training performance for "local\_svm\_classification" (Accuracy) and "local\_svm\_regression" (R-squared). Numeric value in [0,1].
+- del_missing: Whether instances with missing values are deleted. One of "true", "false".
See http://www.maunz.de/wordpress/opentox/2011/lazar-models-and-how-to-trigger-them for a graphical overview.
@@ -108,4 +122,9 @@ Creates a standard Lazar model.
[API documentation](http://rdoc.info/github/opentox/algorithm)
--------------------------------------------------------------
+* * *
+
+### Create a feature dataset of selected features
+ curl -X POST -d dataset_uri={dataset_uri} -d prediction_feature_uri={prediction_feature_uri} -d feature_dataset_uri={feature_dataset_uri} -d del_missing=true http://webservices.in-silico.ch/test/algorithm/feature_selection/rfe
+
Copyright (c) 2009-2011 Christoph Helma, Martin Guetlein, Micha Rautenberg, Andreas Maunz, David Vorgrimmler, Denis Gebele. See LICENSE for details.
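The curl call added to the README above posts four form fields to `/feature_selection/rfe`. As a minimal sketch of the same request body in Ruby, the helper below encodes those fields with the standard library. The name `rfe_form_body` and the example.org URIs are hypothetical and are not part of the repository; only the field names come from the README.

```ruby
require 'uri'

# Hypothetical helper (not in the repository): encodes the form fields
# that the README's curl example sends to POST /feature_selection/rfe.
def rfe_form_body(dataset_uri, prediction_feature_uri, feature_dataset_uri, del_missing = false)
  URI.encode_www_form(
    'dataset_uri'            => dataset_uri,
    'prediction_feature_uri' => prediction_feature_uri,
    'feature_dataset_uri'    => feature_dataset_uri,
    # The service compares this field against the string "true".
    'del_missing'            => del_missing.to_s
  )
end

# Placeholder URIs, for illustration only.
body = rfe_form_body(
  'http://example.org/dataset/1',
  'http://example.org/dataset/1/feature/LC50',
  'http://example.org/dataset/2',
  true
)
puts body
```

The sketch stops at building the body so that it runs offline; sending it is left to whatever HTTP client is already in use.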
diff --git a/application.rb b/application.rb
index b62f6f5..f5b331f 100644
--- a/application.rb
+++ b/application.rb
@@ -11,6 +11,7 @@ require 'opentox-ruby'
require 'openbabel.rb'
require 'fminer.rb'
require 'lazar.rb'
+require 'feature_selection.rb'
set :lock, true
@@ -22,7 +23,7 @@ end
#
# @return [text/uri-list] algorithm URIs
get '/?' do
- list = [ url_for('/lazar', :full), url_for('/fminer/bbrc', :full), url_for('/fminer/last', :full) ].join("\n") + "\n"
+ list = [ url_for('/lazar', :full), url_for('/fminer/bbrc', :full), url_for('/fminer/last', :full), url_for('/feature_selection/rfe', :full) ].join("\n") + "\n"
case request.env['HTTP_ACCEPT']
when /text\/html/
content_type "text/html"
diff --git a/feature_selection.rb b/feature_selection.rb
new file mode 100644
index 0000000..d375a0e
--- /dev/null
+++ b/feature_selection.rb
@@ -0,0 +1,85 @@
+# Get list of feature_selection algorithms
+#
+# @return [text/uri-list] URIs of feature_selection algorithms
+get '/feature_selection/?' do
+ list = [ url_for('/feature_selection/rfe', :full) ].join("\n") + "\n"
+ case request.env['HTTP_ACCEPT']
+ when /text\/html/
+ content_type "text/html"
+ OpenTox.text_to_html list
+ else
+ content_type 'text/uri-list'
+ list
+ end
+end
+
+# Get RDF/XML representation of feature_selection rfe algorithm
+# @return [application/rdf+xml] OWL-DL representation of feature_selection rfe algorithm
+get "/feature_selection/rfe/?" do
+ algorithm = OpenTox::Algorithm::Generic.new(url_for('/feature_selection/rfe',:full))
+ algorithm.metadata = {
+ DC.title => 'recursive feature elimination',
+ DC.creator => "andreas@maunz.de, helma@in-silico.ch",
+ DC.contributor => "vorgrimmlerdavid@gmx.de",
+ BO.instanceOf => "http://opentox.org/ontology/ist-algorithms.owl#feature_selection_rfe",
+ RDF.type => [OT.Algorithm,OTA.PatternMiningSupervised],
+ OT.parameters => [
+ { DC.description => "Dataset URI", OT.paramScope => "mandatory", DC.title => "dataset_uri" },
+ { DC.description => "Prediction Feature URI", OT.paramScope => "mandatory", DC.title => "prediction_feature_uri" },
+ { DC.description => "Feature Dataset URI", OT.paramScope => "mandatory", DC.title => "feature_dataset_uri" },
+ { DC.description => "Delete Instances with missing values", OT.paramScope => "optional", DC.title => "del_missing" }
+ ]
+ }
+ case request.env['HTTP_ACCEPT']
+ when /text\/html/
+ content_type "text/html"
+ OpenTox.text_to_html algorithm.to_yaml
+ when /application\/x-yaml/
+ content_type "application/x-yaml"
+ algorithm.to_yaml
+ else
+ response['Content-Type'] = 'application/rdf+xml'
+ algorithm.to_rdfxml
+ end
+end
+
+# Run rfe algorithm on dataset
+#
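+# @param [String] dataset_uri URI of the training dataset
+# @param [String] prediction_feature_uri URI of the prediction feature
+# @param [String] feature_dataset_uri URI of the feature dataset
+# @param [optional,String] del_missing "true" to delete instances with missing values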
+# @return [text/uri-list] Task URI
+post '/feature_selection/rfe/?' do
+
+ raise OpenTox::NotFoundError.new "Please submit a dataset_uri." unless params[:dataset_uri]
+ raise OpenTox::NotFoundError.new "Please submit a prediction_feature_uri." unless params[:prediction_feature_uri]
+ raise OpenTox::NotFoundError.new "Please submit a feature_dataset_uri." unless params[:feature_dataset_uri]
+
+ ds_csv=OpenTox::RestClientWrapper.get( params[:dataset_uri], {:accept => "text/csv"} )
+ tf_ds=Tempfile.open(['rfe_', '.csv'])
+ tf_ds.puts(ds_csv)
+ tf_ds.flush()
+
+ prediction_feature = params[:prediction_feature_uri].split('/').last # get col name
+
+ fds_csv=OpenTox::RestClientWrapper.get( params[:feature_dataset_uri], {:accept => "text/csv"})
+ tf_fds=Tempfile.open(['rfe_', '.csv'])
+ tf_fds.puts(fds_csv)
+ tf_fds.flush()
+
+ del_missing = params[:del_missing] == "true"
+
+ task = OpenTox::Task.create("Recursive Feature Elimination", url_for('/feature_selection',:full)) do |task|
+ r_result_file = OpenTox::Algorithm::FeatureSelection.rfe( { :ds_csv_file => tf_ds.path, :prediction_feature => prediction_feature, :fds_csv_file => tf_fds.path, :del_missing => del_missing } )
+ r_result_uri = OpenTox::Dataset.create_from_csv_file(r_result_file).uri
+ begin
+ tf_ds.close!; tf_fds.close!
+ File.unlink(r_result_file)
+ rescue
+ end
+ r_result_uri
+ end
+ response['Content-Type'] = 'text/uri-list'
+ raise OpenTox::ServiceUnavailableError.new(task.uri+"\n") if task.status == "Cancelled"
+ halt 202,task.uri.to_s+"\n"
+end
+
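The POST handler in feature_selection.rb checks three mandatory parameters, takes the last path segment of `prediction_feature_uri` as the CSV column name, and treats `del_missing` as true only for the literal string "true". A rough stand-alone sketch of that parameter handling follows. `parse_rfe_params` and `REQUIRED_RFE_PARAMS` are hypothetical names, and `ArgumentError` stands in for the OpenTox error classes so the example runs without opentox-ruby.

```ruby
# Hypothetical names; the handler itself raises OpenTox::NotFoundError.
REQUIRED_RFE_PARAMS = [:dataset_uri, :prediction_feature_uri, :feature_dataset_uri].freeze

def parse_rfe_params(params)
  missing = REQUIRED_RFE_PARAMS.reject { |key| params[key] }
  raise ArgumentError, "Please submit: #{missing.join(', ')}" unless missing.empty?

  {
    # The last path segment of the feature URI serves as the column name.
    :prediction_feature => params[:prediction_feature_uri].split('/').last,
    # Only the exact string "true" switches deletion on.
    :del_missing => params[:del_missing] == 'true'
  }
end

parsed = parse_rfe_params({
  :dataset_uri            => 'http://example.org/dataset/1',
  :prediction_feature_uri => 'http://example.org/dataset/1/feature/LC50',
  :feature_dataset_uri    => 'http://example.org/dataset/2',
  :del_missing            => 'true'
})
puts parsed.inspect
```

Collecting all missing names in one message is a choice of this sketch; the handler reports the first missing parameter only.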
diff --git a/last-utils b/last-utils
-Subproject 8c02f7e71450cac6d8c5d7d34ecb620046b4ea4
+Subproject cf0238477127e54509b6ab8b5c38f50dd6ffce0
diff --git a/lazar.rb b/lazar.rb
index 9aac0d8..81929c6 100644
--- a/lazar.rb
+++ b/lazar.rb
@@ -12,9 +12,9 @@ get '/lazar/?' do
OT.parameters => [
{ DC.description => "Dataset URI with the dependent variable", OT.paramScope => "mandatory", DC.title => "dataset_uri" },
{ DC.description => "Feature URI for dependent variable. Optional for datasets with only a single feature.", OT.paramScope => "optional", DC.title => "prediction_feature" },
- { DC.description => "URI of feature genration service. Default: #{@@feature_generation_default}", OT.paramScope => "optional", DC.title => "feature_generation_uri" },
+ { DC.description => "URI of feature generation service. Default: #{@@feature_generation_default}", OT.paramScope => "optional", DC.title => "feature_generation_uri" },
{ DC.description => "URI of feature dataset. If this parameter is set no feature generation algorithm will be called", OT.paramScope => "optional", DC.title => "feature_dataset_uri" },
- { DC.description => "Further parameters for the feaature generation service", OT.paramScope => "optional" }
+ { DC.description => "Further parameters for the feature generation service", OT.paramScope => "optional" }
]
}
case request.env['HTTP_ACCEPT']
@@ -45,45 +45,74 @@ post '/lazar/?' do
task = OpenTox::Task.create("Create lazar model",url_for('/lazar',:full)) do |task|
+
+ # # # Dataset present, prediction feature present?
raise OpenTox::NotFoundError.new "Dataset #{dataset_uri} not found." unless training_activities = OpenTox::Dataset.new(dataset_uri)
training_activities.load_all(@subjectid)
+ # Prediction Feature
prediction_feature = OpenTox::Feature.find(params[:prediction_feature],@subjectid)
unless params[:prediction_feature] # try to read prediction_feature from dataset
raise OpenTox::NotFoundError.new "#{training_activities.features.size} features in dataset #{dataset_uri}. Please provide a prediction_feature parameter." unless training_activities.features.size == 1
prediction_feature = OpenTox::Feature.find(training_activities.features.keys.first,@subjectid)
params[:prediction_feature] = prediction_feature.uri # pass to feature mining service
end
+ raise OpenTox::NotFoundError.new "No feature #{prediction_feature.uri} in dataset #{params[:dataset_uri]}. (features: "+ training_activities.features.inspect+")" unless training_activities.features and training_activities.features.include?(prediction_feature.uri)
- feature_generation_uri = @@feature_generation_default unless feature_generation_uri = params[:feature_generation_uri]
-
- raise OpenTox::NotFoundError.new "No feature #{prediction_feature.uri} in dataset #{params[:dataset_uri]}. (features: "+
- training_activities.features.inspect+")" unless training_activities.features and training_activities.features.include?(prediction_feature.uri)
+ # Feature Generation URI
+ feature_generation_uri = @@feature_generation_default unless ( (feature_generation_uri = params[:feature_generation_uri]) || (params[:feature_dataset_uri]) )
+ # Create instance
lazar = OpenTox::Model::Lazar.new
- lazar.min_sim = params[:min_sim].to_f if params[:min_sim]
- # AM: Manage endpoint related variables.
+ # # # ENDPOINT RELATED
+
+ # Default Values
+ # Classification: Weighted Majority, Substructure.match
if prediction_feature.feature_type == "classification"
@training_classes = training_activities.accept_values(prediction_feature.uri).sort
@training_classes.each_with_index { |c,i|
lazar.value_map[i+1] = c # don't use '0': we must take the weighted mean later.
params[:value_map] = lazar.value_map
}
+ # Regression: SVM, Substructure.match_hits
elsif prediction_feature.feature_type == "regression"
- lazar.nr_hits = true
+ lazar.feature_calculation_algorithm = "Substructure.match_hits"
lazar.prediction_algorithm = "Neighbors.local_svm_regression"
end
- if params[:nr_hits] == "false" # if nr_hits is set optional to true/false it will return as String (but should be True/FalseClass)
- lazar.nr_hits = false
- elsif params[:nr_hits] == "true"
- lazar.nr_hits = true
+
+
+
+ # # # USER VALUES
+
+ # Min Sim
+ min_sim = params[:min_sim].to_f if params[:min_sim]
+ min_sim = 0.3 unless params[:min_sim]
+
+ # Algorithm
+ lazar.prediction_algorithm = "Neighbors.#{params[:prediction_algorithm]}" if params[:prediction_algorithm]
+
+ # Nr Hits
+ nr_hits = false
+ if params[:nr_hits] == "true" || lazar.prediction_algorithm.include?("local_svm")
+ lazar.feature_calculation_algorithm = "Substructure.match_hits"
+ nr_hits = true
end
- params[:nr_hits] = "true" if lazar.nr_hits
+ params[:nr_hits] = "true" if lazar.feature_calculation_algorithm == "Substructure.match_hits" # not sure if this line is needed
+
+ # Propositionalization
+ propositionalized = (lazar.prediction_algorithm=="Neighbors.weighted_majority_vote" ? false : true)
+
+ # PC type
+ pc_type = params[:pc_type] unless params[:pc_type].nil?
+
+ # Min train performance
+ min_train_performance = params[:min_train_performance].to_f if params[:min_train_performance]
+ min_train_performance = 0.1 unless params[:min_train_performance]
@@ -96,29 +125,22 @@ post '/lazar/?' do
- #
- # AM: features
- #
- #
- #
+ # # # Features
- # READ OR CREATE
+ # Read Features
if params[:feature_dataset_uri]
+ lazar.feature_calculation_algorithm = "Substructure.lookup"
feature_dataset_uri = params[:feature_dataset_uri]
training_features = OpenTox::Dataset.new(feature_dataset_uri)
- case training_features.feature_type(@subjectid)
- when "classification"
- lazar.similarity_algorithm = "Similarity.tanimoto"
- when "regression"
- lazar.similarity_algorithm = "Similarity.euclid"
+ if training_features.feature_type(@subjectid) == "regression"
+ lazar.similarity_algorithm = "Similarity.cosine"
+ min_sim = 0.4 unless params[:min_sim]
+ raise OpenTox::NotFoundError.new "No pc_type parameter." unless params[:pc_type]
end
- else # create features
+
+ # Create Features
+ else
params[:feature_generation_uri] = feature_generation_uri
- if feature_generation_uri.match(/fminer/)
- lazar.feature_calculation_algorithm = "Substructure.match"
- else
- raise OpenTox::NotFoundError.new "External feature generation services not yet supported"
- end
params[:subjectid] = @subjectid
prediction_feature = OpenTox::Feature.find params[:prediction_feature], @subjectid
if prediction_feature.feature_type == "regression" && feature_generation_uri.match(/fminer/)
@@ -130,57 +152,42 @@ post '/lazar/?' do
- # WRITE IN MODEL
+ # # # Write fingerprints
training_features.load_all(@subjectid)
raise OpenTox::NotFoundError.new "Dataset #{feature_dataset_uri} not found." if training_features.nil?
- # sorted features for index lookups
-
- lazar.features = training_features.features.sort if prediction_feature.feature_type == "regression" and lazar.feature_calculation_algorithm != "Substructure.match"
-
training_features.data_entries.each do |compound,entry|
- lazar.fingerprints[compound] = {} unless lazar.fingerprints[compound]
- entry.keys.each do |feature|
-
- # CASE 1: Substructure
- if lazar.feature_calculation_algorithm == "Substructure.match"
- if training_features.features[feature]
- smarts = training_features.features[feature][OT.smarts]
- #lazar.fingerprints[compound] << smarts
- if params[:nr_hits]
- lazar.fingerprints[compound][smarts] = entry[feature].flatten.first
- else
- lazar.fingerprints[compound][smarts] = 1
- end
- unless lazar.features.include? smarts
- lazar.features << smarts
- lazar.p_values[smarts] = training_features.features[feature][OT.pValue]
- lazar.effects[smarts] = training_features.features[feature][OT.effect]
+ if training_activities.data_entries.has_key? compound
+
+ lazar.fingerprints[compound] = {} unless lazar.fingerprints[compound]
+ entry.keys.each do |feature|
+
+ # CASE 1: Substructure
+ if (lazar.feature_calculation_algorithm == "Substructure.match") || (lazar.feature_calculation_algorithm == "Substructure.match_hits")
+ if training_features.features[feature]
+ smarts = training_features.features[feature][OT.smarts]
+ #lazar.fingerprints[compound] << smarts
+ if lazar.feature_calculation_algorithm == "Substructure.match_hits"
+ lazar.fingerprints[compound][smarts] = entry[feature].flatten.first * training_features.features[feature][OT.pValue]
+ else
+ lazar.fingerprints[compound][smarts] = 1 * training_features.features[feature][OT.pValue]
+ end
+ unless lazar.features.include? smarts
+ lazar.features << smarts
+ lazar.p_values[smarts] = training_features.features[feature][OT.pValue]
+ lazar.effects[smarts] = training_features.features[feature][OT.effect]
+ end
end
- end
- # CASE 2: Others
- else
- case training_features.feature_type(@subjectid)
- when "classification"
- # fingerprints are sets
- if entry[feature].flatten.size == 1
- #lazar.fingerprints[compound] << feature if entry[feature].flatten.first.to_s.match(TRUE_REGEXP)
- lazar.fingerprints[compound][feature] = entry[feature].flatten.first if entry[feature].flatten.first.to_s.match(TRUE_REGEXP)
- lazar.features << feature unless lazar.features.include? feature
- else
- LOGGER.warn "More than one entry (#{entry[feature].inspect}) for compound #{compound}, feature #{feature}"
- end
- when "regression"
- # fingerprints are arrays
- if entry[feature].flatten.size == 1
- lazar.fingerprints[compound][lazar.features.index(feature)] = entry[feature].flatten.first
- #lazar.fingerprints[compound][feature] = entry[feature].flatten.first
- else
- LOGGER.warn "More than one entry (#{entry[feature].inspect}) for compound #{compound}, feature #{feature}"
- end
+ # CASE 2: Others
+ elsif entry[feature].flatten.size == 1
+ lazar.fingerprints[compound][feature] = entry[feature].flatten.first
+ lazar.features << feature unless lazar.features.include? feature
+ else
+ LOGGER.warn "More than one entry (#{entry[feature].inspect}) for compound #{compound}, feature #{feature}"
end
end
+
end
end
task.progress 80
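In the substructure branch of the hunk above, each fingerprint entry is keyed by the feature's SMARTS string and weighted by its p-value: the occurrence count times the p-value under `Substructure.match_hits`, or 1 times the p-value under `Substructure.match`. The sketch below isolates that rule. `build_fingerprint` and the `:smarts` / `:p_value` keys are hypothetical stand-ins for the `OT.smarts` and `OT.pValue` metadata, so that it runs without opentox-ruby.

```ruby
# Hypothetical stand-alone version of the substructure branch.
# entry:    feature URI => nested array of values, as in data_entries
# features: feature URI => { :smarts => String, :p_value => Float }
def build_fingerprint(entry, features, calculation_algorithm)
  entry.each_with_object({}) do |(feature, values), fingerprint|
    meta = features[feature]
    next unless meta # features without metadata are skipped, as in the hunk

    count = calculation_algorithm == 'Substructure.match_hits' ? values.flatten.first : 1
    fingerprint[meta[:smarts]] = count * meta[:p_value]
  end
end

entry    = { 'f1' => [[3]], 'f2' => [[2]] }
features = { 'f1' => { :smarts => 'c1ccccc1', :p_value => 0.5 } }

puts build_fingerprint(entry, features, 'Substructure.match_hits').inspect
puts build_fingerprint(entry, features, 'Substructure.match').inspect
```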
@@ -188,28 +195,8 @@ post '/lazar/?' do
-
- #
- # AM: SETTINGS
- #
- #
- #
-
- # AM: allow settings override by user
- lazar.prediction_algorithm = "Neighbors.#{params[:prediction_algorithm]}" unless params[:prediction_algorithm].nil?
- lazar.prop_kernel = true if (params[:local_svm_kernel] == "propositionalized" || params[:prediction_algorithm] == "local_mlr_prop")
- lazar.conf_stdev = false
- lazar.conf_stdev = true if params[:conf_stdev] == "true"
-
-
-
-
-
- #
- # AM: Feed data
- #
- #
- #
+
+ # # # Activities
if prediction_feature.feature_type == "regression"
training_activities.data_entries.each do |compound,entry|
@@ -235,11 +222,7 @@ post '/lazar/?' do
- #
- # AM: Metadata
- #
- #
- #
+ # Metadata
lazar.metadata[DC.title] = "lazar model for #{URI.decode(File.basename(prediction_feature.uri))}"
lazar.metadata[OT.dependentVariables] = prediction_feature.uri
@@ -255,12 +238,19 @@ post '/lazar/?' do
lazar.metadata[OT.parameters] = [
{DC.title => "dataset_uri", OT.paramValue => dataset_uri},
{DC.title => "prediction_feature", OT.paramValue => prediction_feature.uri},
- {DC.title => "feature_generation_uri", OT.paramValue => feature_generation_uri}
+ {DC.title => "feature_generation_uri", OT.paramValue => feature_generation_uri},
+ {DC.title => "propositionalized", OT.paramValue => propositionalized},
+ {DC.title => "pc_type", OT.paramValue => pc_type},
+ {DC.title => "nr_hits", OT.paramValue => nr_hits},
+ {DC.title => "min_sim", OT.paramValue => min_sim},
+ {DC.title => "min_train_performance", OT.paramValue => min_train_performance},
+
]
model_uri = lazar.save(@subjectid)
LOGGER.info model_uri + " created #{Time.now}"
model_uri
+
end
response['Content-Type'] = 'text/uri-list'
 raise OpenTox::ServiceUnavailableError.new(task.uri+"\n") if task.status == "Cancelled"
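The lazar.rb changes above derive several settings from the endpoint type and the request parameters: the prediction algorithm, `nr_hits`, `propositionalized`, `min_sim` and `min_train_performance`. The sketch below restates those rules as one pure function, as a reading aid rather than the handler's actual code. `resolve_lazar_settings` and its `numeric_features` flag are hypothetical. The classification default of `Neighbors.weighted_majority_vote` is taken from the README rather than from this diff, so it is an assumption.

```ruby
# Hypothetical summary of the defaults set in post '/lazar/?'.
# feature_type:     "classification" or "regression" (prediction feature)
# numeric_features: true when a supplied feature dataset is of type regression
def resolve_lazar_settings(feature_type, params = {}, numeric_features = false)
  algorithm =
    if params[:prediction_algorithm]
      "Neighbors.#{params[:prediction_algorithm]}"
    elsif feature_type == 'regression'
      'Neighbors.local_svm_regression'
    else
      'Neighbors.weighted_majority_vote' # assumed default, per the README
    end

  # Numeric feature datasets require pc_type and use a higher similarity default.
  raise ArgumentError, 'No pc_type parameter.' if numeric_features && params[:pc_type].nil?
  default_min_sim = numeric_features ? 0.4 : 0.3

  {
    :prediction_algorithm  => algorithm,
    :nr_hits               => params[:nr_hits] == 'true' || algorithm.include?('local_svm'),
    :propositionalized     => algorithm != 'Neighbors.weighted_majority_vote',
    :min_sim               => params[:min_sim] ? params[:min_sim].to_f : default_min_sim,
    :min_train_performance => params[:min_train_performance] ? params[:min_train_performance].to_f : 0.1,
    :pc_type               => params[:pc_type]
  }
end

puts resolve_lazar_settings('classification').inspect
puts resolve_lazar_settings('regression', { :min_sim => '0.5' }).inspect
```

Keeping the rules in a side-effect-free function makes the interaction between `nr_hits` and the SVM algorithms easy to check in isolation.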
diff --git a/libfminer b/libfminer
-Subproject 17932e809c35c93374ed3d5fd19a313325c35b4
+Subproject f9e560dc0a7a5d5af439814ab5fa9ce027a025b