---
layout: post
title: "Parameter Selection with BBRC and LAST PM"
description: "This post gives some information on how parameters for BBRC and LAST could be selected, especially for the case of regression. Please see the [usage information](/algorithm/2012/05/02/bbrc-and-last-pm-usage) on how to apply the hints in your situation."
category: Usage
tags: [Algorithm, Fminer, Feature Generation, BBRC, LAST-PM, Tutorials]
---

{% include JB/setup %}

**This post gives some information on how parameters for BBRC and LAST could be selected, especially for the case of regression. Please see the [usage information](/algorithm/2012/05/02/bbrc-and-last-pm-usage) on how to apply the hints in your situation.**


# Some Background Information


The graph mining applications fminer/bbrc (BBRC) and fminer/last (LAST-PM) are **complete miners** in the sense that they do not restrict the result set of subgraphs _a priori_ to a specific number of patterns. Restricting the output set would contradict the principle of data-driven pattern generation, where no human intervention should be applied to the data mining process.
Instead of imposing hard cutoffs on the set size, the user is expected to bound the mining process with sensible constraints (this is where they are "allowed" to bring in expert knowledge). In the worst case, however, they have to apply a _trial-and-error_ strategy to find such constraints.
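
The trial-and-error strategy can be made a little more systematic. The following Python sketch is only an illustration: `count_patterns` is a hypothetical stand-in for one mining run that reports how many patterns were returned for a given constraint value, and the toy function used here is not a real fminer call.

```python
from typing import Callable, Iterable, Optional


def first_acceptable(
    candidates: Iterable[int],
    count_patterns: Callable[[int], int],
    max_patterns: int,
) -> Optional[int]:
    """Return the first candidate whose result set is small enough.

    `candidates` should be ordered from loosest to strictest constraint.
    `count_patterns` is a hypothetical callback that runs the miner with
    the given constraint value and returns the number of patterns found.
    """
    for value in candidates:
        if count_patterns(value) <= max_patterns:
            return value
    return None  # no candidate satisfied the limit


# Toy stand-in for a real mining run (assumption, for illustration only):
# the pattern count shrinks as the constraint value grows.
def toy_count(value: int) -> int:
    return 10_000 // value


chosen = first_acceptable([1, 2, 5, 10, 20, 50], toy_count, max_patterns=1_000)
print(chosen)
```

Trying the loosest constraint first keeps as many patterns as possible while still respecting the limit you can handle downstream.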

**BBRC** has been designed for (binary) class-correlated subgraph mining. In this domain, where each compound (graph) is assigned a true/false value, it can handle very large datasets. The algorithm is optimized for this setting and here the parameters have sensible default values.

**LAST-PM** has been designed for the same (binary) class-correlated subgraph mining setting as BBRC. Since the generated patterns describe hidden motifs in the graph database, the process takes more time to complete. LAST-PM also has sensible default values, although they may occasionally need to be adapted.


# The Regression Case


When the target variable ([referred to as _prediction-feature_](/algorithm/2012/05/02/bbrc-and-last-pm-usage)) is numerical, some pruning techniques (_dynamic upper bound pruning_) that drastically reduce runtime for classification are not yet applicable; BBRC and LAST-PM disable them automatically for you. Moreover, the result set might be larger or smaller than in classification.

Regression has not been experimentally validated yet, so support for it is experimental. Meanwhile, here are some hints that alleviate possible problems in this setting. They refer to each other, but each is also worth trying on its own!


## Hint #1: Revert to ordinary subgraph mining


Try disabling BBRC mining by setting

    backbone=false

This will revert to "ordinary" correlated and frequent subgraph mining by disabling the super-sparse selection of BBRC. You will receive a much larger set of subgraphs (all correlated and frequent subgraphs).


## Hint #2: Use paths instead of trees


If Hint #1 resulted in too large a set of descriptors, you might try setting

    feature_type=paths

(the default is 'trees'). That way, only linear fragments are mined. I have found paths to be very expressive in regression models as well.


## Hint #3: Increase minimum frequency


By default, every mined fragment must occur in at least 5 per mil (BBRC) or 8 percent (LAST-PM) of the molecules in your database. You can adjust these thresholds: a higher minimum frequency places stricter demands on the features, so fewer will be found. It also speeds up the mining process. The respective parameter is called

    min_frequency=x
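
To get a feeling for the default thresholds, it can help to translate them into absolute molecule counts. The sketch below is plain arithmetic in Python. The function name, the rounding-up convention, and the example dataset sizes are assumptions of this example; the miner's own rounding may differ.

```python
def min_occurrences(n_molecules: int, per_mil: int) -> int:
    """Smallest number of molecules a fragment must occur in.

    `per_mil` is the relative threshold in parts per thousand.
    Integer ceiling division avoids floating-point surprises;
    rounding up is an assumption of this sketch.
    """
    return -(-n_molecules * per_mil // 1000)


BBRC_DEFAULT_PER_MIL = 5      # 5 per mil, as stated above
LAST_PM_DEFAULT_PER_MIL = 80  # 8 percent expressed in per mil

for n in (250, 1000):  # made-up dataset sizes
    print(
        n,
        min_occurrences(n, BBRC_DEFAULT_PER_MIL),
        min_occurrences(n, LAST_PM_DEFAULT_PER_MIL),
    )
```

For a database of 1000 molecules, this gives 5 molecules for BBRC and 80 for LAST-PM, which shows how much stricter the LAST-PM default is.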