I have a data set with about 30,000 training cases and 103 variables. I've trained an SVM (using the e1071 package) for a binary classifier {0,1}. The accuracy isn't great.

I used a grid search over the C and G parameters with an RBF kernel to find the best settings.

I remember that for least squares, R has a nice stepwise function that will try combining subsets of variables to find the optimal result. Clearly, this doesn't exist for SVMs as a built-in function.

As an experiment, I simply grabbed the first 50 variables and repeated the training/grid search procedure. The results were significantly better. Since the data are VERY noisy, my guess is that eliminating some of the variables removed some of that noise, which led to better results.

With a grid of 100 parameter settings (10 for C, 10 for G) and 106 variables, trying every combination would be prohibitively time consuming.

Can anyone suggest an approach to seek the ideal subset of variables for my SVM classifier?

Thanks!
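[A minimal sketch of the grid search described above, using e1071's tune.svm(). The data frame `dat`, the factor response `y`, and the cost/gamma ranges are illustrative placeholders, not the poster's actual setup.]

library(e1071)

## dat: data.frame with the predictors and a factor response `y` (levels "0"/"1")
dat$y <- factor(dat$y)

## 10 x 10 grid over cost (C) and gamma (G) with an RBF kernel;
## tune() cross-validates each combination (10-fold by default)
tuned <- tune.svm(y ~ ., data = dat,
                  kernel = "radial",
                  cost   = 2^(-2:7),
                  gamma  = 2^(-8:1))

summary(tuned)                 # CV error for every (cost, gamma) pair
best.fit <- tuned$best.model   # refitted model at the best grid point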
Hi,

On Fri, Jan 7, 2011 at 2:10 AM, Noah Silverman <noah at smartmediacorp.com> wrote:

> I have a data set with about 30,000 training cases and 103 variables.
> I've trained an SVM (using the e1071 package) for a binary classifier
> {0,1}. The accuracy isn't great.
[...]
> Can anyone suggest an approach to seek the ideal subset of variables for
> my SVM classifier?

Sounds like a job for the types of approaches found in the penalizedSVM package:

http://cran.r-project.org/web/packages/penalizedSVM/index.html

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
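[A rough sketch of what that might look like, assuming penalizedSVM's svm.fs() entry point with a SCAD penalty. The object names, the -1/+1 label coding, and the lambda grid are illustrative; check ?svm.fs for the exact arguments in your version of the package.]

library(penalizedSVM)

## x: matrix of predictors; y: labels recoded to -1 / +1 as svm.fs expects
x <- as.matrix(dat[, setdiff(names(dat), "y")])
y <- ifelse(dat$y == 1, 1, -1)

fit <- svm.fs(x, y,
              fs.method   = "scad",       # SCAD penalty shrinks the weights of
              grid.search = "discrete",   # uninformative variables to zero
              lambda1.set = 2^(-5:2),     # illustrative penalty grid
              seed        = 123)

str(fit)   # inspect the fitted object for the selected variables and their weights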
On 06/01/11 23:10:59, Noah Silverman wrote:

> I have a data set with about 30,000 training cases and 103 variables.
> I've trained an SVM (using the e1071 package) for a binary classifier
> {0,1}. The accuracy isn't great. I used a grid search over the C and G
> parameters with an RBF kernel to find the best settings.
[...]
> Can anyone suggest an approach to seek the ideal subset of variables for
> my SVM classifier?

The standard feature selection stuff (backward/forward selection etc.) is probably ruled out by the time it takes to compute all the sets and subsets. What you could try is the following (an R sketch follows after this message):

1. Set up a cross-validation split: divide your data set into a training and a testing set (ratio 0.9 / 0.1 or so).

2. Train your SVM on the training set (try conservative parameters first).

3. Have the trained SVM classify the test set and compute the classification error.

4. Iterate over all variables:
   a) choose one variable and permute its values (only) in the test set;
   b) have the trained SVM (from step 2) classify this permuted test set and measure the classification error;
   c) repeat a) and b) a (high) number of times to get a stable estimate;
   d) go to the next variable.

5. You can get an impression of a variable's importance by comparing the errors on the test set with that variable permuted against the error on the unpermuted test set. If permuting one variable drastically increases the classification error, that variable is probably important.

6. Repeat the cross-validation / random sampling a number of times so the estimates are reliable.

This is more of an ad-hoc approach and there are some pitfalls, but the idea is easily explained and it carries over to any other regression model fitted with cross-validation. The computational burden in an SVM is assumed to lie in the training rather than the prediction step, and you only need a relatively low number of training runs (step 6) here.

Regards,
Georg.
--
Research Assistant
Otto-von-Guericke-Universität Magdeburg
research at georgruss.de
http://research.georgruss.de
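[A minimal R sketch of the permutation procedure above, using e1071. It assumes a data frame `dat` with a factor response `y`; the split ratio, SVM parameters, and number of permutations are placeholders.]

library(e1071)

set.seed(1)
## dat: data.frame with factor response `y` and the numeric predictors
n     <- nrow(dat)
test  <- sample(n, size = round(0.1 * n))          # step 1: 90/10 split
train <- setdiff(seq_len(n), test)

## step 2: train with conservative parameters
fit <- svm(y ~ ., data = dat[train, ], kernel = "radial", cost = 1)

## step 3: baseline misclassification error on the untouched test set
err <- function(model, newdata) mean(predict(model, newdata) != newdata$y)
base.err <- err(fit, dat[test, ])

## step 4: permute one variable at a time in the test set and re-measure the error
vars  <- setdiff(names(dat), "y")
nperm <- 20                                        # step 4c: repeats per variable
imp <- sapply(vars, function(v) {
  mean(replicate(nperm, {
    perm <- dat[test, ]
    perm[[v]] <- sample(perm[[v]])                 # step 4a: permute values of v only
    err(fit, perm)                                 # step 4b: error on permuted set
  }))
})

## step 5: increase over the baseline error as a rough importance score
sort(imp - base.err, decreasing = TRUE)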