thr3ads.net - R help - [R] Nominal variables in SVM? [Aug 2009]


Noah Silverman

2009-Aug-12 18:53 UTC

[R] Nominal variables in SVM?

Hi,

The answers to my previous question about nominal variables have led me 
to a more important question.

What is the "best practice" way to feed nominal variables to an SVM?

For example:
color = ("red", "blue", "green")

I could translate that into an index so I wind up with
color= (1,2,3)

But my concern is that the SVM will now think that the values are 
numeric in "range" and not discrete conditions.

Another thought would be to create 3 binary variables from the single 
color variable, so I have:

red = (0,1)
blue = (0,1)
green = (0,1)

An example fed to the SVM would have one positive and two negative values 
to indicate the color value,
e.g. for a blue example:
red = 0, blue = 1, green = 0

Or, do any of the SVM packages intelligently handle this internally so 
that I don't have to mess with it?  If so, do I need to be concerned 
about a different "translation" of the data if the test data set isn't
exactly the same as the training set?
For example:
training data = color ("red", "blue", "green")
test data = color ("red", "green")

How would I be sure that the "red" and "green" examples get encoded the
same so that the SVM is accurate?

Thanks in advance!!

-N

Steve Lianoglou

2009-Aug-12 20:55 UTC


[R] Nominal variables in SVM?

Hi,

On Aug 12, 2009, at 2:53 PM, Noah Silverman wrote:
> Hi,
>
> The answers to my previous question about nominal variables have led
> me to a more important question.
>
> What is the "best practice" way to feed nominal variables to an SVM?
>
> For example:
> color = ("red", "blue", "green")
>
> I could translate that into an index so I wind up with
> color = (1,2,3)
>
> But my concern is that the SVM will now think that the values are
> numeric in "range" and not discrete conditions.
>
> Another thought would be to create 3 binary variables from the
> single color variable, so I have:
>
> red = (0,1)
> blue = (0,1)
> green = (0,1)
>
> An example fed to the SVM would have one positive and two negative
> values to indicate the color value,
> e.g. for a blue example:
> red = 0, blue = 1, green = 0

Do it this way.

So, if the features for your examples were color and height,  
your "feature matrix" for N examples would be N x 4:

0,1,0,15  # blue object, height 15
1,0,0,10  # red object, height 10
0,0,1,5 # green object, height 5
...
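
A minimal base-R sketch of the matrix described above, assuming a hypothetical data frame `objs` with a factor column `color` and a numeric column `height` (these names are illustrative and not from the thread). Dropping the intercept with `0 +` makes model.matrix() emit one indicator column per color level:

```r
# Hypothetical example data: three objects with a color and a height.
objs <- data.frame(
  color  = factor(c("blue", "red", "green"),
                  levels = c("red", "blue", "green")),
  height = c(15, 10, 5)
)

# "0 +" removes the intercept, so every level of color gets its own 0/1 column.
X <- model.matrix(~ 0 + color + height, data = objs)
print(X)
```

The columns come out in level order (red, blue, green) followed by height, so the first row is the blue object with height 15.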

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
   |  Memorial Sloan-Kettering Cancer Center
   |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

Noah Silverman

2009-Aug-12 20:59 UTC


[R] Nominal variables in SVM?

That makes sense.

If my data is already nominal, I need to "expand" a single column into 
several binary ones.

Is there an easy function to do this in R, or do I need to create 
something from scratch?  (If I have to create my own, any suggestions?)

Thanks!

-N


Bernd Bischl

2009-Aug-12 21:09 UTC


[R] Nominal variables in SVM?

Noah Silverman wrote:
> That makes sense.
>
> If my data is already nominal, I need to "expand" a single column into
> several binary ones.
>
> Is there an easy function to do this in R, or do I need to create 
> something from scratch?  (If I have to create my own, any suggestions?)
>
> Thanks!
>
> -N

Hi Noah,

Read up on the "contrasts" and the "model.matrix" functions.

That said, if you use the kernlab package for SVMs, factors get treated in 
this way by default; you just need to use the formula interface.
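
To see what these two functions do, here is a small base-R sketch (the `color` vector is made up for illustration). contrasts() shows the coding attached to a factor, and model.matrix() applies it. Note that with R's default treatment contrasts and an intercept, a three-level factor becomes two indicator columns, with the first level acting as the baseline:

```r
color <- factor(c("red", "blue", "green", "blue"),
                levels = c("red", "blue", "green"))

# Coding matrix: one row per level, one column per non-baseline level.
contrasts(color)

# Intercept plus indicator columns for "blue" and "green"; "red" is the baseline.
mm <- model.matrix(~ color)
print(mm)
```

If one column per level is wanted instead, removing the intercept (`~ 0 + color`) is one way to get it.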

Bernd

Erik Iverson

2009-Aug-12 21:17 UTC


[R] Nominal variables in SVM?

Noah, depending on what function you use, it might do this automatically for you
if you give the function a formula containing a factor.  Otherwise, see
?model.matrix.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Achim Zeileis

2009-Aug-12 21:21 UTC


[R] Nominal variables in SVM?

On Wed, 12 Aug 2009, Noah Silverman wrote:
> Hi,
>
> The answers to my previous question about nominal variables have led me
> to a more important question.
>
> What is the "best practice" way to feed nominal variables to an SVM?

As some of the previous posters have already indicated: The data structure 
for storing categorical (including nominal) variables in R is a "factor".

Your comment about "truly nominal" is wrong. A character variable is a
character variable, not necessarily a categorical variable. Categorical 
means that the answer falls into one of a finite number of known 
categories, known as "levels" in R's "factor" class.

If you start out from character information:

   x <- c("red", "red", "blue", "green", "blue")

You can turn it into a factor via:

   x <- factor(x, levels = c("red", "green", "blue"))

R now knows how to do certain things with such a variable, e.g., it produces 
useful summaries and knows how to deal with it in regression problems:

   model.matrix(~ x)

which seems to be what you asked for. Moreover, you don't need to call this 
yourself, as most regression functions in R will do that for you 
(including svm() in "e1071" or ksvm() in "kernlab", among others).

In short: Keep your categorical variables as "factor" columns in a 
"data.frame" and use the formula interface of svm()/ksvm() and you are fine.
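
On the train/test concern from the original question, here is a base-R sketch with made-up data frames `train` and `test` (hypothetical names) of why fixing the levels matters. If the test factor is built with the training levels, model.matrix() produces the same columns in the same order even when a color is absent from the test set:

```r
train <- data.frame(
  color = factor(c("red", "blue", "green", "blue"),
                 levels = c("red", "green", "blue"))
)

# The test data happens to contain no "blue"; reuse the training levels anyway.
test_raw <- c("red", "green")
test <- data.frame(color = factor(test_raw, levels = levels(train$color)))

X_train <- model.matrix(~ 0 + color, data = train)
X_test  <- model.matrix(~ 0 + color, data = test)

# Same columns, in the same order, for both sets ("blue" is all zeros in test).
identical(colnames(X_train), colnames(X_test))

# Without the shared levels, the "blue" column is dropped from the encoding.
n_cols_unaligned <- ncol(model.matrix(~ 0 + factor(test_raw)))
n_cols_unaligned
```

The same idea should carry over to the formula interface: building the test data frame with the training levels before calling predict() keeps the encoding consistent.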
Z

