Hi,

The answers to my previous question about nominal variables have led me to a more important question.

What is the "best practice" way to feed a nominal variable to an SVM?

For example:
color = ("red", "blue", "green")

I could translate that into an index so I wind up with
color = (1, 2, 3)

But my concern is that the SVM will then treat the values as numeric, lying on a range, rather than as discrete conditions.

Another thought would be to create 3 binary variables from the single color variable, so I have:

red = (0,1)
blue = (0,1)
green = (0,1)

An example fed to the SVM would have one positive and two negative values to indicate the color value, i.e. for a blue example:
red = 0, blue = 1, green = 0

Or do any of the SVM packages handle this intelligently internally, so that I don't have to mess with it? If so, do I need to be concerned about a different "translation" of the data if the test data set doesn't contain exactly the same values as the training set? For example:
training data = color ("red", "blue", "green")
test data = color ("red", "green")

How would I be sure that the "red" and "green" examples get encoded the same way, so that the SVM is accurate?

Thanks in advance!!

-N
Hi,

On Aug 12, 2009, at 2:53 PM, Noah Silverman wrote:

> Another thought would be to create 3 binary variables from the
> single color variable, so I have:
>
> red = (0,1)
> blue = (0,1)
> green = (0,1)
>
> An example fed to the SVM would have one positive and two negative
> values to indicate the color value:
> i.e. for a blue example:
> red = 0, blue = 1, green = 0

Do it this way.

So, if the features for your examples were color and height, your "feature matrix" for N examples would be N x 4:

0,1,0,15   # blue object, height 15
1,0,0,10   # red object, height 10
0,0,1,5    # green object, height 5
...

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
  |  Memorial Sloan-Kettering Cancer Center
  |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
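For reference, here is a minimal base-R sketch of building that N x 4 matrix by hand. The data values and the names color_levels, indicators and features are made up for illustration; the column order (red, blue, green, height) is chosen only to match the rows shown above.

```r
# Hypothetical example data: three objects with a color and a height.
color  <- c("blue", "red", "green")
height <- c(15, 10, 5)

# Fix the column order explicitly so every data set is encoded the same way.
color_levels <- c("red", "blue", "green")

# One 0/1 indicator column per color: compare every value with each level.
indicators <- sapply(color_levels, function(lv) as.integer(color == lv))

# Bind the indicator columns to the numeric feature: an N x 4 matrix
# with columns red, blue, green, height.
features <- cbind(indicators, height = height)
print(features)
```

Writing the levels out explicitly, rather than taking unique(color), is what keeps the column layout stable when some color is absent from a particular data set.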
That makes sense.

If my data is already nominal, I need to "expand" a single column into several binary ones.

Is there an easy function to do this in R, or do I need to create something from scratch? (If I have to create my own, any suggestions?)

Thanks!

-N

On 8/12/09 1:55 PM, Steve Lianoglou wrote:
> Do it this way.
>
> So, imagine if the features for your examples were color and height,
> your "feature matrix" for N examples would be N x 4
>
> 0,1,0,15   # blue object, height 15
> 1,0,0,10   # red object, height 10
> 0,0,1,5    # green object, height 5
> ...
Noah Silverman wrote:
> If my data is already nominal, I need to "expand" a single column into
> several binary ones.
>
> Is there an easy function to do this in R, or do I need to create
> something from scratch? (If I have to create my own, any suggestions?)

Hi Noah,

read up on the "contrasts" and "model.matrix" functions. That said, if you use the kernlab package for SVMs, factors are expanded this way by default; you just need to use the formula interface.

Bernd
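A small sketch of what those two functions give you, using only base R. The example vector is made up, and the output described in the comments assumes R's default treatment contrasts for unordered factors (options("contrasts") left unchanged).

```r
# Hypothetical nominal variable stored as a factor with explicit levels.
color <- factor(c("red", "red", "blue", "green", "blue"),
                levels = c("red", "green", "blue"))

# The contrast matrix R will use for this factor: k levels -> k - 1 columns,
# with the first level ("red") as the baseline under treatment contrasts.
print(contrasts(color))

# Default expansion: an intercept plus the k - 1 dummy columns
# "colorgreen" and "colorblue".
mm_default <- model.matrix(~ color)
print(colnames(mm_default))

# Dropping the intercept gives one 0/1 indicator column per level:
# "colorred", "colorgreen", "colorblue".
mm_full <- model.matrix(~ color - 1)
print(mm_full)
```

Whether you want the k - 1 coding or the full one-column-per-level coding is a modelling choice; the full coding is the one that matches the red/blue/green layout discussed earlier in the thread.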
Noah, depending on which function you use, it may do this for you automatically if you give it a formula containing a factor. Otherwise, see ?model.matrix.

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Noah Silverman
Sent: Wednesday, August 12, 2009 3:59 PM
Cc: r help
Subject: Re: [R] Nominal variables in SVM?

That makes sense.

If my data is already nominal, I need to "expand" a single column into several binary ones.

Is there an easy function to do this in R, or do I need to create something from scratch? (If I have to create my own, any suggestions?)

Thanks!

-N

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
On Wed, 12 Aug 2009, Noah Silverman wrote:

> Hi,
>
> The answers to my previous question about nominal variables has lead me
> to a more important question.
>
> What is the "best practice" way to feed nominal variable to an SVM.

As some of the previous posters have already indicated: The data structure for storing categorical (including nominal) variables in R is a "factor".

Your comment about "truly nominal" is wrong. A character variable is a character variable, not necessarily a categorical variable. Categorical means that the answer falls into one of a finite number of known categories, known as "levels" in R's "factor" class.

If you start out from character information:

  x <- c("red", "red", "blue", "green", "blue")

you can turn it into a factor via:

  x <- factor(x, levels = c("red", "green", "blue"))

R now knows how to do certain things with such a variable, e.g., it produces useful summaries and knows how to deal with it in regression problems:

  model.matrix(~ x)

which seems to be what you asked for. Moreover, you don't need to call this yourself: most regression functions in R will do it for you (including svm() in "e1071" and ksvm() in "kernlab", among others).

In short: Keep your categorical variables as "factor" columns in a "data.frame", use the formula interface of svm()/ksvm(), and you are fine.

Z
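On the original question about test data that lacks one of the training colors, the following is a hedged base-R sketch of why explicit factor levels matter. The data and the names color_levels, train, test, x_train, x_test and x_bad are invented for illustration, and model.matrix() stands in here for whatever expansion a given SVM function performs internally.

```r
# Declare the full set of levels once and reuse it for every data set.
color_levels <- c("red", "blue", "green")

# Hypothetical training data containing all three colors.
train <- data.frame(
  color = factor(c("red", "blue", "green", "blue"), levels = color_levels)
)

# Hypothetical test data in which "blue" never occurs.
test_raw <- c("red", "green")
test <- data.frame(color = factor(test_raw, levels = color_levels))

# Same levels in the same order give the same columns in the same order;
# the unused "blue" level simply becomes an all-zero column in x_test.
x_train <- model.matrix(~ color - 1, data = train)
x_test  <- model.matrix(~ color - 1, data = test)
print(colnames(x_train))
print(x_test)

# For contrast: letting factor() infer the levels from the test data alone
# yields only two columns, which no longer line up with the training matrix.
x_bad <- model.matrix(~ color - 1, data = data.frame(color = factor(test_raw)))
print(colnames(x_bad))
```

The same idea applies when the data frames are passed through a formula interface: as long as the factor column in the test data carries the training levels, "red" and "green" are encoded as they were during training.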