Thanks for all your help and I apologize for not being clear in the
beginning. I will try the "group lasso" packages. From the paper, it
seems like that is what I want to do. Thanks again!
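
For reference, a minimal sketch of how the column grouping that a group
lasso needs could be derived in base R. It assumes the "assign" attribute
returned by model.matrix is used as the group index; the toy data frame
and the names mm, x and grp are illustrative only, and no particular group
lasso package or its API is implied.

```r
# Small made-up data frame: two factors and one numeric predictor
X <- data.frame(
  X1 = factor(c("A", "B", "C", "D", "A", "B")),
  X2 = factor(c("E", "F", "G", "E", "F", "G")),
  X3 = c(51, 46, 50, 44, 43, 50)
)

mm <- model.matrix(~ X1 + X2 + X3, data = X)

# "assign" maps every column to the term it came from (0 = intercept)
grp <- attr(mm, "assign")[-1]
x <- mm[, -1, drop = FALSE]

# Dummy columns created from the same factor share one group id
print(split(colnames(x), grp))
```

The vector grp then has one entry per column of x, so all dummy columns
of one factor can be kept or dropped together.
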
On Tue, May 3, 2011 at 2:40 AM, Nick Sabbe <nick.sabbe at ugent.be> wrote:
> For performance reasons, I advise using the following function instead
> of model.matrix:
>
> factorsToDummyVariables<-function(dfr, betweenColAndLevel="")
> {
>     nc<-dim(dfr)[2]
>     firstRow<-dfr[1,]
>     coln<-colnames(dfr)
>     # build one block of columns per input column, then bind them
>     retval<-do.call(cbind, lapply(seq(nc), function(ci){
>         if(is.factor(firstRow[,ci]))
>         {
>             # the first level is the reference category: no dummy for it
>             lvls<-levels(firstRow[,ci])[-1]
>             stretchedcols<-sapply(lvls, function(lvl){
>                 rv<-dfr[,ci]==lvl
>                 mode(rv)<-"integer"
>                 return(rv)
>             })
>             # sapply returns a plain vector when dfr has a single row
>             if(!is.matrix(stretchedcols))
>                 stretchedcols<-matrix(stretchedcols, nrow=1)
>             colnames(stretchedcols)<-paste(coln[ci], lvls,
>                 sep=betweenColAndLevel)
>             return(stretchedcols)
>         }
>         else
>         {
>             # non-factor columns are passed through unchanged
>             curcol<-matrix(dfr[,ci], ncol=1)
>             colnames(curcol)<-coln[ci]
>             return(curcol)
>         }
>     }))
>     rownames(retval)<-rownames(dfr)
>     return(retval)
> }
>
>
> Just for comparison: here is my old version of the same function, using
> model.matrix:
>
> factorsToDummyVariables.old<-function(dfrPredictors,
>     form=paste("~",paste(colnames(dfrPredictors), collapse="+"), sep=""))
> {
>     #note: this function seems to operate quite slowly!
>     #Because it is used often, it may be worth improving its speed
>     dfrTmp<-model.frame(dfrPredictors, na.action=na.pass)
>     frm<-as.formula(form)
>     mm<-model.matrix(frm, data=dfrTmp)
>     retval<-as.matrix(mm)[,-1]
>
>     return(retval)
> }
>
> In a test case with a reasonably big dataset, I compared the speeds:
>
> #system.time(tmp.fd.convds.full.man<-manualFactorsToDummyVariables(ds))
> ##    user  system elapsed
> ##    9.44    0.00    9.48
> #system.time(tmp.fd.convds.full<-factorsToDummyVariables.old(ds))
> ##    user  system elapsed
> ##   15.49    0.00   15.64
> #system.time(invisible(factorsToDummyVariables (ds[10,])))
> ##    user  system elapsed
> ##    0.36    0.00    0.36
> #system.time(invisible(factorsToDummyVariables.old (ds[10,])))
> ##    user  system elapsed
> ##    2.18    0.00    2.20
> #system.time(invisible(factorsToDummyVariables (ds[20:30,])))
> ##    user  system elapsed
> ##    0.34    0.00    0.38
> #system.time(invisible(factorsToDummyVariables.old (ds[20:30,])))
> ##    user  system elapsed
> ##    2.11    0.00    2.15
>
> If you have to do this quite often, the difference surely adds up...
> More improvements may be possible.
> This function only works if you don't include interactions, though.
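
The interaction caveat can be illustrated in base R. This is a minimal
sketch on a small made-up data frame (dfr, main and inter are illustrative
names, not taken from the code above): a column-by-column expansion can
only produce main-effect columns, whereas model.matrix adds product
columns as soon as the formula contains an interaction.

```r
# Made-up data: one factor and one numeric predictor
dfr <- data.frame(
  X1 = factor(c("A", "B", "C", "A", "B", "C")),
  X3 = c(51, 46, 50, 44, 43, 50)
)

# Main effects only: one dummy per non-reference level, plus X3
main <- model.matrix(~ X1 + X3, data = dfr)[, -1]

# With an interaction, model.matrix also builds the product columns
inter <- model.matrix(~ X1 * X3, data = dfr)[, -1]

print(colnames(main))
print(setdiff(colnames(inter), colnames(main)))

# Each interaction column is the dummy multiplied by the numeric column
stopifnot(all(inter[, "X1B:X3"] == inter[, "X1B"] * inter[, "X3"]))
```

The extra columns are plain element-wise products, so they could be added
by hand to a main-effects matrix if only a few interactions are needed.
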
>
>
> Nick Sabbe
> --
> ping: nick.sabbe at ugent.be
> link: http://biomath.ugent.be
> wink: A1.056, Coupure Links 653, 9000 Gent
> ring: 09/264.59.36
>
> -- Do Not Disapprove
>
>
>
>
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of David Winsemius
> Sent: Monday, May 2, 2011 20:48
> To: Steve Lianoglou
> Cc: r-help at r-project.org
> Subject: Re: [R] Lasso with Categorical Variables
>
>
> On May 2, 2011, at 10:51 AM, Steve Lianoglou wrote:
>
>> Hi,
>>
>> On Mon, May 2, 2011 at 12:45 PM, Clemontina Alexander
>> <ckalexa2 at ncsu.edu> wrote:
>>> Hi! This is my first time posting. I've read the general rules and
>>> guidelines, but please bear with me if I make some fatal error in
>>> posting. Anyway, I have a continuous response and 29 predictors made
>>> up of continuous variables and nominal and ordinal categorical
>>> variables. I'd like to do lasso on these, but I get an error. The way
>>> I am using "lars" doesn't allow for the factors. Is there a special
>>> option or some other method in order to do lasso with cat. variables?
>>>
>>> Here is an example (considering ordinal variables as just nominal):
>>>
>>> set.seed(1)
>>> Y <- rnorm(10,0,1)
>>> X1 <- factor(sample(x=LETTERS[1:4], size=10, replace = TRUE))
>>> X2 <- factor(sample(x=LETTERS[5:10], size=10, replace = TRUE))
>>> X3 <- sample(x=30:55, size=10, replace=TRUE)  # think age
>>> X4 <- rchisq(10, df=4, ncp=0)
>>> X <- data.frame(X1,X2,X3,X4)
>>>
>>> > str(X)
>>> 'data.frame':   10 obs. of  4 variables:
>>>  $ X1: Factor w/ 4 levels "A","B","C","D": 4 1 3 1 2 2 1 2 4 2
>>>  $ X2: Factor w/ 5 levels "E","F","G","H",..: 3 4 3 2 5 5 5 1 5 3
>>>  $ X3: int  51 46 50 44 43 50 30 42 49 48
>>>  $ X4: num  2.86 1.55 1.94 2.45 2.75 ...
>>>
>>>
>>> I'd like to do:
>>> obj <- lars(x=X, y=Y, type = "lasso")
>>>
>>> Instead, what I have been doing is converting all data to continuous
>>> but I think this is really bad!
>>
>> Yeah, it is.
>>
>> Check out the "Categorical Predictor Variables" section here for a way
>> to handle such predictor vars:
>> http://www.psychstat.missouristate.edu/multibook/mlt08m.html
>
> Steve's citation is somewhat helpful, but not sufficient to take the
> next steps. You can find details regarding the mechanics of typical
> linear regression in R on the ?lm page where you find that the factor
> variables are typically handled by model.matrix. See below:
>
> > model.matrix(~X1 + X2 + X3 + X4, X)
>    (Intercept) X1B X1C X1D X2F X2G X2H X2I X3        X4
> 1            1   0   0   1   0   1   0   0 51 2.8640884
> 2            1   0   0   0   0   0   1   0 46 1.5462243
> 3            1   0   1   0   0   1   0   0 50 1.9430901
> 4            1   0   0   0   1   0   0   0 44 2.4504180
> 5            1   1   0   0   0   0   0   1 43 2.7535052
> 6            1   1   0   0   0   0   0   1 50 1.6200326
> 7            1   0   0   0   0   0   0   1 30 0.5750533
> 8            1   1   0   0   0   0   0   0 42 5.9224777
> 9            1   0   0   1   0   0   0   1 49 2.0401528
> 10           1   1   0   0   0   1   0   0 48 6.2995288
> attr(,"assign")
>  [1] 0 1 1 1 2 2 2 2 3 4
> attr(,"contrasts")
> attr(,"contrasts")$X1
> [1] "contr.treatment"
>
> attr(,"contrasts")$X2
> [1] "contr.treatment"
>
> The numeric variables are passed through, while the dummy variables
> for factor columns are constructed (as treatment contrasts) and the
> whole thing is returned in a neat package.
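
A minimal sketch of how that matrix could then be handed to lars, using
the data values printed in this thread. Dropping the "(Intercept)" column
is an assumption on my part (as far as I know, lars fits its own intercept
by default), and the call is guarded because lars is a contributed package
that may not be installed.

```r
set.seed(1)
Y <- rnorm(10, 0, 1)

# Predictor values copied from the str() and model.matrix output above
X <- data.frame(
  X1 = factor(c("D", "A", "C", "A", "B", "B", "A", "B", "D", "B")),
  X2 = factor(c("G", "H", "G", "F", "I", "I", "I", "E", "I", "G")),
  X3 = c(51L, 46L, 50L, 44L, 43L, 50L, 30L, 42L, 49L, 48L),
  X4 = c(2.8640884, 1.5462243, 1.9430901, 2.4504180, 2.7535052,
         1.6200326, 0.5750533, 5.9224777, 2.0401528, 6.2995288)
)

# Expand factors into treatment-contrast dummies, drop the intercept column
x <- model.matrix(~ X1 + X2 + X3 + X4, data = X)[, -1]

# x is now a purely numeric matrix with one row per observation
stopifnot(is.numeric(x), nrow(x) == length(Y))

# Only attempt the fit when the lars package is available
if (requireNamespace("lars", quietly = TRUE)) {
  obj <- lars::lars(x = x, y = Y, type = "lasso")
  print(obj)
}
```

Note that a plain lasso treats every dummy column as a separate predictor,
so individual levels of a factor can enter or leave the model on their own.
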
>
> --
> David.
>>
>> HTH,
>> -steve
>>
> --
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>