> From: Weiwei Shi
>
> it works.
> thanks,
>
> but: (just curious)
> why, when I tried it previously, did I get
>
> > is.vector(sample.size)
> [1] TRUE

Because a list is also a vector:

> a <- c(list(1), list(2))
> a
[[1]]
[1] 1

[[2]]
[1] 2

> is.vector(a)
[1] TRUE
> is.numeric(a)
[1] FALSE

Actually, the way I initialize a list of known length is by something like:

myList <- vector(mode="list", length=veryLong)

Andy

> I also tried as.vector(sample.size) and assigned it to sampsize; it
> still does not work.
>
> On 7/7/05, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
> > On 7/7/2005 3:38 PM, Weiwei Shi wrote:
> > > Hi there:
> > > I have a question on random forest:
> > >
> > > recently I helped a friend with her random forest and I came
> > > across this problem:
> > > her dataset has 6 classes and the sample size is pretty small:
> > > 264, and the class distribution is like this (Diag is the
> > > response variable):
> > > sample.size <- lapply(1:6, function(i) sum(Diag==i))
> > > > sample.size
> > > [[1]]
> > > [1] 36
> > >
> > > [[2]]
> > > [1] 12
> > >
> > > [[3]]
> > > [1] 120
> > >
> > > [[4]]
> > > [1] 36
> > >
> > > [[5]]
> > > [1] 30
> > >
> > > [[6]]
> > > [1] 30
> > >
> > > I assigned this sample.size to sampsize for stratified sampling
> > > and I got the following error:
> > > Error in sum(..., na.rm = na.rm) : invalid 'mode' of argument
> > >
> > > if I use sampsize=c(36, 12, 120, 36, 30, 30), then it is fine.
> > > Could you tell me why?
> >
> > The sum() function knows what to do on a vector, but not on a list.
> > You can turn your sample.size variable into a vector using
> >
> > unlist(sample.size)
> >
> > Duncan Murdoch
> >
> > > btw, for a classification problem like this with uneven class
> > > sizes, do you have some suggestions to improve its accuracy? I
> > > used the c() approach to make sampsize work, but the result is
> > > similar.
> > >
> > > Thanks,
> > >
> > > weiwei
> > >
> > > On 6/30/05, Liaw, Andy <andy_liaw at merck.com> wrote:
> > >> The limitation comes from the way categorical splits are
> > >> represented in the code: For a categorical variable with k
> > >> categories, the split is represented by k binary digits:
> > >> 0=right, 1=left. So it takes k bits to store each split on k
> > >> categories. To save storage, this is `packed' into a 4-byte
> > >> integer (32-bit), thus the limit of 32 categories.
> > >>
> > >> The current Fortran code (version 5.x) by Breiman and Cutler
> > >> gets around this limitation by storing the split in an integer
> > >> array. While this lifts the 32-category limit, it takes much
> > >> more memory to store the splits. I'm still trying to figure out
> > >> a more memory-efficient way of storing the splits without
> > >> imposing the 32-category limit. If anyone has suggestions, I'm
> > >> all ears.
> > >>
> > >> Best,
> > >> Andy
> > >>
> > >> > From: Arne.Muller at sanofi-aventis.com
> > >> >
> > >> > Hello,
> > >> >
> > >> > I'm using the random forest package. One of my factors in the
> > >> > data set contains 41 levels (I can't code this as a numeric
> > >> > value - in terms of linear models this would be a random
> > >> > factor). The randomForest call comes back with an error
> > >> > telling me that the limit is 32 categories.
> > >> >
> > >> > Is there any reason for this particular limit? Maybe it's
> > >> > possible to recompile the module with a different cutoff?
> > >> >
> > >> > thanks a lot for your help,
> > >> > kind regards,
> > >> >
> > >> > Arne
> > >> >
> > >> > ______________________________________________
> > >> > R-help at stat.math.ethz.ch mailing list
> > >> > https://stat.ethz.ch/mailman/listinfo/r-help
> > >> > PLEASE do read the posting guide!
> > >> > http://www.R-project.org/posting-guide.html
>
> --
> Weiwei Shi, Ph.D
>
> "Did you always know?"
> "No, I did not. But I believed..."
> ---Matrix III
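The list-versus-vector point at the top of this message can be reproduced with a short base R sketch. The Diag vector below is made-up illustration data that merely has the class counts quoted in the thread; it is not the poster's dataset, and the names sampsz, sampsz2 and sampsz3 are hypothetical.

```r
# Illustration data only: 264 labels with the class counts from the thread.
Diag <- rep(1:6, times = c(36, 12, 120, 36, 30, 30))

# lapply() always returns a list, one element per class.
sample.size <- lapply(1:6, function(i) sum(Diag == i))

is.list(sample.size)             # TRUE
is.vector(sample.size)           # TRUE: a list is a (generic) vector
is.numeric(sample.size)          # FALSE: it is not an atomic numeric vector
is.list(as.vector(sample.size))  # TRUE: as.vector() leaves a list a list

# unlist() flattens the list into an atomic vector that sum() accepts.
sampsz <- unlist(sample.size)
is.numeric(sampsz)               # TRUE
sum(sampsz)                      # 264

# Two ways to get the atomic vector directly, without the list detour.
sampsz2 <- sapply(1:6, function(i) sum(Diag == i))
sampsz3 <- as.vector(table(Diag))
```

This also shows why as.vector(sample.size) did not help: the object was already a vector in R's sense, just not an atomic one. Either sapply() or table() avoids the problem at the source.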
thanks. But can you suggest some approaches for this classification
problem, since some classes have too few observations?

The following is from adding sample.size:

> najie.rf.2 <- randomForest(Diag~., data=one.df[ind==1,4:ncol(one.df)],
+     importance=TRUE, sampsize=unlist(sample.size))
> najie.pred.2 <- predict(najie.rf.2, one.df[ind==2,])
> table(observed=one.df[ind==2,"Diag"], predicted=najie.pred.2)
        predicted
observed  1  2  3  4  5  6
       1  6  0  1  0  0  1
       2  0  4  0  0  0  0
       3  1  0 37  0  0  0
       4  0  0  3  5  0  0
       5  1  0  3  0  8  0
       6  0  0  0  3  0  5

and the class counts returned from sample.size are:
28, 8, 82, 28, 18, 22

Should I try gbm, since it might "focus" more on misclassified cases?

thanks,

weiwei
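To see which classes suffer most, the confusion table in the message above can be summarised per class. This is only a sketch: the counts are retyped by hand from that table, and the names conf, overall, recall and precision are hypothetical.

```r
# Confusion matrix retyped from the table above (rows = observed class,
# columns = predicted class).
conf <- matrix(c(6, 0,  1, 0, 0, 1,
                 0, 4,  0, 0, 0, 0,
                 1, 0, 37, 0, 0, 0,
                 0, 0,  3, 5, 0, 0,
                 1, 0,  3, 0, 8, 0,
                 0, 0,  0, 3, 0, 5),
               nrow = 6, byrow = TRUE,
               dimnames = list(observed  = as.character(1:6),
                               predicted = as.character(1:6)))

# Overall accuracy: correctly classified cases over all test cases.
overall <- sum(diag(conf)) / sum(conf)

# Per-class recall: share of each observed class that was predicted correctly.
recall <- diag(conf) / rowSums(conf)

# Per-class precision: share of each predicted class that was correct.
precision <- diag(conf) / colSums(conf)

round(overall, 3)
round(rbind(recall, precision), 3)
```

With only 4 to 12 test cases in the smaller classes, each per-class figure moves by a large step whenever a single case is misclassified, so the figures for those classes are rough.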
With small sample sizes, the variability of the test set error estimate
will be large. Instead of splitting the data once, you should consider
cross-validation or the bootstrap for estimating performance.

AFAIK gbm as is won't handle more than two classes. You will need to do
quite a bit of work to get it to do what MART does.

Andy
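The cross-validation suggestion can be sketched in base R as follows. Everything here is an assumption made for illustration: the helper names stratified_folds and cv_error, the toy data, and the majority-class placeholder learner. A real run would pass functions that wrap randomForest() and predict() as fit_fun and predict_fun.

```r
# Assign observations to k folds class by class, so that every fold gets a
# roughly proportional share of each class (stratified folds).
stratified_folds <- function(y, k = 5) {
  y <- as.character(y)
  folds <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    # Fold labels 1..k recycled to the class size, then shuffled.
    folds[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  folds
}

# Generic k-fold loop. fit_fun(train) returns a model;
# predict_fun(model, test) returns one predicted label per test row.
cv_error <- function(data, response, k, fit_fun, predict_fun) {
  folds <- stratified_folds(data[[response]], k)
  errs <- sapply(seq_len(k), function(f) {
    train <- data[folds != f, , drop = FALSE]
    test  <- data[folds == f, , drop = FALSE]
    pred  <- predict_fun(fit_fun(train), test)
    mean(as.character(pred) != as.character(test[[response]]))
  })
  c(mean = mean(errs), sd = sd(errs))
}

# Toy data with the class counts from the thread; x is pure noise.
set.seed(1)
toy <- data.frame(Diag = factor(rep(1:6, times = c(36, 12, 120, 36, 30, 30))),
                  x = rnorm(264))

# Placeholder learner: always predict the training fold's majority class.
fit_majority     <- function(train) names(which.max(table(train$Diag)))
predict_majority <- function(model, test) rep(model, nrow(test))

res <- cv_error(toy, "Diag", k = 5, fit_majority, predict_majority)
res
```

The sd entry gives a rough idea of how much the error estimate varies from fold to fold, which is the variability the message warns about. Stratifying the folds keeps the rarest class (12 cases here) represented in every fold.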