thr3ads.net - R help - [R] randomForest [Jul 2005]


Liaw, Andy

2005-Jul-07 20:10 UTC

[R] randomForest

> From: Weiwei Shi
> 
> it works.
> thanks,
> 
> but: (just curious)
> why i tried previously and i got
> 
> > is.vector(sample.size)
> [1] TRUE

Because a list is also a vector:

> a <- c(list(1), list(2))
> a
[[1]]
[1] 1

[[2]]
[1] 2

> is.vector(a)
[1] TRUE
> is.numeric(a)
[1] FALSE

Actually, the way I initialize a list of known length is by something like:

myList <- vector(mode="list", length=veryLong)
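
A short sketch of the point above, using a made-up `Diag` vector (hypothetical, not the poster's data). It shows that `is.vector()` cannot tell a list from a numeric vector, that `as.vector()` leaves a list unchanged, and a few ways to get the plain numeric vector that `sum()` expects:

```r
# Hypothetical response vector standing in for Diag (not the poster's data).
Diag <- rep(1:3, times = c(4, 2, 6))

# lapply() always returns a list, one element per class.
size_list <- lapply(1:3, function(i) sum(Diag == i))

is.vector(size_list)   # TRUE: a list is a (generic) vector
is.list(size_list)     # TRUE
is.numeric(size_list)  # FALSE: this check tells the two apart

# as.vector() leaves a list as a list, which is why it did not help.
still_list <- is.list(as.vector(size_list))  # TRUE

# Three ways to get a plain vector of class counts.
size_vec1 <- unlist(size_list)
size_vec2 <- vapply(1:3, function(i) sum(Diag == i), integer(1))
size_vec3 <- as.vector(table(Diag))

sum(size_vec1)  # 12

# sum() on the list itself fails; tryCatch() captures the error here.
list_sum <- tryCatch(sum(size_list), error = function(e) "error")

# Preallocating a list of known length, then filling it in a loop.
myList <- vector(mode = "list", length = 3)
for (i in 1:3) myList[[i]] <- sum(Diag == i)
```

Using `vapply()` or `table()` from the start avoids creating the list at all.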

Andy
 
> i also tried as.vector(sample.size) and assigned it to sampsize, it still
> does not work.
> 
> On 7/7/05, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
> > On 7/7/2005 3:38 PM, Weiwei Shi wrote:
> > > Hi there:
> > > I have a question on random forest:
> > >
> > > recently i helped a friend with her random forest and i came across
> > > this problem:
> > > her dataset has 6 classes and the sample size is pretty small:
> > > 264, and the class distribution is like this (Diag is the response
> > > variable)
> > > sample.size <- lapply(1:6, function(i) sum(Diag==i))
> > > > sample.size
> > > [[1]]
> > > [1] 36
> > >
> > > [[2]]
> > > [1] 12
> > >
> > > [[3]]
> > > [1] 120
> > >
> > > [[4]]
> > > [1] 36
> > >
> > > [[5]]
> > > [1] 30
> > >
> > > [[6]]
> > > [1] 30
> > >
> > > I assigned this sample.size to sampsize for stratified sampling
> > > and i got the following error:
> > > Error in sum(..., na.rm = na.rm) : invalid 'mode' of argument
> > >
> > > if I use sampsize=c(36, 12, 120, 36, 30, 30), then it is fine. Could
> > > you tell me why?
> > 
> > The sum() function knows what to do on a vector, but not on a list.  You
> > can turn your sample.size variable into a vector using
> > 
> > unlist(sample.size)
> > 
> > Duncan Murdoch
> > 
> > > btw, as to the classification problem with uneven class sizes,
> > > do you have some suggestions to improve its accuracy?  I
> > > tried the c() way to make sampsize work but the result is
> > > similar.
> > >
> > > Thanks,
> > >
> > > weiwei
> > >
> > > On 6/30/05, Liaw, Andy <andy_liaw at merck.com> wrote:
> > >> The limitation comes from the way categorical splits are represented
> > >> in the code:  For a categorical variable with k categories, the split
> > >> is represented by k binary digits: 0=right, 1=left.  So it takes k
> > >> bits to store each split on k categories.  To save storage, this is
> > >> `packed' into a 4-byte integer (32-bit), thus the limit of 32
> > >> categories.
> > >>
> > >> The current Fortran code (version 5.x) by Breiman and Cutler gets
> > >> around this limitation by storing the split in an integer array.
> > >> While this lifts the 32-category limit, it takes much more memory to
> > >> store the splits.  I'm still trying to figure out a more memory
> > >> efficient way of storing the splits without imposing the 32-category
> > >> limit.  If anyone has suggestions, I'm all ears.
> > >>
> > >> Best,
> > >> Andy
> > >>
> > >> > From: Arne.Muller at sanofi-aventis.com
> > >> >
> > >> > Hello,
> > >> >
> > >> > I'm using the random forest package. One of my factors in the
> > >> > data set contains 41 levels (I can't code this as a numeric
> > >> > value - in terms of linear models this would be a random
> > >> > factor). The randomForest call comes back with an error
> > >> > telling me that the limit is 32 categories.
> > >> >
> > >> > Is there any reason for this particular limit? Maybe it's
> > >> > possible to recompile the module with a different cutoff?
> > >> >
> > >> >       thanks a  lot for your help,
> > >> >       kind regards,
> > >> >
> > >> >
> > >> >       Arne
> > >> >
> > >> > ______________________________________________
> > >> > R-help at stat.math.ethz.ch mailing list
> > >> > https://stat.ethz.ch/mailman/listinfo/r-help
> > >> > PLEASE do read the posting guide!
> > >> > http://www.R-project.org/posting-guide.html
> 
> -- 
> Weiwei Shi, Ph.D
> 
> "Did you always know?"
> "No, I did not. But I believed..."
> ---Matrix III
>
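
The 32-category limit explained in the quoted 6/30 reply can be illustrated with a small sketch. This is not the package's C or Fortran code; the helper names `pack_split()` and `goes_left()` are made up for illustration. It packs a left/right assignment for k categories into one integer, one bit per category:

```r
# Illustrative only: helper names are hypothetical, not part of randomForest.

# Pack a logical vector (TRUE = category goes left, FALSE = right)
# into a single integer, one bit per category.
pack_split <- function(goes_left_flags) {
  k <- length(goes_left_flags)
  # R integers are 32-bit and signed; this sketch stops at 31 bits so it
  # never touches the bit pattern R reserves for NA_integer_.
  stopifnot(is.logical(goes_left_flags), k >= 1, k <= 31)
  code <- 0L
  for (j in seq_len(k)) {
    if (goes_left_flags[j]) {
      code <- bitwOr(code, bitwShiftL(1L, j - 1L))
    }
  }
  code
}

# Read back the direction for category j (1-based) from the packed integer.
goes_left <- function(code, j) {
  bitwAnd(bitwShiftR(code, j - 1L), 1L) == 1L
}

# A split on a 5-level factor: levels 1, 3 and 4 go left.
flags <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
code <- pack_split(flags)
code                                            # 13 = 1 + 4 + 8
decoded <- vapply(1:5, function(j) goes_left(code, j), logical(1))
```

A factor with 41 levels would need 41 bits per split, which does not fit in one 32-bit integer; storing one array element per category removes the limit at the cost of more memory, as the reply describes.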

Weiwei Shi

2005-Jul-07 20:23 UTC

[R] randomForest

thanks. but can you suggest some ways to handle the classification problem,
since for some classes there are too few observations?

the following is from adding sample.size:

> najie.rf.2 <- randomForest(Diag~., data=one.df[ind==1,4:ncol(one.df)],
+     importance=TRUE, sampsize=unlist(sample.size))
> najie.pred.2 <- predict(najie.rf.2, one.df[ind==2,])
> table(observed=one.df[ind==2,"Diag"], predicted=najie.pred.2)
        predicted
observed  1  2  3  4  5  6
       1  6  0  1  0  0  1
       2  0  4  0  0  0  0
       3  1  0 37  0  0  0
       4  0  0  3  5  0  0
       5  1  0  3  0  8  0
       6  0  0  0  3  0  5

and the per-class counts returned from sample.size (training rows only) are:
28, 8, 82, 28, 18, 22
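
As a worked reading of the table above: the diagonal holds the correctly classified cases, so overall accuracy and the per-class fraction correct follow directly. The matrix below is typed in from the posted output; nothing else is assumed.

```r
# Confusion matrix copied from the posted table
# (rows = observed class, columns = predicted class).
conf <- matrix(
  c(6, 0,  1, 0, 0, 1,
    0, 4,  0, 0, 0, 0,
    1, 0, 37, 0, 0, 0,
    0, 0,  3, 5, 0, 0,
    1, 0,  3, 0, 8, 0,
    0, 0,  0, 3, 0, 5),
  nrow = 6, byrow = TRUE,
  dimnames = list(observed = as.character(1:6),
                  predicted = as.character(1:6))
)

n_test   <- sum(conf)                    # 78 test cases
correct  <- sum(diag(conf))              # 65 on the diagonal
accuracy <- correct / n_test             # 65 / 78
recall   <- diag(conf) / rowSums(conf)   # per-class fraction correct

round(accuracy, 3)
round(recall, 3)
rowSums(conf)                            # test cases per observed class
```

Five of the six classes have only 4 to 12 test cases, so a single misclassified case moves that class's figure by roughly 8 to 25 percentage points.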

Should I use gbm to try it since it might "focus" more on misplaced
cases?

thanks,

weiwei



Liaw, Andy

2005-Jul-07 21:07 UTC

[R] randomForest

With small sample sizes, the variability of the test set error estimate will
be large.  Instead of splitting the data once, you should consider
cross-validation or the bootstrap for estimating performance.
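
A minimal sketch of what cross-validation could look like here, in base R. The helper names `stratified_folds()` and `cv_error()` are made up for this example, and the demo uses the built-in `iris` data with a deliberately trivial classifier so that it runs without extra packages. The commented lines at the end assume the randomForest package is installed.

```r
# Assign each row to one of k folds, separately within each class, so that
# small classes are spread across the folds as evenly as possible.
stratified_folds <- function(y, k) {
  y <- as.character(y)
  folds <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    idx <- idx[sample.int(length(idx))]            # shuffle within the class
    folds[idx] <- rep_len(seq_len(k), length(idx))
  }
  folds
}

# Generic k-fold CV: fit_fun(train) returns a model,
# predict_fun(model, test) returns predicted classes.
cv_error <- function(data, response, k, fit_fun, predict_fun) {
  folds <- stratified_folds(data[[response]], k)
  wrong <- 0
  for (f in seq_len(k)) {
    test  <- data[folds == f, , drop = FALSE]
    train <- data[folds != f, , drop = FALSE]
    pred  <- predict_fun(fit_fun(train), test)
    wrong <- wrong + sum(as.character(pred) != as.character(test[[response]]))
  }
  wrong / nrow(data)   # every row is tested exactly once
}

# Demo with a trivial classifier (always predict the most common training
# class), only to show the plumbing.
set.seed(1)
majority_fit     <- function(train) names(which.max(table(train$Species)))
majority_predict <- function(model, test) rep(model, nrow(test))
err <- cv_error(iris, "Species", k = 5, majority_fit, majority_predict)

# With randomForest installed, the same helper could be called as:
# rf_fit     <- function(train) randomForest::randomForest(Species ~ ., data = train)
# rf_predict <- function(model, test) predict(model, test)
# cv_error(iris, "Species", k = 5, rf_fit, rf_predict)
```

Because every observation is used for testing once, the error estimate rests on all 264 cases rather than on a single held-out subset.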

AFAIK gbm as is won't handle more than two classes.  You will need to do
quite a bit of work to get it to do what MART does.

Andy
