A factor with 5000 levels looks like it may be a numeric variable that was
accidentally coded as a factor (functions like read.table will do this if any
entry in the column contains a non-numeric character).
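
One way to check for this is sketched below. The data are made up for
illustration (the factor `x` and its "n/a" entry are assumptions, not taken
from the poster's dataset); the idea is to convert through character and see
which entries fail to become numbers.

```r
# Hypothetical example: a numeric column with one stray non-numeric entry,
# as read.table might produce it.
x <- factor(c("1.5", "2", "3.25", "n/a", "2"))

# Convert through character first; as.numeric() applied directly to a factor
# returns the internal integer codes, not the original values.
vals <- suppressWarnings(as.numeric(as.character(x)))

# Entries that were not missing to begin with but failed to convert
# are the likely culprits.
bad <- unique(as.character(x)[is.na(vals) & !is.na(x)])
print(bad)
```

If `bad` turns out to be a short list of missing-value codes, rereading the
file with those strings passed to the na.strings argument of read.table
should give a numeric column directly.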
If you really have a 5000 level factor, which levels can be discarded or
combined is a question for the subject-matter scientist, not the statistician.
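
Once the subject-matter expert has decided on a grouping, applying it in R is
mechanical. The sketch below uses a made-up factor `f` and a made-up grouping
(both are assumptions for illustration only): tabulate the levels to show the
expert, then assign a named list to levels() to merge them.

```r
# Hypothetical factor standing in for the real 5000-level variable.
f <- factor(c("a", "b", "c", "d", "a", "b"))

# Level frequencies, most common first, to review with the expert.
counts <- sort(table(f), decreasing = TRUE)
print(counts)

# Hypothetical expert-supplied grouping: new level name = old levels to merge.
levels(f) <- list(AB = c("a", "b"), other = c("c", "d"))
print(table(f))
```

The grouping itself still has to come from knowledge of what the levels mean;
the code only applies it.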
--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Saeed Abu Nimeh
> Sent: Thursday, August 26, 2010 1:40 PM
> To: r-help at r-project.org
> Subject: [R] Importance of levels in a factor variable
>
> I have a dataset of multiple variables and a response. For example,
> > str(x)
> 'data.frame': 3557238 obs. of 44 variables:
> $ response : Factor w/ 2 levels
> $ var2: Factor w/5000 levels
>
>
> If var2, for example, is a factor with 5000 levels, what is the best
> approach to determine which of these levels are the most important to
> include in building the model, and which ones to discard? Assuming
> there is a way to do that, is it valid to include only the
> important levels and discard the rest for that variable when building
> the model?
> Thanks,
> Saeed
>
> ---
> > sessionInfo()
> R version 2.10.1 (2009-12-14)
> x86_64-pc-linux-gnu
> 32 GB RAM
>