On Apr 22, 2010, at 5:16 AM, Michael Haenlein wrote:
> Dear all,
>
> I have several character strings with a high number of different levels.
> unique(x) gives me values in the range of 100-200.
> This creates problems as I would like to use them as predictors in a coxph
> model.
>
> I therefore would like to convert each of these strings to a new string
> (x_new).
> x_new should be equal to x for the top n categories (i.e. the top n levels
> with the highest occurrence) and NAN elsewhere.
> For example, for n=3 x_new would have three levels: The three most common
> levels of x + NAN.
>
> Is there some convenient way of doing this?
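# Simulated data: three common levels and two rare ones
x <- sample(c("top", "three", "levels",
              "0ther", "strings"), 30,
            replace=TRUE, prob=c(.3,.3,.3,.1,.1))
y <- c("top", "three", "levels")  # the levels to keep
xnew <- x
# %in% binds tighter than !, so this selects everything not in y
xnew[ !xnew %in% y ] <- "NAN" # not same as NaN
table(xnew)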
#--------
xnew
levels    NAN  three    top
     5      5      9     11
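
The example above hard-codes the levels to keep in y. When the top n levels are not known in advance, they can be read off a frequency table instead. What follows is only a minimal sketch, assuming x is a character vector or a factor; the names n, counts and top, and the set.seed() call, are illustrative and not part of the original reply. Ties at the cut-off are broken by whatever order sort() returns.

```r
# Sketch: derive the top-n levels from the data instead of typing them in.
set.seed(1)  # assumption: fixed seed only so the example is repeatable
x <- sample(c("top", "three", "levels", "0ther", "strings"), 30,
            replace = TRUE, prob = c(.3, .3, .3, .1, .1))

n <- 3                                        # number of levels to keep
counts <- sort(table(x), decreasing = TRUE)   # frequencies, most common first
top <- names(counts)[seq_len(min(n, length(counts)))]

# Keep x where it is one of the top levels, otherwise use the label "NAN"
# (a plain string, not the numeric NaN). as.character() guards against
# x being a factor, where ifelse() would return integer codes.
xnew <- ifelse(x %in% top, as.character(x), "NAN")
table(xnew)
```

If the result is meant to go into a model formula, wrapping it as factor(xnew) makes the level structure explicit.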
--
David.
>
> Thanks in advance,
>
> Michael
>
>
> Michael Haenlein
> Associate Professor of Marketing
> ESCP Europe
> Paris, France
David Winsemius, MD
West Hartford, CT