Roger Leigh
2007-Feb-08 21:51 UTC
[R] Problem with factor state when subset()ing a data.frame
Hi folks,

I am running into a problem when calling subset() on a large
data.frame. One of the columns contains strings which are used as
factors. R seems to automatically factor the column when the
data.frame is constructed, and this appears to not get updated when I
create a subset of the table.

A minimal testcase to demonstrate the problem follows:

    sample <- data.frame(c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
                         c(5,3,5,3,6,7,8,3,2,6))
    names(sample) <- c("ID", "Value")
    print(sample)

    sample.filtered <- subset(sample, ID != "B", select=c(ID, Value))
    # Or sample.filtered <- subset(sample, ID != "B", select=c(ID, Value), drop=T)
    print(sample.filtered)

    plot(sample.filtered)
    plot(sample.filtered$Value ~ sample.filtered$ID)

    print(levels(sample.filtered$ID))
    print(levels(factor(sample.filtered$ID)))
    plot(sample.filtered$Value ~ factor(sample.filtered$ID))

Am I doing something wrong here, or is this an R bug? How can I get
the new data.frame to update the factors, so I don't get redundant
"empty" factors on the plot by eliminating the "phantom" factors? (I
also need to remove the unused factors for other analyses, and
factoring them "by hand" seems a little redundant.)

Kind regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
Peter Dalgaard
2007-Feb-09 13:24 UTC
[R] Problem with factor state when subset()ing a data.frame
Roger Leigh wrote:
> Hi folks,
>
> I am running into a problem when calling subset() on a large
> data.frame. One of the columns contains strings which are used as
> factors. R seems to automatically factor the column when the
> data.frame is constructed, and this appears to not get updated when I
> create a subset of the table.
>
> A minimal testcase to demonstrate the problem follows:
> [snip]
> Am I doing something wrong here, or is this an R bug?

Not really, and no. This has been discussed a number of times in the
past, and the consensus (grudgingly by some) seems to be that R's
current behaviour is the rational one.

The basic issue is whether the fact that a factor level is absent in a
subgroup should change the level set. I.e., if you split a population
by occupation, should the fact that there are no women in the subgroup
of firefighters turn gender into a one-level factor for that group?
Sometimes it is sensible, but often it is not: If you do a series of
barplots of the gender distribution, should they not have an empty bar
for females when there are none? Similarly, if you have a
semiquantitative scale like terrible-poor-mediocre-good-excellent,
would you not prefer to have tables and plots represent all five
possible values always?

> How can I get
> the new data.frame to update the factors, so I don't get redundant
> "empty" factors on the plot by eliminating the "phantom" factors? (I
> also need to remove the unused factors for other analyses, and
> factoring them "by hand" seems a little redundant.)

You already know how (it's not redundant, as you might want not to do
it). I don't think there's an easier way, but you can automate, as in

    sb <- subset(.....)
    isf <- sapply(sb, is.factor)
    sb[isf] <- lapply(sb[isf], factor)

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
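
For anyone who wants to run the suggestion above end to end, here is a minimal sketch applied to the test case from the original post. Three things in it are assumptions rather than part of the thread. First, stringsAsFactors = TRUE is passed explicitly, because from R 4.0.0 onward data.frame() no longer converts strings to factors by default. Second, drop_unused is a hypothetical helper name that simply wraps the three lines from the reply, with vapply() swapped in for sapply() so the result is guaranteed to be a logical vector. Third, the final comparison uses droplevels(), which was added to base R after this thread was written (R 2.12.0, as far as I know) and should give the same result.

```r
# Test case from the original post. stringsAsFactors = TRUE is an
# assumption added here so the example behaves the same on R >= 4.0.0.
sample <- data.frame(
  ID    = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  Value = c(5, 3, 5, 3, 6, 7, 8, 3, 2, 6),
  stringsAsFactors = TRUE
)

sample.filtered <- subset(sample, ID != "B")

# subset() keeps the full level set: "B" is still a level, with count 0.
print(levels(sample.filtered$ID))   # "A" "B" "C"
print(table(sample.filtered$ID))    # A: 4, B: 0, C: 3

# Hypothetical helper (name is not from the thread) wrapping the
# approach in the reply: re-run factor() on every factor column so
# that levels with no remaining observations are dropped.
drop_unused <- function(df) {
  isf <- vapply(df, is.factor, logical(1))
  df[isf] <- lapply(df[isf], factor)
  df
}

sample.clean <- drop_unused(sample.filtered)
print(levels(sample.clean$ID))      # "A" "C"

# droplevels() in later versions of base R should agree with the helper.
same_levels <- identical(levels(droplevels(sample.filtered)$ID),
                         levels(sample.clean$ID))
print(same_levels)
```

Keeping the level-dropping step separate from subset() matches the point made in the reply: the full level set is retained by default, and you drop unused levels only where an empty bar or a zero count is not wanted.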