thr3ads.net - R help - [R] Problem with factor state when subset()ing a data.frame [Feb 2007]

Roger Leigh

2007-Feb-08 21:51 UTC

[R] Problem with factor state when subset()ing a data.frame

Hi folks,

I am running into a problem when calling subset() on a large
data.frame.  One of the columns contains strings which are used as
factors.  R seems to automatically convert the column to a factor when
the data.frame is constructed, and the factor's levels appear not to
get updated when I create a subset of the table.

A minimal testcase to demonstrate the problem follows:


sample <- data.frame(c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
                     c(5,3,5,3,6,7,8,3,2,6))
names(sample) <- c("ID", "Value")

print(sample)

sample.filtered <- subset(sample, ID != "B", select=c(ID, Value))
# Or sample.filtered <- subset(sample, ID != "B", select=c(ID, Value), drop=T)

print(sample.filtered)

plot(sample.filtered)
plot(sample.filtered$Value ~ sample.filtered$ID)

print(levels(sample.filtered$ID))
print(levels(factor(sample.filtered$ID)))

plot(sample.filtered$Value ~ factor(sample.filtered$ID))


Am I doing something wrong here, or is this an R bug?  How can I get
the new data.frame to update the factors, so I don't get redundant
"empty" factors on the plot by eliminating the "phantom" factors?  (I
also need to remove the unused factors for other analyses, and
factoring them "by hand" seems a little redundant.)


Kind regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.

Peter Dalgaard

2007-Feb-09 13:24 UTC

[R] Problem with factor state when subset()ing a data.frame

Roger Leigh wrote:
> Hi folks,
>
> I am running into a problem when calling subset() on a large
> data.frame.  One of the columns contains strings which are used as
> factors.  R seems to automatically factor the column when the
> data.frame is constructed, and this appears to not get updated when I
> create a subset of the table.
>
> A minimal testcase to demonstrate the problem follows:
> [snip]
> Am I doing something wrong here, or is this an R bug?

Not really, and no.

This has been discussed a number of times in the past, and the consensus
(grudgingly by some) seems to be that R's current behaviour is the
rational one. The basic issue is whether the fact that a factor level is
absent in a subgroup should change the level set. I.e., if you split a
population by occupation, should the fact that there are no women in the
subgroup of firefighters turn gender into a one-level factor for that
group?  Sometimes it is sensible, but often it is not: If you do a
series of barplots of the gender distribution, should they not have an
empty bar for females when there are none? Similarly, if you have a
semiquantitative scale like terrible-poor-mediocre-good-excellent would
you not prefer to have tables and plots represent all five possible
values always?
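
The behaviour described above can be seen in a small sketch. The data frame `staff` and its columns are made up for illustration and are not from the original post; `stringsAsFactors = TRUE` is spelled out because R 4.0.0 and later no longer convert strings to factors by default.

```r
# Hypothetical data, invented for this illustration.
staff <- data.frame(
  occupation = c("firefighter", "firefighter", "nurse", "nurse"),
  gender     = c("M", "M", "F", "M"),
  stringsAsFactors = TRUE
)

firefighters <- subset(staff, occupation == "firefighter")

# The level set still records both genders, so the absent one is
# tabulated with a count of zero instead of disappearing.
print(levels(firefighters$gender))
print(table(firefighters$gender))

# Re-applying factor() keeps only the levels that actually occur.
print(table(factor(firefighters$gender)))
```

Whether the zero-count row is wanted depends on the analysis, which is the point of the argument above.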

> How can I get
> the new data.frame to update the factors, so I don't get redundant
> "empty" factors on the plot by eliminating the "phantom" factors?  (I
> also need to remove the unused factors for other analyses, and
> factoring them "by hand" seems a little redundant.)

You already know how (it's not redundant, as you might not want to do
it). I don't think there's an easier way, but you can automate, as in

sb <- subset(.....)
isf <-  sapply(sb, is.factor)
sb[isf] <- lapply(sb[isf], factor)
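
As a runnable sketch of the three lines above, here they are applied to the data from the original question. The name `dat` is an assumption used here to avoid masking base R's sample() function, and `stringsAsFactors = TRUE` is added so the example behaves the same on R 4.0.0 and later. The final step uses droplevels(), which was added to base R in version 2.12.0, after this thread was written, and should give the same result.

```r
# Data from the original post, under an assumed name.
dat <- data.frame(ID    = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
                  Value = c(5, 3, 5, 3, 6, 7, 8, 3, 2, 6),
                  stringsAsFactors = TRUE)

sb <- subset(dat, ID != "B")
print(levels(sb$ID))   # "B" is still a level although no rows use it

# Re-create every factor column so that unused levels are dropped.
isf <- sapply(sb, is.factor)
sb[isf] <- lapply(sb[isf], factor)
print(levels(sb$ID))   # only the levels that occur remain

# Alternative on R >= 2.12.0: droplevels() handles all factor columns.
sb2 <- droplevels(subset(dat, ID != "B"))
print(levels(sb2$ID))
```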

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
