Bryan Hanson
2009-Jul-24 13:17 UTC
[R] str(data.frame) after subsetting reflects original structure, not subsetted structure?
I find that after subsetting (you may prefer "conditional selection") a data frame and assigning it to a new object, str(new object) reflects the original data frame, not the new one:

  A <- rnorm(20)
  B <- factor(rep(c("t", "g"), 10))
  C <- factor(rep(c("h", "l"), 10))
  D <- data.frame(A, B, C)

  str(D)      # reports correctly

  E <- D[D$C == "h",]

  str(E)      # reports that D$C still has 2 levels, but
  E           # or E$C shows that subsetting worked properly
  summary(E)  # shows the original structure and that subsetting worked

Is this the expected behavior, and if so, is there a particular rationale? I am fairly sure that the information about E was inherited from D, but why wasn't it updated to reflect the revised object? Is there an argument that I can use to force the updating?

For better or worse, I use str() a lot to check my work, and in this case, it seems to have misled me.

Thanks as always, Bryan

*************
Bryan Hanson
Professor of Chemistry & Biochemistry
DePauw University, Greencastle IN USA
Ben Bolker
2009-Jul-24 13:40 UTC
[R] str(data.frame) after subsetting reflects original structure, not subsetted structure?
This is a FAQ, but not one that's documented (I think). Subsetting does not drop unused factor levels. If you try table(E$C) you will see that there are no "l" values left, although the level itself is still listed:

   h  l
  10  0

Any of the following will drop the unused level:

  E$C <- factor(E$C)

or

  E$C <- E$C[drop=TRUE]

or

  library(gdata)
  E <- drop.levels(E)

Running

  RSiteSearch("subset drop",restrict=c("Rhelp02","Rhelp08"))

will get you lots of information (perhaps more than you want) on the pros and cons of this design decision ...

Ben Bolker
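The options above can be compared side by side. The following is a minimal sketch rather than part of the original reply: it rebuilds the data frame from the first post (set.seed() and the names E1, E2, E3 are additions for illustration) and checks the levels after each approach. The third option uses droplevels(), which is not mentioned in the thread; it is in base R from version 2.12.0 onward, so it assumes a newer R than the one current when this was written.

```r
# Rebuild the example from the original post. set.seed() is added here
# only so the sketch is reproducible; it is not in the thread.
set.seed(1)
D <- data.frame(A = rnorm(20),
                B = factor(rep(c("t", "g"), 10)),
                C = factor(rep(c("h", "l"), 10)))
E <- D[D$C == "h", ]

nlevels(E$C)   # the unused level "l" is still counted
table(E$C)     # "l" is listed, with a count of zero

# Option 1: re-create the factor from the values that remain
E1 <- E
E1$C <- factor(E1$C)

# Option 2: use the drop argument of the factor subset method
E2 <- E
E2$C <- E2$C[drop = TRUE]

# Option 3 (assumes R >= 2.12.0): drop unused levels in every factor column
E3 <- droplevels(E)

levels(E1$C)
levels(E2$C)
levels(E3$C)
str(E3)        # str() now reports only the levels actually in use
```

Options 1 and 2 act on a single column, while droplevels() on a data frame touches every factor column, so B loses its unused level as well.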
Marc Schwartz
2009-Jul-24 13:46 UTC
[R] str(data.frame) after subsetting reflects original structure, not subsetted structure?
See ?"[.factor", which is the extract (subset) method for factors. Note that the 'drop' argument is FALSE by default. It is this argument that controls the retention of unused factor levels.

The reason that it is FALSE by default is to ensure that, if you are comparing factors from more than one data source, comparisons of the factor levels, and any other uses of them, are consistent.

For one approach to dropping unused factor levels from a data frame, see:

  http://wiki.r-project.org/rwiki/doku.php?id=tips:data-manip:drop_unused_levels

HTH,

Marc Schwartz
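To make the rationale concrete, here is a small sketch of the 'drop' argument and of why retained levels help when comparing sources. The vector f and the names site1 and site2 are hypothetical and chosen only for illustration; they do not come from the thread.

```r
f <- factor(c("h", "l", "h", "l"))

# Default (drop = FALSE): every level is kept, used or not
levels(f[f == "h"])

# drop = TRUE: only the levels still present are kept
levels(f[f == "h", drop = TRUE])

# Hypothetical two-source comparison. Because both subsets keep the
# full set of levels, their tables have the same categories in the
# same order, and a zero count is reported instead of a missing column.
site1 <- f[f == "h"]
site2 <- f[f == "l"]
t1 <- table(site1)
t2 <- table(site2)

t1
t2
identical(names(t1), names(t2))
```

Had the levels been dropped on subsetting, the two tables would have had different categories and could not be lined up without extra work.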
Bryan Hanson
2009-Jul-24 14:16 UTC
[R] str(data.frame) after subsetting reflects original structure, not subsetted structure?
Thanks Marc and Ben... Your answers were most helpful. I suspected something had been written about it, but was having trouble formulating a reasonable search query. I was looking in the help page for str(), which was sort of a dead end.

Bryan

*************
Bryan Hanson
Professor of Chemistry & Biochemistry
DePauw University, Greencastle IN USA