Liaw, Andy
2005-Jun-28 12:37 UTC
[Rd] boxplot by factor (Package base version 2.1.1) ( PR#7976)
The issue is not with boxplot, but with split. boxplot.formula()
calls boxplot(split(mf[[response]], mf[-response]), ...),
but look at what split() returns when there are empty levels in
the factor:

> f <- factor(gl(3, 6), levels=1:5)
> y <- rnorm(f)
> split(y, f)
$"1"
[1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520

$"2"
[1] -1.1296642 -0.4808355 -0.2789933  0.1220718  0.1287742 -0.7573801

$"3"
[1]  1.2320902  0.5090700 -1.5508074  2.1373780  1.1681297 -0.7151561

The "culprit" is the following in split.default():

    f <- factor(f)

which drops empty levels in f, if there are any. BTW, ?split doesn't
mention what it does in such a situation. Perhaps it should?

If this is to be "fixed", I suppose an additional argument, e.g.,
drop=TRUE, can be added, and the corresponding line mentioned
above changed to something like:

    if (drop || !is.factor(f)) f <- factor(f)

Then this additional argument can be passed on from boxplot.formula() to
split().

Just my $0.02...

Andy

> From: mwtoews at sfu.ca
>
> I consider this to be an old bug, which also persists in Splus 7. It
> is unnecessary, and annoying.
>
> ## Section 1: Consider a simple data frame with three possible
> ## factors (in levels)
>
> d <- data.frame(a=sort(rnorm(10)*10),
>                 b=factor(c(rep("A",4), rep("C",6)), levels=c("A","B","C")))
> tapply(d$a, d$b, mean) # returns three results, which I would expect
> plot(a ~ b, d) # plots only two of three objects, ignoring that there
>                # was "C" in the second position
>
> # if I tried to plot a blank in between the two boxplots:
> plot(a ~ b, d, at=1:3) # nope: error
> plot(a ~ b, d, at=c(1,3)) # nope: out of range (also xlim does
>                           # nothing for the formula boxplot method)
>
> # to make this work with the current R/Splus implementation, I have
> # to add a zero:
> d <- rbind(d, data.frame(a=0,b="B")) # which I don't want to do,
>                                      # since there are no "B"
> plot(a ~ b, d) # yuk!
>
> ## Section 2: Why is this important? Consider another realistic
> ## example of [synthetic] daily temperature
>
> temp <- 5 - 10*cos(1:365*2*pi/365) + rnorm(365)*3
> d1 <- data.frame(year=2005, jday=1:365, date=NA, month=NA, temp)
> # jday is Julian day [1,365]
> d1$date <- as.Date(paste(d1$year, d1$jday), "%Y %j")
> d1$month <- factor(months(d1$date,TRUE), levels=month.abb)
> plot(temp ~ month, d1) # perfect, in a perfect meteorological world
>
> d2 <- d1[!d1$month %in% c("Mar","Apr","May","Sep"),] # now let's
>                                                      # remove some data
> tapply(d2$temp,d2$month,mean) # perfect
> plot(temp ~ month, d2) # ugly, not 12 months, etc. (despite
>                        # having 12 levels)
>
> # again the only cure is to add zeros to the missing months
> # (unnecessary forgery of data)
> d3 <- d2
> for (i in c("Mar","Apr","May","Sep")) {
>   d3 <- rbind(d3,NA)
>   d3$month[nrow(d3)] <- i
>   d3$temp[nrow(d3)] <- 0
> }
> plot(temp ~ month, d3) # still ugly, but at least has 12 months!
>
> ## Section 3: Solution
> The obvious solution is to leave a blank where a boxplot should go,
> similar to tapply. This would have 1:n positions, where n is the
> number of levels of the factor, not the number of factors that have
> one or more numbers. The position should also have a label under the
> tick mark.
> I don't see any reason why the missing data should be completely
> ignored. Users wishing to not plot the blanks where the data could go
> can simply type (for back-compatibility):
>
> d2$month <- factor(d2$month) # from 12 to 8 levels
>
> Which will produce the same 8-factor plot as above.
>
> ## Section 4: Conclusion
> I consider this to be a bug in regards to data representation, and
> this function is not consistent with other functions like `tapply'.
> Considering that the back-compatibility solution is very simple, and
> most users would probably prefer a result including all levels (NULL
> or real values in each), I feel this is an appropriate improvement (and
> easy to fix in the code). At the very least, include an option to
> honour the factor levels.
>
> Thanks.
> -mt
>
> --please do not edit the information below--
>
> Version:
>  platform = powerpc-apple-darwin8.1.0
>  arch = powerpc
>  os = darwin8.1.0
>  system = powerpc, darwin8.1.0
>  status = Patched
>  major = 2
>  minor = 1.1
>  year = 2005
>  month = 06
>  day = 26
>  language = R
>
> Locale:
> en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
>
> Search Path:
>  .GlobalEnv, package:methods, package:stats, package:graphics,
>  package:grDevices, package:utils, package:datasets, Autoloads,
>  package:base
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
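The drop argument suggested in Andy's message can be illustrated with a minimal sketch. The helper split_levels() below is hypothetical and not part of base R; it does not reproduce the internals of split.default(), it only mirrors the proposed line `if (drop || !is.factor(f)) f <- factor(f)` and then builds one list element per remaining level.

```r
# Hypothetical helper (an assumption for illustration, not base R code):
# a split()-like function with the proposed 'drop' switch.
split_levels <- function(x, f, drop = TRUE) {
  # Re-factoring an existing factor discards its empty levels,
  # so only do it when asked to, or when f is not yet a factor.
  if (drop || !is.factor(f)) f <- factor(f)
  # One element per level; empty levels give zero-length vectors.
  out <- lapply(levels(f), function(l) x[!is.na(f) & f == l])
  names(out) <- levels(f)
  out
}

set.seed(42)
f <- factor(gl(3, 6), levels = 1:5)  # levels "4" and "5" are empty
y <- rnorm(length(f))

dropped <- split_levels(y, f)                # drop = TRUE: 3 groups
kept    <- split_levels(y, f, drop = FALSE)  # all 5 levels retained

names(dropped)        # "1" "2" "3"
sapply(kept, length)  # 6 6 6 0 0
```

With drop = FALSE the empty levels survive as zero-length groups, which is what a caller such as boxplot.formula() would need in order to reserve a position for them.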
Peter Dalgaard
2005-Jun-28 12:57 UTC
[Rd] boxplot by factor (Package base version 2.1.1) ( PR#7976)
"Liaw, Andy" <andy_liaw at merck.com> writes:

> The issue is not with boxplot, but with split. boxplot.formula()
> calls boxplot(split(mf[[response]], mf[-response]), ...),
> but look at what split() returns when there are empty levels in
> the factor:
>
> > f <- factor(gl(3, 6), levels=1:5)
> > y <- rnorm(f)
> > split(y, f)
> $"1"
> [1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520
>
> $"2"
> [1] -1.1296642 -0.4808355 -0.2789933  0.1220718  0.1287742 -0.7573801
>
> $"3"
> [1]  1.2320902  0.5090700 -1.5508074  2.1373780  1.1681297 -0.7151561
>
> The "culprit" is the following in split.default():
>
>     f <- factor(f)
>
> which drops empty levels in f, if there are any. BTW, ?split doesn't
> mention what it does in such a situation. Perhaps it should?
>
> If this is to be "fixed", I suppose an additional argument, e.g.,
> drop=TRUE, can be added, and the corresponding line mentioned
> above changed to something like:
>
>     if (drop || !is.factor(f)) f <- factor(f)
>
> Then this additional argument can be passed on from boxplot.formula() to
> split().

Alternatively, I suspect that the intention was as.factor() rather
than factor(). It does require a bit of care to fix it that way,
though. There could be problems with empty levels popping up in
unexpected places.

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
Gabor Grothendieck
2005-Jun-28 15:25 UTC
[Rd] boxplot by factor (Package base version 2.1.1) ( PR#7976)
Based on Andy's comment a workaround can consist of not using
boxplot.formula, e.g. using the data frame d defined by the original
poster (see below):

    boxplot( by(d, d$b, function(x) x$a) )

On 6/28/05, Liaw, Andy <andy_liaw at merck.com> wrote:
> The issue is not with boxplot, but with split. boxplot.formula()
> calls boxplot(split(mf[[response]], mf[-response]), ...),
> but look at what split() returns when there are empty levels in
> the factor:
>
> > f <- factor(gl(3, 6), levels=1:5)
> > y <- rnorm(f)
> > split(y, f)
> $"1"
> [1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520
>
> $"2"
> [1] -1.1296642 -0.4808355 -0.2789933  0.1220718  0.1287742 -0.7573801
>
> $"3"
> [1]  1.2320902  0.5090700 -1.5508074  2.1373780  1.1681297 -0.7151561
>
> The "culprit" is the following in split.default():
>
>     f <- factor(f)
>
> which drops empty levels in f, if there are any. BTW, ?split doesn't
> mention what it does in such a situation. Perhaps it should?
>
> If this is to be "fixed", I suppose an additional argument, e.g.,
> drop=TRUE, can be added, and the corresponding line mentioned
> above changed to something like:
>
>     if (drop || !is.factor(f)) f <- factor(f)
>
> Then this additional argument can be passed on from boxplot.formula() to
> split().
>
> Just my $0.02...
>
> Andy
>
> > From: mwtoews at sfu.ca
> >
> > I consider this to be an old bug, which also persists in Splus 7. It
> > is unnecessary, and annoying.
> >
> > ## Section 1: Consider a simple data frame with three possible
> > ## factors (in levels)
> >
> > d <- data.frame(a=sort(rnorm(10)*10),
> >                 b=factor(c(rep("A",4), rep("C",6)), levels=c("A","B","C")))
> > tapply(d$a, d$b, mean) # returns three results, which I would expect
> > plot(a ~ b, d) # plots only two of three objects, ignoring that there
> >                # was "C" in the second position
> >
> > # if I tried to plot a blank in between the two boxplots:
> > plot(a ~ b, d, at=1:3) # nope: error
> > plot(a ~ b, d, at=c(1,3)) # nope: out of range (also xlim does
> >                           # nothing for the formula boxplot method)
> >
> > # to make this work with the current R/Splus implementation, I have
> > # to add a zero:
> > d <- rbind(d, data.frame(a=0,b="B")) # which I don't want to do,
> >                                      # since there are no "B"
> > plot(a ~ b, d) # yuk!
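The idea behind this workaround, passing boxplot() a list that already has one entry per factor level, can also be sketched without by(). The snippet below is an assumption-laden illustration rather than the code from the thread: it builds the list by hand from levels(d$b), so the empty level "B" is kept as a zero-length group. The names groups and counts are introduced here for illustration, and the plotting call is guarded because how an empty group is drawn may depend on the R version.

```r
set.seed(1)
# Same shape as the original poster's data frame: level "B" has no data.
d <- data.frame(a = sort(rnorm(10) * 10),
                b = factor(c(rep("A", 4), rep("C", 6)),
                           levels = c("A", "B", "C")))

# One list element per declared level, including the empty one.
groups <- lapply(levels(d$b), function(l) d$a[d$b == l])
names(groups) <- levels(d$b)

counts <- sapply(groups, length)
counts  # A = 4, B = 0, C = 6

# Plot only in an interactive session; the empty "B" group is
# expected to leave a blank slot between "A" and "C".
if (interactive()) boxplot(groups)
```

Because the list is built from the levels rather than from the values present, the number of positions on the x axis follows the factor definition, which is the behaviour the original poster asked for.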