mwtoews@sfu.ca
2005-Jun-28 07:40 UTC
[Rd] boxplot by factor (Package base version 2.1.1) (PR#7976)
I consider this to be an old bug, which also persists in Splus 7. It is unnecessary and annoying.

## Section 1: Consider a simple data frame with a factor that has three possible levels

d <- data.frame(a=sort(rnorm(10)*10),
                b=factor(c(rep("A",4), rep("C", 6)), levels=c("A","B","C")))
tapply(d$a, d$b, mean)  # returns three results, which I would expect
plot(a ~ b, d)  # plots only two of the three levels, ignoring the empty level "B" in the second position

# if I try to leave a blank between the two boxplots:
plot(a ~ b, d, at=1:3)     # nope: error
plot(a ~ b, d, at=c(1,3))  # nope: out of range (also, xlim does nothing for the formula boxplot method)

# to make this work with the current R/Splus implementation, I have to add a zero:
d <- rbind(d, data.frame(a=0,b="B"))  # which I don't want to do, since there are no "B" observations
plot(a ~ b, d)  # yuk!

## Section 2: Why is this important?

Consider another realistic example of [synthetic] daily temperature:

temp <- 5 - 10*cos(1:365*2*pi/365) + rnorm(365)*3
d1 <- data.frame(year=2005, jday=1:365, date=NA, month=NA, temp)  # jday is Julian day [1,365]
d1$date <- as.Date(paste(d1$year, d1$jday), "%Y %j")
d1$month <- factor(months(d1$date,TRUE), levels=month.abb)
plot(temp ~ month, d1)  # perfect, in a perfect meteorological world

d2 <- d1[!d1$month %in% c("Mar","Apr","May","Sep"),]  # now let's remove some data
tapply(d2$temp,d2$month,mean)  # perfect
plot(temp ~ month, d2)  # ugly, not 12 months, etc. (despite having 12 levels)

# again the only cure is to add zeros to the missing months (unnecessary forgery of data)
d3 <- d2
for (i in c("Mar","Apr","May","Sep")) {
  d3 <- rbind(d3,NA)
  d3$month[nrow(d3)] <- i
  d3$temp[nrow(d3)] <- 0
}
plot(temp ~ month, d3)  # still ugly, but at least has 12 months!

## Section 3: Solution

The obvious solution is to leave a blank where a boxplot should go, similar to tapply. This would have 1:n positions, where n is the number of levels of the factor, not the number of levels that have one or more observations.
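As a rough illustration of the 1:n layout, here is a minimal sketch that sidesteps the formula method: it splits the response by the factor itself, so every level gets a slot, and passes the resulting list to boxplot(). It assumes that split() keeps empty levels (its default, drop = FALSE) and that boxplot() on a list tolerates an empty group in the R version at hand. The names `groups` and `bp` are only for illustration and are not part of the report.

```r
# Same toy data as in Section 1: level "B" exists but has no observations.
d <- data.frame(a = sort(rnorm(10) * 10),
                b = factor(c(rep("A", 4), rep("C", 6)),
                           levels = c("A", "B", "C")))

# split() returns one element per factor level; "B" is numeric(0).
groups <- split(d$a, d$b)

# Compute the boxplot statistics without drawing anything.
bp <- boxplot(groups, plot = FALSE)

print(bp$names)  # one name per level, including the empty one
print(bp$n)      # observation count per level

# Draw only in an interactive session; the slot for "B" should stay
# blank while still getting a tick label (assumption: bxp() skips
# groups whose statistics are all NA).
if (interactive()) bxp(bp)
```

This is only a workaround sketch; the request stands that the formula method itself should honour the factor levels.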
The position should also have a label under the tick mark. I don't see any reason why the missing data should be completely ignored. Users who do not want blanks plotted where the data could go can simply type (for back-compatibility):

d2$month <- factor(d2$month)  # from 12 to 8 levels

which will produce the same 8-level plot as above.

## Section 4: Conclusion

I consider this to be a bug with regard to data representation, and this function is not consistent with other functions like `tapply'. Considering that the back-compatibility solution is very simple, and that most users would probably prefer a result including all levels (NULL or real values in each), I feel this is an appropriate improvement (and easy to fix in the code). At the very least, include an option to honour the factor levels.

Thanks.
-mt

--please do not edit the information below--

Version:
 platform = powerpc-apple-darwin8.1.0
 arch = powerpc
 os = darwin8.1.0
 system = powerpc, darwin8.1.0
 status = Patched
 major = 2
 minor = 1.1
 year = 2005
 month = 06
 day = 26
 language = R

Locale:
en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

Search Path:
 .GlobalEnv, package:methods, package:stats, package:graphics, package:grDevices, package:utils, package:datasets, Autoloads, package:base