thr3ads.net - R devel - [Rd] boxplot by factor (Package base version 2.1.1) ( PR#7976) [Jun 2005]

Liaw, Andy

2005-Jun-28 12:37 UTC

[Rd] boxplot by factor (Package base version 2.1.1) ( PR#7976)

The issue is not with boxplot, but with split.  boxplot.formula() 
calls boxplot(split(mf[[response]], mf[-response]), ...), 
but look at what split() returns when there are empty levels in
the factor:

> f <- factor(gl(3, 6), levels=1:5)
> y <- rnorm(f)
> split(y, f)
$"1"
[1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520

$"2"
[1] -1.1296642 -0.4808355 -0.2789933  0.1220718  0.1287742 -0.7573801

$"3"
[1]  1.2320902  0.5090700 -1.5508074  2.1373780  1.1681297 -0.7151561

The "culprit" is the following in split.default():

    f <- factor(f)

which drops empty levels in f, if there are any.  BTW, ?split doesn't
mention what it does in such a situation.  Perhaps it should?
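
A minimal sketch of the level dropping described above, using only base R. It calls factor() directly rather than split(), so it does not depend on how any particular R version implements split.default(); the variable name g is just for illustration:

```r
# A factor with 5 declared levels, only 3 of which occur in the data.
f <- factor(gl(3, 6), levels = 1:5)

nlevels(f)   # 5: the declared levels, including the empty "4" and "5"
table(f)     # counts per level; the empty levels appear with count 0

# Re-applying factor() rebuilds the levels from the values actually
# present, which is how the empty levels get lost.
g <- factor(f)
nlevels(g)   # 3
levels(g)    # "1" "2" "3"
```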

If this is to be "fixed", I suppose an additional argument, e.g.,
drop=TRUE, can be added, and the corresponding line mentioned
above changed to something like:

    if (drop || !is.factor(f)) f <- factor(f)

Then this additional argument can be passed on from boxplot.formula() to 
split().
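
To make the suggestion concrete, here is a rough sketch of how the proposed line would behave. split_sketch() is a hypothetical stand-in written for this illustration, not the real split.default(); it handles only an atomic vector x and a single grouping f:

```r
# Hypothetical helper, NOT the real split.default(): it only illustrates
# the proposed `drop` logic.
split_sketch <- function(x, f, drop = TRUE) {
  # Proposed change: rebuild the factor only when asked to drop empty
  # levels, or when f is not a factor yet.
  if (drop || !is.factor(f)) f <- factor(f)
  groups <- lapply(levels(f), function(lev) x[which(f == lev)])
  names(groups) <- levels(f)
  groups
}

f <- factor(gl(3, 6), levels = 1:5)
y <- rnorm(length(f))

names(split_sketch(y, f))                  # "1" "2" "3"
names(split_sketch(y, f, drop = FALSE))    # "1" "2" "3" "4" "5"
lengths(split_sketch(y, f, drop = FALSE))  # empty levels have length 0
```

With drop = FALSE the empty levels stay in the result as zero-length vectors, which is what a plotting function would need in order to leave a gap. Current versions of R document a drop argument on split() itself; check ?split for the version you are running.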

Just my $0.02...

Andy

> From: mwtoews at sfu.ca
> 
> I consider this to be an old bug, which also persists in Splus 7. It  
> is unnecessary, and annoying.
> 
> ## Section 1: Consider a simple data frame with a factor that has
> ## three possible levels
> 
> d <- data.frame(a=sort(rnorm(10)*10),
>                 b=factor(c(rep("A",4), rep("C",6)),
>                          levels=c("A","B","C")))
> tapply(d$a, d$b, mean) # returns three results, which I would expect
> plot(a ~ b, d) # plots only two of the three levels, dropping the empty
>                # "B" so that "C" ends up in the second position
> 
> # if I tried to plot a blank in between the two boxplots:
> plot(a ~ b, d, at=1:3) # nope: error
> plot(a ~ b, d, at=c(1,3)) # nope: out of range (also xlim does
>                           # nothing for the formula boxplot method)
> 
> # to make this work with the current R/Splus implementation, I have
> # to add a zero:
> d <- rbind(d, data.frame(a=0,b="B")) # which I don't want to do,
>                                      # since there are no "B"
> plot(a ~ b, d) # yuk!
> 
> ## Section 2: Why is this important? Consider another realistic
> ## example of [synthetic] daily temperature
> 
> temp <- 5 - 10*cos(1:365*2*pi/365) + rnorm(365)*3
> d1 <- data.frame(year=2005, jday=1:365, date=NA, month=NA, temp)
> # jday is Julian day [1,365]
> d1$date <- as.Date(paste(d1$year, d1$jday), "%Y %j")
> d1$month <- factor(months(d1$date,TRUE), levels=month.abb)
> plot(temp ~ month, d1) # perfect, in a perfect meteorological world
> 
> # now let's remove some data
> d2 <- d1[!d1$month %in% c("Mar","Apr","May","Sep"),]
> tapply(d2$temp,d2$month,mean)  # perfect
> plot(temp ~ month, d2) # ugly, not 12 months, etc. (despite
>                        # having 12 levels)
> 
> # again the only cure is to add zeros to the missing months
> # (unnecessary forgery of data)
> d3 <- d2
> for (i in c("Mar","Apr","May","Sep")) {
>      d3 <- rbind(d3,NA)
>      d3$month[nrow(d3)] <- i
>      d3$temp[nrow(d3)] <- 0
> }
> plot(temp ~ month, d3) # still ugly, but at least has 12 months!
> 
> ## Section 3: Solution
> The obvious solution is to leave a blank where a boxplot should go,
> similar to tapply. This would have 1:n positions, where n is the
> number of levels of the factor, not the number of levels that have
> one or more observations.  The position should also have a label
> under the tick mark.
> I don't see any reason why the missing data should be completely
> ignored. Users who do not want blanks plotted where the data could go
> can simply type (for back-compatibility):
> 
> d2$month <- factor(d2$month) # from 12 to 8 levels
> 
> Which will produce the same 8-level plot as above.
> 
> ## Section 4: Conclusion
> I consider this to be a bug with regard to data representation, and
> this function is not consistent with other functions like `tapply'.
> Considering that the back-compatibility solution is very simple, and
> most users would probably prefer a result including all levels (NULL
> or real values in each), I feel this is an appropriate improvement (and
> easy to fix in the code). At the very least, include an option to
> honour the factor levels.
> 
> Thanks.
> -mt
> 
> --please do not edit the information below--
> 
> Version:
> platform = powerpc-apple-darwin8.1.0
> arch = powerpc
> os = darwin8.1.0
> system = powerpc, darwin8.1.0
> status = Patched
> major = 2
> minor = 1.1
> year = 2005
> month = 06
> day = 26
> language = R
> 
> Locale:
> en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
> 
> Search Path:
> .GlobalEnv, package:methods, package:stats, package:graphics,  
> package:grDevices, package:utils, package:datasets, Autoloads,  
> package:base
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> 
>

Peter Dalgaard

2005-Jun-28 12:57 UTC

"Liaw, Andy" <andy_liaw at merck.com> writes:
> The issue is not with boxplot, but with split.  boxplot.formula() 
> calls boxplot(split(mf[[response]], mf[-response]), ...), 
> but look at what split() returns when there are empty levels in
> the factor:
> 
> > f <- factor(gl(3, 6), levels=1:5)
> > y <- rnorm(f)
> > split(y, f)
> $"1"
> [1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520
> 
> $"2"
> [1] -1.1296642 -0.4808355 -0.2789933  0.1220718  0.1287742 -0.7573801
> 
> $"3"
> [1]  1.2320902  0.5090700 -1.5508074  2.1373780  1.1681297 -0.7151561
> 
> The "culprit" is the following in split.default():
> 
>     f <- factor(f)
> 
> which drops empty levels in f, if there are any.  BTW, ?split doesn't
> mention what it does in such situation.  Perhaps it should?
> 
> If this is to be "fixed", I suppose an additional argument, e.g.,
> drop=TRUE, can be added, and the corresponding line mentioned
> above changed to something like:
> 
>     if (drop || !is.factor(f)) f <- factor(f)
> 
> Then this additional argument can be passed on from boxplot.formula() to 
> split().

Alternatively, I suspect that the intention was as.factor() rather
than factor(). It does require a bit of care to fix it that way,
though. There could be problems with empty levels popping up in
unexpected places. 
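
A short sketch of the difference between the two calls, and of one place where retained empty levels would resurface. The data are made up for illustration (y is simply 1:18), and tapply() is used here only as a convenient per-level summary:

```r
f <- factor(gl(3, 6), levels = 1:5)

# as.factor() returns an existing factor unchanged, so empty levels
# survive; factor() recomputes the levels and drops them.
nlevels(as.factor(f))   # 5
nlevels(factor(f))      # 3

# Per-level summaries then carry an entry for every declared level.
y <- seq_along(f)
means <- tapply(y, as.factor(f), mean)
means                   # levels "4" and "5" have no data: their entries are NA
```

Any code downstream of split() that assumes every group is non-empty would need the same kind of check.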

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907

Gabor Grothendieck

2005-Jun-28 15:25 UTC

Based on Andy's comment, a workaround is to avoid
boxplot.formula, e.g., using the data frame d
defined by the original poster (see below):

	boxplot( by(d, d$b, function(x)x$a) )
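
As a sketch of why this helps: by() keeps one entry per factor level, so the empty level still occupies a slot in the list handed to boxplot(). The code below only inspects that list and does not plot; the seed is arbitrary and the name groups is made up for this illustration. How boxplot() treats the empty entry may depend on the R version.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
d <- data.frame(a = sort(rnorm(10) * 10),
                b = factor(c(rep("A", 4), rep("C", 6)),
                           levels = c("A", "B", "C")))

groups <- by(d, d$b, function(x) x$a)

length(groups)          # 3: one entry per level
dimnames(groups)[[1]]   # "A" "B" "C"
is.null(groups[[2]])    # TRUE: the empty level "B" is kept as a NULL entry
```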


On 6/28/05, Liaw, Andy <andy_liaw at merck.com> wrote:
> [...]
> > From: mwtoews at sfu.ca
> >
> > d <- data.frame(a=sort(rnorm(10)*10),
> >                 b=factor(c(rep("A",4), rep("C",6)),
> >                          levels=c("A","B","C")))
> > [...]
