thr3ads.net - R help - [R] by inconsistently strips class

Alex Brown

2008-Apr-15 11:28 UTC

[R] by inconsistently strips class - with fix

summary:

The function 'by' inconsistently strips class from the data to which  
it is applied.

quick reason:

tapply strips class when simplify is set to TRUE (the default) due to  
the class stripping behaviour of unlist.

quick answer:

This can be fixed by invoking tapply with simplify=FALSE, or by changing  
tapply to use do.call(c, ...) instead of unlist(...).

executable example:

mytimes=data.frame(date = 1:3 + Sys.time(), set = c(1,1,2))

by(mytimes$date, mytimes$set, function(x)x)

INDICES: 1
[1] "2008-04-15 11:41:38 BST" "2008-04-15 11:41:39 BST"
----------------------------------------------------------------------------------------
INDICES: 2
[1] "2008-04-15 11:41:40 BST"

by(mytimes[1,]$date, mytimes[1,]$set, function(x)x)

INDICES: 1
[1] 1208256099
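
The same asymmetry can be reproduced one level down, in tapply itself. The following is a minimal sketch, not part of the original post; it uses a fixed timestamp in UTC (an assumption made so the result is reproducible) and compares the default simplify=TRUE with simplify=FALSE for a single-row group.

```r
# Fixed timestamps (assumed values, chosen only for reproducibility)
dates <- as.POSIXct("2008-04-15 11:41:38", tz = "UTC") + 0:2
sets  <- c(1, 1, 2)

# One observation in one group: every result has length 1,
# so tapply simplifies and the POSIXct class is lost.
simple <- tapply(dates[1], sets[1], function(x) x)

# With simplify = FALSE the result is a list array whose
# elements are stored untouched.
kept <- tapply(dates[1], sets[1], function(x) x, simplify = FALSE)

class(kept[[1]])   # "POSIXct" "POSIXt"
```

The price of simplify=FALSE is that the result is always a list, even when a plain vector would have been convenient.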

why this is a problem:

This is a problem when you are feeding the output of this by into a  
function which expects the class to be maintained.  I see this problem  
when constructing

reason:

tapply strips class when simplify is set to TRUE (the default) due to  
the behaviour of unlist:

"Where possible the list elements are coerced to a common mode during  
the unlisting, and so the result often ends up as a character vector.  
Vectors will be coerced to the highest type of the components in the  
hierarchy NULL < raw < logical < integer < real < complex < character  
< list < expression: pairlists are treated as lists."
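
A small illustration of this behaviour of unlist, written for this note rather than taken from the original post. The timestamp and the time zone are assumed values.

```r
t0 <- as.POSIXct("2008-04-15 11:41:38", tz = "UTC")
pieces <- list(t0, t0 + 1)

# unlist() keeps the underlying doubles but drops the class attribute
flat <- unlist(pieces)
class(flat)   # "numeric"

# The class has to be put back by hand if it is needed afterwards
restored <- as.POSIXct(flat, origin = "1970-01-01", tz = "UTC")
```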

solution:

This problem can be fixed in the function by.data.frame by modifying  
the call to tapply in the function "by":

by.data.frame = function (data, INDICES, FUN, ...)
{
   if (!is.list(INDICES)) {
       IND <- vector("list", 1)
       IND[[1]] <- INDICES
       names(IND) <- deparse(substitute(INDICES))[1]
   }
   else IND <- INDICES
   FUNx <- function(x) FUN(data[x, ], ...)
   nd <- nrow(data)
   ## original call:
   ##   ans <- eval(substitute(tapply(1:nd, IND, FUNx)), data)
   ## proposed replacement:
   ans <- eval(substitute(tapply(1:nd, IND, FUNx, simplify=FALSE)), data)
   attr(ans, "call") <- match.call()
   class(ans) <- "by"
   ans
}
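
For readers who would rather not redefine a base function, roughly the same effect can be had with a small wrapper. This is only a sketch under stated assumptions: by_keep_class is a hypothetical name, it handles a single grouping vector only, and it returns a plain named list rather than an object of class "by".

```r
# Hypothetical helper: apply FUN to each group of rows and
# never simplify, so classes of the results are preserved.
by_keep_class <- function(data, INDICES, FUN, ...) {
  rows <- split(seq_len(nrow(data)), INDICES)
  lapply(rows, function(i) FUN(data[i, , drop = FALSE], ...))
}

# Fixed timestamps (assumed values) instead of Sys.time()
mytimes <- data.frame(
  date = as.POSIXct("2008-04-15 11:41:38", tz = "UTC") + 0:2,
  set  = c(1, 1, 2)
)

all_rows <- by_keep_class(mytimes, mytimes$set, function(d) d$date)
one_row  <- by_keep_class(mytimes[1, ], mytimes$set[1], function(d) d$date)
```

Both results hold POSIXct values, including the single-row case that by() reduced to a bare number.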

alternative solution:

the call in tapply to unlist(ans, recursive=F) can be replaced by  
do.call(c, ans) to fix this issue, since c does not strip  
class.  (do.call itself takes no recursive argument, and c already  
defaults to recursive=FALSE.)

However, I haven't taken the time to work out if this will work in all  
cases.

for example:

function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
{
   FUN <- if (!is.null(FUN))
       match.fun(FUN)
   if (!is.list(INDEX))
       INDEX <- list(INDEX)
   nI <- length(INDEX)
   namelist <- vector("list", nI)
   names(namelist) <- names(INDEX)
   extent <- integer(nI)
   nx <- length(X)
   one <- 1L
   group <- rep.int(one, nx)
   ngroup <- one
   for (i in seq.int(INDEX)) {
       index <- as.factor(INDEX[[i]])
       if (length(index) != nx)
           stop("arguments must have same length")
       namelist[[i]] <- levels(index)
       extent[i] <- nlevels(index)
       group <- group + ngroup * (as.integer(index) - one)
       ngroup <- ngroup * nlevels(index)
   }
   if (is.null(FUN))
       return(group)
   ans <- lapply(split(X, group), FUN, ...)
   index <- as.integer(names(ans))
   if (simplify && all(unlist(lapply(ans, length)) == 1)) {
       ansmat <- array(dim = extent, dimnames = namelist)
       ## original:
       ##   ans <- unlist(ans, recursive = FALSE)
       ## proposed replacement (do.call() takes no 'recursive' argument;
       ## c() already defaults to recursive = FALSE):
       ans <- do.call(c, ans)
   }
   else {
       ansmat <- array(vector("list", prod(extent)), dim = extent,
           dimnames = namelist)
   }
   if (length(index)) {
       names(ans) <- NULL
       ansmat[index] <- ans
   }
   ansmat
}
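
The difference between the two calls can be checked in isolation. The sketch below is an illustration added here, with an assumed fixed timestamp; ans stands in for the list that lapply produces inside tapply.

```r
t0  <- as.POSIXct("2008-04-15 11:41:38", tz = "UTC")
ans <- list(`1` = t0, `2` = t0 + 2)

# What tapply does when simplifying: the class is dropped
via_unlist <- unlist(ans, recursive = FALSE)

# The proposed alternative: c() dispatches on the first element,
# and a c() method exists for POSIXct, so the class survives here
via_c <- do.call(c, ans)

class(via_c)   # "POSIXct" "POSIXt"
```

This only shows the POSIXct case; it says nothing about classes for which no c() method is defined.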

Alexander Brown
Principal Engineer
Transitive
Maybrook House, 40 Blackfriars Street, Manchester M3 2EG
Phone: +44 (0)161 836 2321    Fax: +44 (0)161 836 2399    Mobile: +44  
(0)7980 708 221
www.transitive.com
* The leader in cross-platform virtualization

Prof Brian Ripley

2008-Apr-17 06:03 UTC

[R] by inconsistently strips class - with fix

Unfortunately your proposed change changes the type of the output: 
simplification is intended in many applications of by().

Before:
> str(by(mytimes$date[1], mytimes$set[1], function(x)x))
 by [, 1] 1.21e+09
  - attr(*, "dimnames")=List of 1
   ..$ mytimes$set[1]: chr "1"
  - attr(*, "call")= language by.default(data = mytimes$date[1], INDICES = mytimes$set[1], FUN = function(x) x)

After:
> str(by(mytimes$date[1], mytimes$set[1], function(x)x))
List of 1
  $ 1: POSIXct[1:1], format: "2008-04-17 06:53:31"
  - attr(*, "dim")= int 1
  - attr(*, "dimnames")=List of 1
   ..$ mytimes$set[1]: chr "1"
  - attr(*, "call")= language by.default(data = mytimes$date[1], INDICES = mytimes$set[1], FUN = function(x) x)
  - attr(*, "class")= chr "by"

c() does not do the same thing as unlist() in general, and it is untrue 
that 'c does not strip class'.  What happens in your example is that there
is a c() method for your class (and not many others).
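
To make that point concrete, here is a short sketch added for illustration. The class name "reading" is hypothetical and deliberately has no c() method; the timestamp is an assumed value.

```r
# An S3 class with no c() method: the default c() drops the class
x <- structure(c(1.5, 2.5), class = "reading")
y <- structure(3.5, class = "reading")
combined <- c(x, y)
class(combined)   # "numeric"

# POSIXct keeps its class only because a c() method exists for it
t0 <- as.POSIXct("2008-04-15 11:41:38", tz = "UTC")
times <- c(t0, t0 + 1)
class(times)      # "POSIXct" "POSIXt"
```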

What we could do is add a 'simplify' argument to by() so you can control
the simplification.
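
A sketch of how such an argument would be used, assuming an R version in which by() accepts simplify (this is not the case for the version discussed in this thread). The data frame is the one from the original post with an assumed fixed timestamp in place of Sys.time().

```r
mytimes <- data.frame(
  date = as.POSIXct("2008-04-15 11:41:38", tz = "UTC") + 0:2,
  set  = c(1, 1, 2)
)
one_row <- mytimes[1, ]

# Default: a single length-1 result is simplified and loses its class
simplified <- by(one_row, one_row$set, function(d) d$date)

# simplify = FALSE: the result stays a list and the element keeps its class
kept <- by(one_row, one_row$set, function(d) d$date, simplify = FALSE)

class(unclass(kept)[[1]])   # "POSIXct" "POSIXt"
```

Existing callers that rely on simplification are unaffected, since the default is unchanged.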


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
