Hey guys,

I noticed something curious in the lapply call. I'll copy+paste the function definition here because it's short enough:

    lapply <- function (X, FUN, ...)
    {
        FUN <- match.fun(FUN)
        if (!is.vector(X) || is.object(X))
            X <- as.list(X)
        .Internal(lapply(X, FUN))
    }

Notice that lapply coerces X to a list whenever the !is.vector(X) || is.object(X) check is TRUE.

Curiously, data.frames trigger the coercion (is.vector(data.frame()) returns FALSE), even though coercing a data.frame to a list seems unnecessary for the *apply family of functions.

Is there a reason why we must coerce data.frames to list for these functions? I thought data.frames were essentially just 'structured lists'?

I ask because coercing a (large) data.frame to a list is generally quite slow, and it seems like this could be avoided for data.frames.

Thanks,
-Kevin
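As a small illustration of the check discussed above, the sketch below shows why a data.frame takes the as.list() branch even though it is stored as a list. The data frame `df` and the variable names are made up for the example.

```r
# A small, made-up data frame for illustration only
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))

typeof(df)      # "list": the underlying storage is a list of columns
is.list(df)     # TRUE
is.vector(df)   # FALSE: it carries attributes other than names (class, row.names)
is.object(df)   # TRUE: it has a class attribute

# The condition that lapply() tests; TRUE means as.list() gets called
needs_coercion <- !is.vector(df) || is.object(df)

# as.list() on a data frame drops class and row.names,
# leaving a plain named list that passes is.vector()
plain <- as.list(df)
is.vector(plain)   # TRUE
```

So the coercion here amounts to stripping attributes; the columns themselves are the same list elements before and after.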
R. Michael Weylandt
2013-Jan-05 20:46 UTC
[R] lapply (and friends) with data.frames are slow
On Sat, Jan 5, 2013 at 7:38 PM, Kevin Ushey <kevinushey at gmail.com> wrote:
> Hey guys,
>
> I noticed something curious in the lapply call. I'll copy+paste the
> function call here because it's short enough:
>
>     lapply <- function (X, FUN, ...)
>     {
>         FUN <- match.fun(FUN)
>         if (!is.vector(X) || is.object(X))
>             X <- as.list(X)
>         .Internal(lapply(X, FUN))
>     }
>
> Notice that lapply coerces X to a list if the !is.vector || is.object(X)
> check passes.
>
> Curiously, data.frames fail the test (is.vector(data.frame()) returns
> FALSE); but it seems that coercion of a data.frame
> to a list would be unnecessary for the *apply family of functions.
>
> Is there a reason why we must coerce data.frames to list for these
> functions? I thought data.frames were essentially just 'structured lists'?
>
> I ask because it is generally quite slow coercing a (large) data.frame to a
> list, and it seems like this could be avoided for data.frames.

Not sure it's a huge deal, but it does seem to be an avoidable function call with something like this:

    lapply1 <- function (X, FUN, ...)
    {
        FUN <- match.fun(FUN)
        # coerce unless X is a plain vector/list or a data.frame
        if (!((is.vector(X) && !is.object(X)) || is.data.frame(X)))
            X <- as.list(X)
        .Internal(lapply(X, FUN))
    }

On a quick benchmark with system.time:

    xx <- data.frame(rnorm(5e7), rexp(5e7), runif(5e7))
    xx <- cbind(xx, xx, xx, xx, xx)
    system.time(lapply(xx, range))
    system.time(lapply1(xx, range))

It saves me about 50% of the time; that's of course only using a relatively cheap FUN argument.

Others will hopefully comment more.

M
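One quick way to check that skipping the coercion does not change results is to compare against the stock lapply on a small input. This is only a sketch under stated assumptions: `lapply_nocoerce` is a hypothetical name for a variant that coerces only when X is neither a plain vector/list nor a data.frame; it relies on .Internal, which is not a supported interface for user code; and the data are made up and far smaller than in the benchmark above.

```r
# Hypothetical variant of lapply that skips as.list() for data frames.
# NOTE: .Internal() is not intended for user code and may change between R versions.
lapply_nocoerce <- function(X, FUN, ...) {
    FUN <- match.fun(FUN)
    if (!((is.vector(X) && !is.object(X)) || is.data.frame(X)))
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}

set.seed(1)
small <- data.frame(x = rnorm(1e4), y = rexp(1e4), z = runif(1e4))
square <- function(i) i^2

# Results should match the stock lapply for a data frame and a plain vector
same_df  <- identical(lapply(small, range), lapply_nocoerce(small, range))
same_vec <- identical(lapply(1:5, square), lapply_nocoerce(1:5, square))

# Timings depend on machine and R version, so no expected values are shown
system.time(for (k in 1:200) lapply(small, range))
system.time(for (k in 1:200) lapply_nocoerce(small, range))
```

Any difference in timing will depend on the R version, the size of the data frame, and how expensive FUN is.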
On Jan 5, 2013, at 11:38 AM, Kevin Ushey wrote:
> Hey guys,
>
> I noticed something curious in the lapply call. I'll copy+paste the
> function call here because it's short enough:
>
>     lapply <- function (X, FUN, ...)
>     {
>         FUN <- match.fun(FUN)
>         if (!is.vector(X) || is.object(X))
>             X <- as.list(X)
>         .Internal(lapply(X, FUN))
>     }
>
> Notice that lapply coerces X to a list if the !is.vector ||
> is.object(X) check passes.
>
> Curiously, data.frames fail the test (is.vector(data.frame()) returns
> FALSE); but it seems that coercion of a data.frame
> to a list would be unnecessary for the *apply family of functions.
>
> Is there a reason why we must coerce data.frames to list for these
> functions? I thought data.frames were essentially just 'structured
> lists'?
>
> I ask because it is generally quite slow coercing a (large)
> data.frame to a list, and it seems like this could be avoided for
> data.frames.

Is this related to this SO question that uses the microbenchmark function to illustrate the costs of the (possibly) superfluous coercion?

http://stackoverflow.com/questions/14169818/why-is-sapply-relatively-slow-when-querying-attributes-on-variables-in-a-data-fr

-- 
David Winsemius, MD
Alameda, CA, USA