thr3ads.net - R help - [R] Data Frame Indexing [Aug 2011]

Jesse Brown

2011-Aug-22 12:13 UTC

[R] Data Frame Indexing

Hello,

I've been dealing with a set of values that contain time stamps and part 
of my summary needs to look at just weekend data. In trying to limit the 
data I've found a large difference in performance in the way I index a 
data frame. I've constructed a minimal example here to try to explain my 
observation.

    is.weekend <- function(x) {
        tm <- as.POSIXlt(x,origin="1970/01/01")
        format(tm,"%a") %in% c("Sat","Sun")
    }

    use.lapply <- function(data) {
        data[do.call(rbind,lapply(data$TIME,FUN=is.weekend)),]
    }

    use.sapply <- function(data) {
        data[sapply(data$TIME,FUN=is.weekend),]
    }

    use.vapply <- function(data) {
        data[vapply(data$TIME,FUN=is.weekend,FALSE),]
    }

    use.indexing <- function(data) {
        data[is.weekend(data$TIME),]
    }

And the results of these methods:

     > names(csv.data)
    [1] "TIME"     "FILE"     "RADIAN"   "BITS"     "DURATION"

     > length(csv.data$TIME)
    [1] 21471

     > system.time(v1 <- use.lapply(csv.data))
       user  system elapsed
     19.562   6.402  25.967

     > system.time(v2 <- use.sapply(csv.data))
       user  system elapsed
     19.456   6.492  25.951

     > system.time(v3 <- use.vapply(csv.data))
       user  system elapsed
     19.334   6.468  25.808

     > system.time(v4 <- use.indexing(csv.data))
       user  system elapsed
      0.032   0.020   0.052

     > all(identical(v1,v2),identical(v2,v3),identical(v3,v4))
    [1] TRUE



Forgive what is probably a trivial question, but why is there such a 
large difference in the *apply functions as opposed to the direct 
indexing method? On the surface it seems as though the use.indexing 
method uses the entire vector as an argument to the function while the 
others /might/ iterate over the values using one at a time as an 
argument to the function. In either case all elements must be part of 
the calculation...

Thanks for any insight.

Jesse

jim holtman

2011-Aug-22 13:26 UTC


The problem is that, the way you are using the "*apply" functions, your
function is called once for each item.  With direct indexing, you make
only a single call, passing the whole vector of values.
Here is an illustration that shows the number of calls:
> # count the calls
> f.test <- function(x) callCnt <<- callCnt + 1  # test function; just increment counter
>
> # test vector
> x <- 1:100
> callCnt <- 0
> invisible(sapply(x, f.test))
> callCnt  # notice that there were 100 calls made
[1] 100
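
As a rough sketch of the same point applied to the weekend test, the snippet below counts the calls made by the per-element form and by the vectorized form. The hourly timestamps are synthetic stand-ins for csv.data$TIME, `counted` is a hypothetical wrapper, and the test reads the `wday` field instead of `format(tm, "%a")` so that the result does not depend on the locale.

```r
# Weekend test on POSIX timestamps given as seconds since the epoch.
# Assumption: wday is used instead of format(tm, "%a") to avoid locale issues.
is.weekend <- function(x) {
    tm <- as.POSIXlt(x, origin = "1970-01-01", tz = "UTC")
    tm$wday %in% c(0L, 6L)   # 0 = Sunday, 6 = Saturday
}

# Hypothetical wrapper that counts how often the test function is entered.
calls <- 0
counted <- function(x) {
    calls <<- calls + 1
    is.weekend(x)
}

# Synthetic data: 2000 hourly timestamps starting 2011-08-22 00:00 UTC.
n <- 2000
start <- as.numeric(as.POSIXct("2011-08-22", tz = "UTC"))
times <- start + (seq_len(n) - 1) * 3600

# Per-element form: one function call (and one date conversion) per timestamp.
per.element <- vapply(times, counted, logical(1))
calls.per.element <- calls

# Vectorized form: one call that converts the whole vector at once.
calls <- 0
vectorized <- counted(times)
calls.vectorized <- calls

# Both forms select the same rows; only the number of calls differs.
same.result <- identical(per.element, vectorized)
```

Each call carries a fixed overhead (dispatch, argument matching, building a POSIXlt object), so paying it once instead of once per row is where the saving comes from.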

This again indicates that you need to think about how to vectorize
your operations.  Also, if you had used Rprof, it would likely have
shown where you were spending the time.
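
A minimal sketch of what such a profiling run might look like, assuming synthetic timestamps in place of the real data. The vector `times` and the file name `prof.file` are made up for illustration; `Rprof` and `summaryRprof` are the standard profiling functions in the utils package.

```r
# Weekend test as in the original post (result depends on an English locale).
is.weekend <- function(x) {
    tm <- as.POSIXlt(x, origin = "1970-01-01")
    format(tm, "%a") %in% c("Sat", "Sun")
}

# Synthetic stand-in for csv.data$TIME: hourly timestamps in seconds.
n <- 20000
times <- 1313971200 + (seq_len(n) - 1) * 3600

# Profile the per-element version; samples are written to a temporary file.
prof.file <- tempfile(fileext = ".out")
Rprof(prof.file, interval = 0.01)
res <- sapply(times, is.weekend)
Rprof(NULL)   # stop profiling

# Summarize by time spent in each function itself. A very short run may
# record no samples, in which case summaryRprof() signals an error.
prof <- tryCatch(summaryRprof(prof.file), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.self))
unlink(prof.file)
```

If the per-element overhead is the culprit, the top entries should be the date conversion and formatting helpers that run inside every call rather than the data frame indexing itself.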



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
