thr3ads.net - R devel - [Rd] Speed difference between df$a[1] and df[1,"a"] [Oct 2011]

Stavros Macrakis

2011-Oct-19 21:34 UTC

[Rd] Speed difference between df$a[1] and df[1,"a"]

I was surprised to find that df$a[1] is an order of magnitude faster than
df[1,"a"]:
> df <- data.frame(a=1:10)
> system.time(replicate(100000, df$a[3]))
   user  system elapsed
   0.36    0.00    0.36
> system.time(replicate(100000, df[3,"a"]))
   user  system elapsed
   4.09    0.00    4.09


A priori, I'd have thought that combining the row and column selections into
a single operation would at worst be equally fast, at best would be faster
by having fewer intermediate results and avoiding redundant operations.

I thought this might be because df[,] builds a data frame before simplifying
it to a vector, but with drop=F, it is even slower, so that doesn't seem to
be the problem:
> system.time(replicate(100000, df[3,"a",drop=FALSE]))
   user  system elapsed
  15.00    0.00   14.99


I then wondered if it might be because '[' allows multiple columns and
handles rownames. Sure enough, '[[,]]', which allows only one column, and
does not handle rownames, is almost 3x faster:
> system.time(replicate(100000, df[[3,"a"]]))
   user  system elapsed
   1.48    0.00    1.48


...but it is still 4x slower than $[].

The timings are not sensitive to the number of rows in df (except for the
drop=FALSE case, which is much slower for large dfs).  I will be avoiding
[,] and [[,]] when I don't need their functionality, but I still wonder why
they should be so much slower than $[].
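
For readers comparing the forms timed above, a minimal sketch (not part of the original post) that checks they return the same value in this simple case, and that drop=FALSE returns a data frame rather than a vector. The variable names v1, v2, v3 and d are made up for illustration.

```r
df <- data.frame(a = 1:10)

# The three scalar-returning forms agree on the value.
v1 <- df$a[3]        # extract the column as a vector, then index it
v2 <- df[3, "a"]     # matrix-style indexing; drop = TRUE is the default
v3 <- df[[3, "a"]]   # matrix-style `[[`, selects a single element

stopifnot(identical(v1, v2), identical(v2, v3))

# With drop = FALSE the result stays a one-row, one-column data frame.
d <- df[3, "a", drop = FALSE]
stopifnot(is.data.frame(d), identical(dim(d), c(1L, 1L)))
```

So the choice between the forms in this case is about cost and generality, not about the value returned.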

            -s

R 2.13.1 on Windows 7, i7-860 (2.8GHz) 8GB RAM

Allan Engelhardt

2011-Oct-20 20:58 UTC

`$` and `[` are primitives while `[.data.frame` is a longish R function 
that does all sorts of clever things.
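
One way to check this claim in a given R session is sketched below (an illustration, not taken from the thread). The name f is a made-up local alias for the data frame method, and the line count it reports will differ between R versions.

```r
# `$` and `[` are implemented internally (in C) as primitives.
stopifnot(is.primitive(`$`), is.primitive(`[`))

# The data frame method for `[` is an ordinary function written in R.
f <- base::`[.data.frame`
stopifnot(is.function(f), !is.primitive(f))

# Rough size of the method in deparsed lines; varies by R version.
n_lines <- length(deparse(f))
print(n_lines)
```

In df$a[3], both steps go through primitives: `$` pulls the column out of the list, and `[` then indexes a plain vector, so `[.data.frame` is never called.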


Thomas Lumley

2011-Oct-21 05:23 UTC

On Wed, Oct 19, 2011 at 2:34 PM, Stavros Macrakis <macrakis at alum.mit.edu> wrote:
> I was surprised to find that df$a[1] is an order of magnitude faster than
> df[1,"a"]:

Yes.  This treats a data frame as a list, and is much faster.

> I thought this might be because df[,] builds a data frame before simplifying
> it to a vector, but with drop=F, it is even slower, so that doesn't seem to
> be the problem:

drop=FALSE creates a data frame first, and then simplifies it to a
vector, so this test isn't showing what you think it is.

> I then wondered if it might be because '[' allows multiple columns and
> handles rownames. Sure enough, '[[,]]', which allows only one column, and
> does not handle rownames, is almost 3x faster:

That's part of it, but if you look at [.data.frame you see there is
also quite a bit of copying that could be avoided in simple cases but
is hard to avoid in full generality.
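
The "data frame as a list" point can be illustrated with a short sketch (not from the thread). It uses base R's .subset2(), which extracts a list element without method dispatch. The names col and x are made up for the example, and no timing claims are implied.

```r
df <- data.frame(a = 1:10)

# A data frame is stored as a list of equal-length columns.
stopifnot(is.list(df))

# List-style extraction of the column, then plain vector indexing.
col <- .subset2(df, "a")
x <- col[3]

# Same result as df$a[3], and the same column as the unclassed list.
stopifnot(identical(x, df$a[3]))
stopifnot(identical(unclass(df)$a, df$a))
```

None of these steps touches row names or builds a new data frame, which is the work the general `[.data.frame` method has to be prepared to do.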

    -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland
