Stavros Macrakis
2011-Oct-19 21:34 UTC
[Rd] Speed difference between df$a[1] and df[1,"a"]
I was surprised to find that df$a[1] is an order of magnitude faster than df[1,"a"]:

> df <- data.frame(a=1:10)
> system.time(replicate(100000, df$a[3]))
   user  system elapsed
   0.36    0.00    0.36
> system.time(replicate(100000, df[3,"a"]))
   user  system elapsed
   4.09    0.00    4.09

A priori, I'd have thought that combining the row and column selections into a single operation would at worst be equally fast, and at best would be faster by having fewer intermediate results and avoiding redundant operations.

I thought this might be because df[,] builds a data frame before simplifying it to a vector, but with drop=F, it is even slower, so that doesn't seem to be the problem:

> system.time(replicate(100000, df[3,"a",drop=FALSE]))
   user  system elapsed
  15.00    0.00   14.99

I then wondered if it might be because '[' allows multiple columns and handles rownames. Sure enough, '[[,]]', which allows only one column and does not handle rownames, is almost 3x faster:

> system.time(replicate(100000, df[[3,"a"]]))
   user  system elapsed
   1.48    0.00    1.48

...but it is still 4x slower than $[].

The timings are not sensitive to the number of rows in df (except for the drop=FALSE case, which is much slower for large dfs). I will be avoiding [,] and [[,]] when I don't need their functionality, but I still wonder why they should be so much slower than $[].

-s

R 2.13.1 on Windows 7, i7-860 (2.8GHz) 8GB RAM
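For anyone who wants to re-run the comparison, the four access forms can be timed side by side with something like the sketch below. It is not part of the original post: the helper `time_expr` and the repetition count `n` are illustrative names, the count is reduced to keep the run short, and absolute timings will vary with machine and R version.

```r
df <- data.frame(a = 1:10)
n <- 10000  # assumption: fewer repetitions than the 100000 used in the post

# Hypothetical helper: elapsed seconds for n calls of the zero-argument function f
time_expr <- function(f, n) {
  system.time(for (i in seq_len(n)) f())[["elapsed"]]
}

timings <- c(
  dollar_then_index = time_expr(function() df$a[3], n),
  single_bracket    = time_expr(function() df[3, "a"], n),
  single_no_drop    = time_expr(function() df[3, "a", drop = FALSE], n),
  double_bracket    = time_expr(function() df[[3, "a"]], n)
)
print(timings)

# The three vector-returning forms should agree on the value they extract
results <- list(df$a[3], df[3, "a"], df[[3, "a"]])
```

Wrapping each expression in a function adds a small constant call overhead to every form, so the ratios are only indicative.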
Allan Engelhardt
2011-Oct-20 20:58 UTC
[Rd] Speed difference between df$a[1] and df[1,"a"]
`$` and `[` are primitives while `[.data.frame` is a longish R function that does all sorts of clever things.
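The distinction can be checked from the console. The following is a small sketch rather than anything from the original message; the variable names are illustrative, and the line count of `[.data.frame` depends on the R version.

```r
# `$` and `[` are implemented in C as primitives
dollar_is_primitive  <- is.primitive(`$`)
bracket_is_primitive <- is.primitive(`[`)

# The data frame method for `[` is an ordinary R closure
method_is_primitive <- is.primitive(`[.data.frame`)
method_is_function  <- is.function(`[.data.frame`)

# Rough size of the method, in deparsed source lines (varies by R version)
method_lines <- length(deparse(`[.data.frame`))

print(c(dollar = dollar_is_primitive,
        bracket = bracket_is_primitive,
        data_frame_method = method_is_primitive))
print(method_lines)
```

Printing `[.data.frame` itself shows the argument checking, row-name handling and drop logic that every df[i, j] call runs through.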
On Wed, Oct 19, 2011 at 2:34 PM, Stavros Macrakis <macrakis at alum.mit.edu> wrote:
> I was surprised to find that df$a[1] is an order of magnitude faster than
> df[1,"a"]:

Yes. This treats a data frame as a list, and is much faster.

> I thought this might be because df[,] builds a data frame before simplifying
> it to a vector, but with drop=F, it is even slower, so that doesn't seem to
> be the problem:

drop=FALSE creates a data frame first, and then simplifies it to a vector, so this test isn't showing what you think it is.

> I then wondered if it might be because '[' allows multiple columns and
> handles rownames. Sure enough, '[[,]]', which allows only one column, and
> does not handle rownames, is almost 3x faster:

That's part of it, but if you look at [.data.frame you see there is also quite a bit of copying that could be avoided in simple cases but is hard to avoid in full generality.

-thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
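To illustrate the "data frame as a list" point, here is a minimal sketch (not from the original reply; the variable names are illustrative). It shows that a data frame is a list of columns, so a column can be pulled out with list-style access and then indexed as a plain vector, without going through `[.data.frame`.

```r
df <- data.frame(a = 1:10)

# A data frame is a list whose elements are the columns
df_is_list <- is.list(df)

# List-style extraction of the column, then ordinary vector indexing
via_dollar  <- df$a[3]
via_double  <- df[["a"]][3]
via_subset2 <- .subset2(df, "a")[3]  # internal list extraction, no method dispatch

# The general data frame method gives the same value, with more work per call
via_method <- df[3, "a"]

print(c(via_dollar, via_double, via_subset2, via_method))
```

The list-style forms skip row-name handling and argument checking, so they are only a drop-in replacement when a single existing column and plain positional row indices are all that is needed.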