Hadley Wickham
2011-Dec-28 15:37 UTC
[Rd] Subsetting a data frame vs. subsetting the columns
Hi all,

There seems to be rather a large speed disparity in subsetting when
working with a whole data frame vs. working with just columns
individually:

    df <- as.data.frame(replicate(10, runif(1e5)))
    ord <- order(df[[1]])

    system.time(df[ord, ])
    #    user  system elapsed
    #   0.043   0.007   0.059
    system.time(lapply(df, function(x) x[ord]))
    #    user  system elapsed
    #   0.022   0.008   0.029

What's going on?

I realise this isn't quite a fair example because the second case
makes a list, not a data frame, but I thought it would be a quick
operation to turn a list into a data frame if you don't do any
checking:

    list_to_df <- function(list) {
      n <- length(list[[1]])
      structure(list,
        class = "data.frame",
        row.names = c(NA, -n))
    }
    system.time(list_to_df(lapply(df, function(x) x[ord])))
    #    user  system elapsed
    #   0.031   0.017   0.048

So I guess this is slow because it has to make a copy of the whole
data frame to modify the structure. But couldn't `[.data.frame` avoid
that?

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
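For readers comparing the two approaches in the message above, the following is a minimal sketch of how one might check that they produce the same data. It reuses `list_to_df()` as posted; the `set.seed()` call and the names `a`, `b`, `same_columns` and `same_row_names` are assumptions added for illustration only.

```r
# Reproduce the setup from the message above.
set.seed(1)  # assumption: seed added only to make the sketch repeatable
df <- as.data.frame(replicate(10, runif(1e5)))
ord <- order(df[[1]])

# Unchecked list -> data frame conversion, as posted.
list_to_df <- function(list) {
  n <- length(list[[1]])
  structure(list, class = "data.frame", row.names = c(NA, -n))
}

a <- df[ord, ]
b <- list_to_df(lapply(df, function(x) x[ord]))

# Compare column contents only (as.list() drops class and row names).
same_columns <- identical(as.list(a), as.list(b))

# Compare row names: `a` keeps the original row names in permuted order,
# while `b` gets fresh automatic ones.
same_row_names <- identical(rownames(a), rownames(b))
```

So the two results are not drop-in replacements for each other wherever downstream code relies on row names, which is part of the extra bookkeeping `[.data.frame` performs.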
Simon Urbanek
2011-Dec-28 16:14 UTC
[Rd] Subsetting a data frame vs. subsetting the columns
Hadley,

there was a whole discussion about subsetting and subassigning data
frames (and general efficiency issues) some time ago (I can't find it
in a hurry but others might) -- just look at the `[.data.frame` code
to see why it's so slow. It would need to be pushed into C code to
allow certain optimizations, but it's quite complex code so I don't
think there were volunteers. So the advice is don't do it ;). Treating
DFs as lists is always faster since you get to the fast internal code.

Cheers,
S
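The advice to treat data frames as lists could be packaged roughly as follows. This is only a sketch under stated assumptions, not code from the thread or from base R: `subset_rows_as_list()` is a hypothetical helper, it assumes every column is a plain vector or factor (no matrix or data-frame columns), and it discards the original row names.

```r
# Hypothetical helper: row-subset a data frame by working on its columns
# as a plain list, skipping the checks done by `[.data.frame`.
subset_rows_as_list <- function(df, i) {
  stopifnot(is.data.frame(df))
  # Assumption: no matrix-like columns, so x[i] is a valid row subset.
  stopifnot(!any(vapply(df, function(x) length(dim(x)) == 2L, logical(1))))

  cols <- lapply(df, function(x) x[i])  # column names are preserved
  n <- if (length(cols)) length(cols[[1L]]) else 0L

  attr(cols, "row.names") <- .set_row_names(n)  # compact automatic row names
  class(cols) <- "data.frame"
  cols
}

d <- data.frame(a = c(3, 1, 2), b = c("x", "y", "z"),
                stringsAsFactors = FALSE)
r <- subset_rows_as_list(d, order(d$a))
```

Timing it against `d[order(d$a), ]` with `system.time()` on a larger data frame is the natural next step; results will vary by machine and R version.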