Sarah, you make it sound as though everyone should be using matrices, even
though they have distinct disadvantages for many types of analysis.
You are right that rbind on data frames is slow, but dplyr::bind_rows
handles data frames almost as fast as your rbind-ing matrices solution.
And if you apply knowledge of your data frames and don't do the error
checking that bind_rows does, you can beat both of them without converting
to matrices, as the "tm.dfcolcat" solution below illustrates. (Not for
everyday use, but if you have a big job and the data are clean this may
make a difference.)
Data frames, handled properly, are only slightly slower than matrices for
most purposes. I have seen numerical solutions of partial differential
equations run lightning fast using pre-allocated data frames and vector
calculations, so even traditional "matrix" calculation domains don't have
to use matrices to be competitive.
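As a flavor of what I mean (a minimal sketch, not from that project; the
grid size, coefficient, and column names are invented for illustration), an
explicit 1-D heat-equation step can be written as whole-column vector
operations on a pre-allocated data frame:

nx <- 100      # invented grid size
alpha <- 0.1   # invented dt * D / dx^2, kept below 0.5 for stability
pde <- data.frame( x = seq( 0, 1, length.out = nx )
                 , u = sin( pi * seq( 0, 1, length.out = nx ) )
                 )
for ( step in seq_len( 1000 ) ) {
  u <- pde$u   # pull the column out once per step...
  # ...then update all interior nodes in one vectorized expression
  pde$u[ 2:( nx - 1 ) ] <- u[ 2:( nx - 1 ) ] +
    alpha * ( u[ 3:nx ] - 2 * u[ 2:( nx - 1 ) ] + u[ 1:( nx - 2 ) ] )
}

The data frame overhead is paid once at allocation, not inside the loop,
because no row-by-row indexing of the data frame ever happens.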
######################
testsize <- 5000
N <- 20
set.seed(1234)
testdf.list <- lapply( seq_len( testsize )
                     , function( x ) {
                         data.frame( matrix( runif( 300 ), nrow=100 ) )
                       }
                     )
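# Each tm.* wrapper below times one approach with system.time(), which
# returns a proc_time vector; rbind-ing N of those gives one timing data
# frame per method. The first call of each wrapper is discarded ("toss the
# first one") so one-time warm-up costs don't contaminate the N kept runs.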
tm.rbind <- function( x = 0 ) {
  system.time( r.df <- do.call( "rbind", testdf.list ) )
}
#toss the first one
tm.rbind()
tms.rbind <- data.frame( do.call( rbind
                                , lapply( 1:N
                                        , tm.rbind
                                        )
                                )
                       , which = "rbind"
                       )
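# rbindm: convert each piece to a matrix first, then rbind the matrices
# (the fast approach from Sarah's post); conversion cost is included.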
tm.rbindm <- function( x = 0 ) {
  system.time({
    testm.list <- lapply( testdf.list, as.matrix )
    r.m <- do.call( rbind, testm.list )
  })
}
#toss the first one
tm.rbindm()
tms.rbindm <- data.frame( do.call( rbind
                                 , lapply( 1:N
                                         , tm.rbindm
                                         )
                                 )
                        , which = "rbindm"
                        )
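# dfcopy: pre-allocate a full-size data frame, then copy each piece into
# its row range; the row-indexed assignment into a data frame is costly.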
tm.dfcopy <- function( x = 0 ) {
  system.time({
    l.df <- data.frame( matrix( NA
                              , nrow=100 * testsize
                              , ncol=3
                              )
                      )
    for ( i in seq_len( testsize ) ) {
      start <- ( i - 1 ) * 100 + 1
      end <- i * 100
      l.df[ start:end, ] <- testdf.list[[ i ]]
    }
  })
}
#toss the first one
tm.dfcopy()
tms.dfcopy <- data.frame( do.call( rbind
                                 , lapply( 1:N
                                         , tm.dfcopy
                                         )
                                 )
                        , which = "dfcopy"
                        )
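# dfmatcopy: as dfcopy, but convert the pieces to matrices first; the
# target is still a data frame, so the indexed assignment stays slow.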
tm.dfmatcopy <- function( x = 0 ) {
  system.time({
    l.m <- data.frame( matrix( NA
                             , nrow=100 * testsize
                             , ncol=3
                             )
                     )
    testm.list <- lapply( testdf.list, as.matrix )
    for ( i in seq_len( testsize ) ) {
      start <- ( i - 1 ) * 100 + 1
      end <- i * 100
      l.m[ start:end, ] <- testm.list[[ i ]]
    }
  })
}
#toss the first one
tm.dfmatcopy()
tms.dfmatcopy <- data.frame( do.call( rbind
                                    , lapply( 1:N
                                            , tm.dfmatcopy
                                            )
                                    )
                           , which = "dfmatcopy"
                           )
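# bind_rows: dplyr's row-binding, which matches columns by name and checks
# type compatibility, but is heavily optimized compared with base rbind.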
tm.bind_rows <- function( x = 0 ) {
  system.time({
    dplyr::bind_rows( testdf.list )
  })
}
#toss the first one
tm.bind_rows()
tms.bind_rows <- data.frame( do.call( rbind
                                    , lapply( 1:N
                                            , tm.bind_rows
                                            )
                                    )
                           , which = "bind_rows"
                           )
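# dfcolcat: build the result column by column, concatenating each column
# across all the pieces with c() and assembling the full-length vectors
# into a data frame in one step. No per-piece class/name checking is done,
# so this is only safe when every piece has identical structure.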
tm.dfcolcat <- function( x = 0 ) {
  system.time({
    mycolnames <- names( testdf.list[[ 1 ]] )
    result <-
      setNames( data.frame( lapply( mycolnames
                                  , function( colidx ) {
                                      do.call( c
                                             , lapply( testdf.list
                                                     , function( v ) {
                                                         v[[ colidx ]]
                                                       }
                                                     )
                                             )
                                    }
                                  )
                          )
              , mycolnames
              )
  })
}
#toss the first one
tm.dfcolcat()
tms.dfcolcat <- data.frame( do.call( rbind
                                   , lapply( 1:N
                                           , tm.dfcolcat
                                           )
                                   )
                          , which = "dfcolcat"
                          )
tms.sarah <- read.table( text" user system elapsed which
34.280 0.009 34.317 tm.rbind
0.310 0.000 0.311 tm.rbindm
81.890 0.069 82.162 tm.dfcopy
67.664 0.047 68.009 tm.dfmatcopy
", header = TRUE, as.is=TRUE )
mergetms <- rbind( tms.rbind
                 , tms.rbindm
                 , tms.dfcopy
                 , tms.dfmatcopy
                 , tms.bind_rows
                 , tms.dfcolcat
                 )
mergetms$which <- factor( mergetms$which
                        , levels = c( "rbind"
                                    , "rbindm"
                                    , "dfcopy"
                                    , "dfmatcopy"
                                    , "bind_rows"
                                    , "dfcolcat"
                                    )
                        )
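# boxplots of user CPU time by method; the second plot restricts the
# y axis so the fast methods can be distinguished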
plot( user.self ~ which, data=mergetms )
plot( user.self ~ which, data=mergetms, ylim=c(0,4) )
summary( tms.rbind )
# user.self sys.self elapsed user.child sys.child
# Min. :18.84 Min. :0.0000 Min. :18.92 Min. : NA Min. : NA
# 1st Qu.:20.83 1st Qu.:0.0275 1st Qu.:20.96 1st Qu.: NA 1st Qu.: NA
# Median :22.91 Median :0.0400 Median :23.00 Median : NA Median : NA
# Mean :25.06 Mean :0.0430 Mean :25.21 Mean :NaN Mean :NaN
# 3rd Qu.:24.29 3rd Qu.:0.0600 3rd Qu.:24.39 3rd Qu.: NA 3rd Qu.: NA
# Max. :39.36 Max. :0.1000 Max. :39.94 Max. : NA Max. : NA
# NA's :20 NA's :20
summary( tms.rbindm )
# user.self sys.self elapsed user.child sys.child
# Min. :0.2200 Min. :0 Min. :0.2200 Min. : NA Min. : NA
# 1st Qu.:0.5600 1st Qu.:0 1st Qu.:0.5800 1st Qu.: NA 1st Qu.: NA
# Median :0.5850 Median :0 Median :0.5900 Median : NA Median : NA
# Mean :0.5465 Mean :0 Mean :0.5555 Mean :NaN Mean :NaN
# 3rd Qu.:0.5900 3rd Qu.:0 3rd Qu.:0.5925 3rd Qu.: NA 3rd Qu.: NA
# Max. :0.6100 Max. :0 Max. :0.6100 Max. : NA Max. : NA
# NA's :20 NA's :20
summary( tms.dfcopy )
# user.self sys.self elapsed user.child sys.child
# Min. :114.2 Min. :0.0000 Min. :114.3 Min. : NA Min. : NA
# 1st Qu.:122.7 1st Qu.:0.0000 1st Qu.:123.0 1st Qu.: NA 1st Qu.: NA
# Median :128.3 Median :0.0050 Median :128.4 Median : NA Median : NA
# Mean :134.5 Mean :0.0185 Mean :134.8 Mean :NaN Mean :NaN
# 3rd Qu.:134.7 3rd Qu.:0.0325 3rd Qu.:134.8 3rd Qu.: NA 3rd Qu.: NA
# Max. :261.5 Max. :0.0800 Max. :263.4 Max. : NA Max. : NA
# NA's :20 NA's :20
summary( tms.dfmatcopy )
# user.self sys.self elapsed user.child sys.child
# Min. : 98.15 Min. : 0.050 Min. :102.0 Min. : NA Min. : NA
# 1st Qu.:136.47 1st Qu.: 3.495 1st Qu.:144.6 1st Qu.: NA 1st Qu.: NA
# Median :147.53 Median : 7.135 Median :158.3 Median : NA Median : NA
# Mean :177.10 Mean : 7.030 Mean :185.2 Mean :NaN Mean :NaN
# 3rd Qu.:159.12 3rd Qu.:10.932 3rd Qu.:166.9 3rd Qu.: NA 3rd Qu.: NA
# Max. :362.95 Max. :16.100 Max. :364.3 Max. : NA Max. : NA
# NA's :20 NA's :20
summary( tms.bind_rows )
# user.self sys.self elapsed user.child sys.child
# Min. :0.8200 Min. :0 Min. :0.8200 Min. : NA Min. : NA
# 1st Qu.:0.8300 1st Qu.:0 1st Qu.:0.8375 1st Qu.: NA 1st Qu.: NA
# Median :0.8400 Median :0 Median :0.8400 Median : NA Median : NA
# Mean :0.8460 Mean :0 Mean :0.8480 Mean :NaN Mean :NaN
# 3rd Qu.:0.8525 3rd Qu.:0 3rd Qu.:0.8525 3rd Qu.: NA 3rd Qu.: NA
# Max. :0.9400 Max. :0 Max. :0.9900 Max. : NA Max. : NA
# NA's :20 NA's :20
summary( tms.dfcolcat )
# user.self sys.self elapsed user.child sys.child
# Min. :0.340 Min. :0 Min. :0.340 Min. : NA Min. : NA
# 1st Qu.:0.350 1st Qu.:0 1st Qu.:0.350 1st Qu.: NA 1st Qu.: NA
# Median :0.360 Median :0 Median :0.360 Median : NA Median : NA
# Mean :0.358 Mean :0 Mean :0.357 Mean :NaN Mean :NaN
# 3rd Qu.:0.360 3rd Qu.:0 3rd Qu.:0.360 3rd Qu.: NA 3rd Qu.: NA
# Max. :0.380 Max. :0 Max. :0.380 Max. : NA Max. : NA
# NA's :20 NA's :20
######################
On Mon, 27 Jun 2016, Sarah Goslee wrote:
> That's not what I said, though, and it's not necessarily true. Growing
> an object within a loop _is_ a slow process, but that's not the
> problem here. The problem is using data frames instead of matrices.
> The need to manage column classes is very costly. Converting to
> matrices will almost always be enormously faster.
>
> Here's an expansion of the previous example I posted, in four parts:
> 1. do.call with data frame - very slow - 34.317 s elapsed time for
> 5000 data frames
> 2. do.call with matrix - very fast - 0.311 s elapsed
> 3. pre-allocated loop with data frame - even slower (!) - 82.162 s
> 4. pre-allocated loop with matrix - still slow - 68.009 s
>
> It matters whether the columns are converted to numeric or character,
> and the time doesn't scale linearly with list length. For a particular
> problem, the best solution may vary greatly (and I didn't even include
> packages beyond the base functionality). In general, though, using
> matrices is faster than using data frames, and using do.call is faster
> than using a pre-allocated loop, which is much faster than growing an
> object.
>
> Sarah
>
>> testsize <- 5000
>>
>> set.seed(1234)
>> testdf <- data.frame(matrix(runif(300), nrow=100, ncol=3))
>> testdf.list <- lapply(seq_len(testsize), function(x)testdf)
>>
>> system.time(r.df <- do.call("rbind", testdf.list))
> user system elapsed
> 34.280 0.009 34.317
>>
>> system.time({
> + testm.list <- lapply(testdf.list, as.matrix)
> + r.m <- do.call("rbind", testm.list)
> + })
> user system elapsed
> 0.310 0.000 0.311
>>
>> system.time({
> + l.df <- data.frame(matrix(NA, nrow=100 * testsize, ncol=3))
> + for(i in seq_len(testsize)) {
> + start <- (i-1)*100 + 1
> + end <- i*100
> + l.df[start:end, ] <- testdf.list[[i]]
> + }
> + })
> user system elapsed
> 81.890 0.069 82.162
>>
>> system.time({
> + l.m <- data.frame(matrix(NA, nrow=100 * testsize, ncol=3))
> + testm.list <- lapply(testdf.list, as.matrix)
> + for(i in seq_len(testsize)) {
> + start <- (i-1)*100 + 1
> + end <- i*100
> + l.m[start:end, ] <- testm.list[[i]]
> + }
> + })
> user system elapsed
> 67.664 0.047 68.009
>
>
>
>
> On Mon, Jun 27, 2016 at 1:05 PM, Marc Schwartz <marc_schwartz at me.com> wrote:
>> Hi,
>>
>> Just to add my tuppence, which might not even be worth that these days...
>>
>> I found the following blog post from 2013, which is likely dated to
>> some extent, but provided some benchmarks for a few methods:
>>
>> http://rcrastinate.blogspot.com/2013/05/the-rbinding-race-for-vs-docall-vs.html
>>
>> There is also a comment with a reference there to using the data.table
>> package, which I don't use, but may be something to evaluate.
>>
>> As Bert and Sarah hinted at, there is overhead in taking the repetitive
>> piecemeal approach.
>>
>> If all of your data frames are of the exact same column structure
>> (column order, column types), it may be prudent to do your own
>> pre-allocation of a data frame that is the target row total size and
>> then "insert" each "sub" data frame by using row indexing into the
>> target structure.
>>
>> Regards,
>>
>> Marc Schwartz
>>
>>
>>> On Jun 27, 2016, at 11:54 AM, Witold E Wolski <wewolski at gmail.com> wrote:
>>>
>>> Hi Bert,
>>>
>>> You are most likely right. I just thought that do.call("rbind", is
>>> somehow more clever and allocates the memory up front. My error. After
>>> more searching I did find rbind.fill from plyr which seems to do the
>>> job (it computes the size of the result data.frame and allocates it
>>> first).
>>>
>>> best
>>>
>>> On 27 June 2016 at 18:49, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>>> The following might be nonsense, as I have no understanding of R
>>>> internals; but ....
>>>>
>>>> "Growing" structures in R by iteratively adding new
pieces is often
>>>> warned to be inefficient when the number of iterations is
large, and
>>>> your rbind() invocation might fall under this rubric. If so,
you might
>>>> try issuing the call say, 20 times, over 10k disjoint subsets
of the
>>>> list, and then rbinding up the 20 large frames.
>>>>
>>>> Again, caveat emptor.
>>>>
>>>> Cheers,
>>>> Bert
>>>>
>>>>
>>>> Bert Gunter
>>>>
>>>> "The trouble with having an open mind is that people keep
coming along
>>>> and sticking things into it."
>>>> -- Opus (aka Berkeley Breathed in his "Bloom County"
comic strip )
>>>>
>>>>
>>>> On Mon, Jun 27, 2016 at 8:51 AM, Witold E Wolski <wewolski at gmail.com> wrote:
>>>>> I have a list (variable name data.list) with approx 200k data.frames
>>>>> with dim(data.frame) approx 100x3.
>>>>>
>>>>> a call
>>>>>
>>>>> data <-do.call("rbind", data.list)
>>>>>
>>>>> does not complete - run time is prohibitive (I killed the rsession
>>>>> after 5 minutes).
>>>>>
>>>>> I would think that merging data.frame's is a common operation. Is
>>>>> there a better function (more performant) that I could use?
>>>>>
>>>>> Thank you.
>>>>> Witold
>>>>>
>
---------------------------------------------------------------------------
Jeff Newmiller  DCN:<jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)