Hi
I have six rather big data sets (between 400,000 and 800,000 lines) of
transport data (times, distances, and numbers of travelers between nodes).
They all share a common index (start-end node pairs).
I want to aggregate these data, but for that I first have to merge them.
I tried merge(), with the result that R (3.0.1) crashes (Windows 8
machine, 16 GB RAM).
Then I tried the join from the data.table package; there I got the message
that 2^34 is too big (no idea why it is 2^34, as it is a left join).
Then I resorted to a loop over the tables, assigning the values row by row,
which takes a very, very long time (still running at the moment).
Here is the code:
for (i in 1:length(dataP$Start)) {
    c <- dataP$Start[i]
    d <- dataP$End[i]
    dataP[J(c, d)]$OEV.T <- ttoevP[J(c, d)]$OEV.T
}
dataP has 800,000 lines and ttoevP has about 500,000 lines.
Any hints to speed up this process are welcome.
Renger
_________________________________________
Centre of Economic Research (CER-ETH)
Zürichbergstrasse 18 (ZUE)
CH - 8032 Zürich
+41 44 632 02 63
rengerv@etzh.ch
blog.modelworks.ch
Please don't post in HTML. (Read the Posting Guide.)
Consider using the sqldf package.
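For the record, a minimal sketch of what that could look like, assuming the
two tables are data.frames named as in the post and that dataP does not yet
carry an OEV.T column (sqldf hands the join to SQLite; passing dbname =
tempfile() keeps the work in an on-disk database rather than in RAM):

library(sqldf)

# "OEV.T" contains a dot, which is awkward in SQL identifiers, so rename
# it first; "End" is a SQL keyword, hence the double quotes below.
names(ttoevP)[names(ttoevP) == "OEV.T"] <- "OEV_T"
merged <- sqldf('SELECT d.*, t.OEV_T
                 FROM dataP d
                 LEFT JOIN ttoevP t
                   ON d.Start = t.Start AND d."End" = t."End"',
                dbname = tempfile())  # on-disk SQLite database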
---------------------------------------------------------------------------
Jeff Newmiller                        DCN: <jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
Your sample code does not run, and you didn't show the format of your
inputs. If your ttoevP is a table that maps Start/End pairs to OEV.T,
with no duplicate Start/End pairs, then using match() and subscripting
can be quicker than merge(). E.g.,
f0 <- function(dataP, ttoevP)
{
    # Encode each Start/End pair as a single string key; "\r" should not
    # occur in the data, so the encoding is unambiguous.
    encode <- function(df) paste(df$Start, df$End, sep = "\r")
    if (anyDuplicated(ttoevPEncoded <- encode(ttoevP))) {
        stop("duplicated Start/End pairs in ttoevP")
    }
    # For each row of dataP, find the matching row of ttoevP ...
    i <- match(encode(dataP), ttoevPEncoded)
    # ... and pull its OEV.T across (NA where there is no match).
    dataP$OEV.T <- ttoevP$OEV.T[i]
    dataP
}
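On your tables the call would just be (assuming both are plain data.frames):

dataP <- f0(dataP, ttoevP)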
I made sample inputs with
makeData <- function(nrow, nTimes)
{
    # Random Start/End pairs with Start <= End.
    Start <- trunc(runif(nrow, 1, nTimes))
    End <- trunc(runif(nrow, Start, nTimes))
    dataP <- data.frame(Start = Start, End = End)
    # Lookup table: one row for each pair with Start <= End, plus a
    # made-up OEV.T value to carry across in the join.
    ttoevP <- expand.grid(Start = seq_len(nTimes), End = seq_len(nTimes))
    ttoevP <- ttoevP[ttoevP$Start <= ttoevP$End, ]
    ttoevP$OEV.T <- paste(ttoevP$Start, ttoevP$End, sep = "-")
    list(dataP = dataP, ttoevP = ttoevP)
}
For nrow = 4*10^4 and nTimes = 10^3, f0 took 1.5 seconds and merge took
10.5. Aside from the order of the rows, they produced the same output.
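A sketch of how that comparison can be run (exact timings will of course
vary by machine):

z <- makeData(nrow = 4e4, nTimes = 1e3)
system.time(a <- f0(z$dataP, z$ttoevP))
system.time(b <- merge(z$dataP, z$ttoevP, all.x = TRUE))

# Compare after putting both results in the same row order.
o <- function(d) { d <- d[order(d$Start, d$End), ]; rownames(d) <- NULL; d }
identical(o(a), o(b))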
You can make your looping solution faster by removing repeated operations
from the loop, especially operations on a data.frame (vector operations
are much faster, and no operation at all is faster still). E.g., replace
for (i in seq_len(nrow(df))) {
    df$column[i] <- func(i)
}

with

column <- df$column
for (i in seq_len(nrow(df))) {
    column[i] <- func(i)
}
df$column <- column
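Applied to the loop in your post, that looks something like the following
sketch (assuming dataP and ttoevP are data.tables keyed by Start and End,
as your use of J() suggests; it hoists the column accesses out of the loop
but keeps the per-row lookup, so match() above should still beat it easily):

start <- dataP$Start
end <- dataP$End
oev <- dataP$OEV.T
for (i in seq_along(start)) {
    oev[i] <- ttoevP[J(start[i], end[i])]$OEV.T
}
dataP$OEV.T <- oev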
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com