Hi

I have 6 rather big data sets (between 400,000 and 800,000 lines) of transport data (times, distances and travellers between nodes). They all share a common index (start-end node pairs).

I want to aggregate this data, but for that I have to merge the sets first. I tried "merge", with the result that R (3.0.1) crashes (Windows 8 machine, 16 GB RAM). Then I tried the join from the data.table package; there I got the message that 2^34 is too big (no idea why it is 2^34, as it is a left join). Then I decided to loop over the tables and assign row by row, which takes a very, very long time (still running at the moment).

Here is the code:

    for (i in 1:length(dataP$Start)) {
        c <- dataP$Start[i]
        d <- dataP$End[i]
        dataP[J(c, d)]$OEV.T <- ttoevP[J(c, d)]$OEV.T
    }

dataP has 800,000 lines and ttoevP about 500,000 lines.

Any hints to speed up this process are welcome.

Renger
_________________________________________
Centre of Economic Research (CER-ETH)
Zürichbergstrasse 18 (ZUE)
CH - 8032 Zürich
+41 44 632 02 63
mailto: rengerv@etzh.ch
blog.modelworks.ch
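[Editor's note: the "2^34 rows" message from data.table typically means the join keys are not unique, so the join explodes combinatorially. A minimal sketch of a keyed update join, using toy stand-ins for the real tables (only the column names Start, End and OEV.T are taken from the post; the data and the deduplication step are illustrative):

```r
library(data.table)

# toy stand-ins for the real 800k/500k-row tables
dataP  <- data.table(Start = c(1, 1, 2), End = c(2, 3, 3))
ttoevP <- data.table(Start = c(1, 2), End = c(2, 3), OEV.T = c(10, 20))

setkey(dataP, Start, End)
setkey(ttoevP, Start, End)

# update join: fills dataP$OEV.T in place from ttoevP without copying
# the big table; rows of dataP with no match in ttoevP get NA
dataP[ttoevP, OEV.T := i.OEV.T]

# if the "more than 2^34 rows" error appears, ttoevP probably has
# duplicate Start/End keys -- deduplicate first:
# ttoevP <- unique(ttoevP, by = key(ttoevP))
```

The update join avoids materialising the full merged result, which is what exhausts memory with merge().]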
Please don't post in HTML. (Read the Posting Guide.)

Consider using the sqldf package.

---------------------------------------------------------------------------
Jeff Newmiller
Research Engineer (Solar/Batteries with /Software/Embedded Controllers)
DCN: <jdnewmil at dcn.davis.ca.us>
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.

Renger van Nieuwkoop <renger at vannieuwkoop.ch> wrote:
> [quoted original message trimmed]
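[Editor's note: the sqldf route Jeff suggests could look roughly like this — a sketch on toy data, assuming the Start/End/OEV.T column names from the post. OEV.T is renamed because dots are awkward in SQL identifiers, and End must be bracket-quoted because END is an SQL keyword:

```r
library(sqldf)

# toy stand-ins for the real tables
dataP  <- data.frame(Start = c(1, 1, 2), End = c(2, 3, 3))
ttoevP <- data.frame(Start = c(1, 2), End = c(2, 3), OEV.T = c(10, 20))

# rename OEV.T -> OEV_T: dots in column names clash with SQL syntax
names(ttoevP)[names(ttoevP) == "OEV.T"] <- "OEV_T"

# left join executed in SQLite; [End] quotes the SQL keyword END
merged <- sqldf("SELECT d.Start, d.[End], t.OEV_T
                 FROM dataP d
                 LEFT JOIN ttoevP t
                   ON d.Start = t.Start AND d.[End] = t.[End]")
```

sqldf stages the data frames in an SQLite database, so the join runs outside R's memory-hungry merge machinery.]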
------------------------------------------------------------------------

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Your sample code does not run, and you didn't show the format of your inputs.

If your ttoevP is a table without duplicate Start/End pairs that maps Start/End to OEV.T, then using match and subscripting can be quicker than merge. E.g.,

    f0 <- function(dataP, ttoevP) {
        encode <- function(df) paste(df$Start, df$End, sep = "\r")
        if (anyDuplicated(ttoevPEncoded <- encode(ttoevP))) {
            stop("duplicated Start/End pairs in ttoevP")
        }
        i <- match(encode(dataP), ttoevPEncoded)
        dataP$OEV.T <- ttoevP$OEV.T[i]
        dataP
    }

I made sample inputs with

    makeData <- function(nrow, nTimes) {
        Start <- trunc(runif(nrow, 1, nTimes))
        End <- trunc(runif(nrow, Start, nTimes))
        dataP <- data.frame(Start = Start, End = End)
        ttoevP <- expand.grid(Start = seq_len(nTimes), End = seq_len(nTimes))
        ttoevP <- ttoevP[ttoevP$Start <= ttoevP$End, ]
        ttoevP$OEV.T <- paste(ttoevP$Start, ttoevP$End, sep = "-")
        list(dataP = dataP, ttoevP = ttoevP)
    }

For nrow = 4*10^4 and nTimes = 10^3, f0 took 1.5 seconds and merge 10.5. Aside from the order of the rows, they produced the same output.

You can make your looping solution faster by removing repeated operations from the loop, especially operations on a data.frame (vector operations are much faster, and no operation at all is faster still). E.g., replace

    for (i in seq_len(nrow(df))) {
        df$column[i] <- func(i)
    }

with

    column <- df$column
    for (i in seq_len(nrow(df))) {
        column[i] <- func(i)
    }
    df$column <- column

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
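[Editor's note: Bill's hoisting advice as a tiny self-contained sketch — a toy column x and a squaring step stand in for the real data and func:

```r
df <- data.frame(x = 1:5, y = NA_real_)

# slow pattern: indexes into the data.frame on every iteration,
# which re-dispatches $ and [ each time
for (i in seq_len(nrow(df))) df$y[i] <- df$x[i]^2

# faster pattern: hoist to plain vectors, write back once at the end
x <- df$x
y <- numeric(length(x))
for (i in seq_along(x)) y[i] <- x[i]^2
df$y2 <- y

# both loops compute the same result
stopifnot(identical(df$y, df$y2))
```

On a data.frame with hundreds of thousands of rows the difference between the two patterns is dramatic, because the slow version pays data.frame indexing overhead on every iteration.]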