Chris Howden
2010-Oct-12 06:39 UTC
[R] merging and working with BIG data sets. Is sqldf the best way??
Hi everyone,

I'm working with some very big datasets (each dataset has 11 million rows and 2 columns). My first step is to merge all my individual data sets together (I have about 20).

I'm using the following command from sqldf:

    data1 <- sqldf("select A.*, B.* from A inner join B using(ID)")

But it's taking A VERY VERY LONG TIME to merge just 2 of the datasets (well over 2 hours, possibly longer since it's still going).

I was wondering if anyone could suggest a better way, or maybe some suggestions on how I could tweak my computer setup to speed it up?

I've looked at the following packages, and this is the only way I've found to actually merge large data sets in R. These packages seem great for accessing large data sets by avoiding storing them in RAM... but I can't see how they can be used to merge data sets together:

- ff
- filehash
- bigmemory

Does anyone have any ideas?

At the moment my best idea is to hand it over to someone with a dedicated database server and get them to do the merges (and then hope package biglm can do the modelling).

Thanks for any ideas at all!!

Chris Howden
Founding Partner
Tricky Solutions
Tricky Solutions 4 Tricky Problems
Evidence Based Strategic Development, IP development, Data Analysis, Modelling, and Training
(mobile) 0410 689 945
(fax / office) (+618) 8952 7878
chris@trickysolutions.com.au
Jeff Newmiller
2010-Oct-12 06:55 UTC
[R] merging and working with BIG data sets. Is sqldf the best way??
Chris Howden wrote:

> Hi everyone,
>
> I'm working with some very big datasets (each dataset has 11 million
> rows and 2 columns). My first step is to merge all my individual data
> sets together (I have about 20).
>
> I'm using the following command from sqldf:
>
>     data1 <- sqldf("select A.*, B.* from A inner join B using(ID)")
>
> But it's taking A VERY VERY LONG TIME to merge just 2 of the datasets
> (well over 2 hours, possibly longer since it's still going).
>
> [...]
>
> At the moment my best idea is to hand it over to someone with a
> dedicated database server and get them to do the merges (and then hope
> package biglm can do the modelling).

SQLite is not optimized for large datasets. If you don't have familiarity with index generation in SQL databases, then you may well be best served by finding someone who has such experience.

If the data are already sorted by ID, it may be feasible to do the merge with a Perl or Python script, processing records from two files at a time incrementally.

Good luck!
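Jeff's incremental approach can even be sketched in R itself using file connections (a Perl or Python script, as he suggests, would be faster over 11 million rows, but the logic is the same). This is only a sketch, assuming both inputs are headerless "ID,value" CSV files sorted ascending by numeric ID, with at most one record per ID in each file; the file names are hypothetical:

    ## Incremental merge-join of two ID-sorted files, holding only one
    ## record per file in memory at a time.
    merge_sorted <- function(file_a, file_b, file_out) {
      con_a <- file(file_a, "r")
      con_b <- file(file_b, "r")
      out   <- file(file_out, "w")
      on.exit({ close(con_a); close(con_b); close(out) })

      ## Read the next record as a character vector, or NULL at end of file.
      next_rec <- function(con) {
        line <- readLines(con, n = 1)
        if (length(line) == 0) NULL else strsplit(line, ",", fixed = TRUE)[[1]]
      }

      a <- next_rec(con_a)
      b <- next_rec(con_b)
      while (!is.null(a) && !is.null(b)) {
        id_a <- as.numeric(a[1])
        id_b <- as.numeric(b[1])
        if (id_a < id_b) {
          a <- next_rec(con_a)          # A is behind: advance A
        } else if (id_a > id_b) {
          b <- next_rec(con_b)          # B is behind: advance B
        } else {
          ## IDs match: write the joined record and advance both sides.
          writeLines(paste(a[1], a[2], b[2], sep = ","), out)
          a <- next_rec(con_a)
          b <- next_rec(con_b)
        }
      }
    }

    merge_sorted("A_sorted.csv", "B_sorted.csv", "AB_merged.csv")

Duplicate IDs on either side would need extra buffering logic; this sketch only covers the one-to-one inner-join case.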
Gabor Grothendieck
2010-Oct-13 23:31 UTC
[R] merging and working with BIG data sets. Is sqldf the best way??
On Tue, Oct 12, 2010 at 2:39 AM, Chris Howden <chris at trickysolutions.com.au> wrote:

> I'm working with some very big datasets (each dataset has 11 million
> rows and 2 columns). My first step is to merge all my individual data
> sets together (I have about 20).
>
> I'm using the following command from sqldf:
>
>     data1 <- sqldf("select A.*, B.* from A inner join B using(ID)")
>
> But it's taking A VERY VERY LONG TIME to merge just 2 of the datasets
> (well over 2 hours, possibly longer since it's still going).

You need to add indexes to your tables. See example 4i on the sqldf home page (http://sqldf.googlecode.com). This can result in huge speedups for large tables.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
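For later readers, the indexing idea condenses to building the indexes in the same database the join runs against. A minimal sketch along those lines (see example 4i at the link above for the full version), assuming the data frames are named A and B as in Chris's post, and relying on sqldf's documented support for a vector of SQL statements run in order against one temporary database, with only the last result returned:

    library(sqldf)

    ## Build indexes on the join column before the join itself; only the
    ## result of the final select is returned.
    data1 <- sqldf(c(
      "create index ai on A(ID)",
      "create index bi on B(ID)",
      "select A.*, B.* from A inner join B using(ID)"
    ))

Indexing even one side's ID column lets SQLite probe that table by index instead of rescanning it for every row of the other.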