Hi all,

I am having problems importing a VERY large dataset into R. I have looked into the ff package, and it seems to suit me, but from all the examples I have seen it either requires manual creation of the database or needs a read.table kind of step. Being survey data, the file is big (roughly 20,000 by 50,000, about 1.2 GB in plain text); the memory I have isn't enough for a read.table and my computer freezes every time :(

So far I have managed to import the required subset of the data by using a "cheat": I used GRETL to read an equivalent Stata file (released by the same source that offered the csv file), manipulate it, and export it in a format that R can read into memory. Easy! But I am wondering, how could this be done entirely in R from scratch?

Thanks
Gabor Grothendieck
2009-Jul-14 19:48 UTC
[R] How to import BIG csv files with separate "map"?
On Tue, Jul 14, 2009 at 1:53 PM, giusto <giusto at uoregon.edu> wrote:
> I am having problems importing a VERY large dataset into R. [...] The
> memory I have isn't enough for a read.table and my computer freezes
> every time :(

Either of the following can be done in one line of code:

Using the nrows and skip arguments to read.table, one can read in a subset of rows. Using the colClasses argument of read.table, the class "NULL" will suppress reading of the corresponding column.

read.csv.sql from the sqldf package will create a database on the fly, read in the data, extract it to R according to whatever SQL statement you give its sql argument, and then destroy the database, so you have all the flexibility of SQL in selecting a portion of the data. See http://sqldf.googlecode.com and the example here: http://code.google.com/p/sqldf/#Example_13._read.csv.sql
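For example, a minimal sketch of both approaches; the file name "survey.csv", the three-column layout, and the variable names are hypothetical stand-ins for the real data:

# Read only the first 1,000 data rows, dropping the second of three
# columns by marking its class as "NULL".
dat <- read.table("survey.csv", sep = ",", header = TRUE, nrows = 1000,
                  colClasses = c("numeric", "NULL", "character"))

# Read rows 1001-2000 instead: skip the header line plus the first
# 1,000 data rows.
dat2 <- read.table("survey.csv", sep = ",", header = FALSE, skip = 1001,
                   nrows = 1000,
                   colClasses = c("numeric", "NULL", "character"))

# Or let sqldf build a temporary database and return only what the SQL
# statement selects (var1 and var3 are made-up column names; the table
# is referred to as "file" in read.csv.sql's sql argument).
library(sqldf)
sub <- read.csv.sql("survey.csv",
                    sql = "select var1, var3 from file where var3 > 0")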
Steve Lianoglou
2009-Jul-14 19:50 UTC
[R] How to import BIG csv files with separate "map"?
Hi,

On Jul 14, 2009, at 1:53 PM, giusto wrote:

> I am having problems importing a VERY large dataset into R. I have looked
> into the package ff, and that seems to suit me, but from all the examples
> I have seen it either requires manual creation of the database or needs a
> read.table kind of step. Being survey data, the file is big (roughly
> 20,000 by 50,000, about 1.2 GB in plain text); the memory I have isn't
> enough for a read.table and my computer freezes every time :(

Look at the documentation near the end of ?read.table:

"""Note that unless colClasses is specified, all columns are read as
character columns and then converted. This means that quotes are
interpreted in all fields and that a column of values like "42" will
result in an integer column."""

So all the data is read in as characters, and then R tries to convert it to
the appropriate data type (which uses a lot of memory). Perhaps specifying
the type of each column in the colClasses parameter can get you where you
need to be.

> So far I have managed to import the required subset of the data by using
> a "cheat": I used GRETL to read an equivalent Stata file (released by the
> same source that offered the csv file), manipulate it, and export it in a
> format that R can read into memory.

I'm not sure if you're suggesting that R can read in the whole data file
when it is stored in the Stata binary format. If so, perhaps the colClasses
trick above might work.

If read.table with colClasses doesn't work (and you know you can load the
entire dataset into R via some binary format), perhaps you'll have to parse
the file line by line: open it with a file(.., 'r') command and use scan
(or readChar?) to run through the file without having to load it all into
memory at once.

HTH,
-steve

--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Medical College of Cornell University

Contact Info: http://cbio.mskcc.org/~lianos/contact
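A rough sketch of that line-by-line idea, using readLines in place of scan; the file name, chunk size, and kept-column positions are made up, and quoted fields containing commas are not handled:

con <- file("survey.csv", open = "r")
header <- readLines(con, n = 1)      # consume (and optionally parse) the header row
keep <- c(1, 3, 7)                   # positions of the columns actually needed
chunks <- list()
repeat {
  lines <- readLines(con, n = 5000)  # read 5,000 rows at a time
  if (length(lines) == 0) break
  fields <- strsplit(lines, ",", fixed = TRUE)
  chunks[[length(chunks) + 1L]] <- do.call(rbind, lapply(fields, `[`, keep))
}
close(con)
dat <- as.data.frame(do.call(rbind, chunks), stringsAsFactors = FALSE)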