Hi,

I'm sure that a large fixed-width file, say 300 million rows and 1,000 columns, is too large for R to handle on a PC, but are there ways to deal with it?

For example, is there a way to combine some sampling method with read.fwf so that you can read in a sample of, say, 100,000 records? Something like that might make analysis possible.

Once a model is built, is there a way to read in only x rows at a time, score and save each subset separately, and finally append the results back together?

I haven't seen any information on whether this is possible. Thank you for reading, and sorry if the information was easily available and I simply didn't find it.
Try RSiteSearch("biglm") for some threads that discuss strategy for analyzing big datasets. HTH, Chuck On Fri, 26 Sep 2008, zerfetzen wrote:> > Hi, > I'm sure that a large fixed width file, such as 300 million rows and 1,000 > columns, is too large for R to handle on a PC, but are there ways to deal > with it? > > For example, is there a way to combine some sampling method with read.fwf so > that you can read in a sample of 100,000 records, for example? > > Something like this may make analysis possible. > > Once analyzed, is there a way to, say, read in only x rows at a time, save > and score each subset separately, and finally append them back together? > > I haven't seen any information on this, if it is possible. Thank you for > reading, and sorry if the information was easily available and I simply > didn't find it. > -- > View this message in context: http://www.nabble.com/Dealing-With-Extremely-Large-Files-tp19695311p19695311.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >Charles C. Berry (858) 534-2098 Dept of Family/Preventive Medicine E mailto:cberry at tajo.ucsd.edu UC San Diego http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
You can always set up a "connection" and then read in the number of lines you need for the analysis, write out the results, and then read in the next batch. I have also used 'filehash' to read in portions of a file initially and then write the objects into its database; these are quickly retrieved if I want to make subsequent passes through the data.

A sample of 100,000 rows will probably also tax your machine: if the 1,000 columns are numeric, you will need 800MB to store a single copy of the object, and you will probably need 3-4x that amount (a total of about 4GB of physical memory) if you are doing any processing that might make copies. Hopefully you are running on a 64-bit system with lots of memory.

--
Jim Holtman
Cincinnati, OH

What is the problem that you are trying to solve?
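A sketch of that read/score/write cycle, with filehash used to stash each parsed chunk for later passes; the field positions and the scoring rule are made up for illustration:

library(filehash)

dbCreate("chunks.db")            # on-disk key-value store for parsed chunks
db  <- dbInit("chunks.db")
con <- file("big.fwf", open = "r")
out <- file("scores.csv", open = "w")
writeLines("id,score", out)

i <- 0
repeat {
  lines <- readLines(con, n = 100000)
  if (length(lines) == 0) break
  i <- i + 1
  # parse two hypothetical fixed-width fields by character position
  chunk <- data.frame(id = substr(lines, 1, 10),
                      x  = as.numeric(substr(lines, 11, 18)))
  dbInsert(db, paste("chunk", i, sep = ""), chunk)  # keep for later passes
  chunk$score <- 2.5 * chunk$x - 1                  # hypothetical scoring rule
  writeLines(paste(chunk$id, chunk$score, sep = ","), out)
}
close(con)
close(out)
# later passes: dbFetch(db, "chunk1") retrieves a stored chunk quickly

Only one 100,000-row chunk is in memory at a time, and the scores accumulate in scores.csv, which covers the "score each subset separately and append them back together" part of the question.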
Not sure if it applies to your file or not, but if it does, the sqldf package facilitates reading a large file into an SQLite database. It's a front end to RSQLite, which is a front end to SQLite, and it reads the data straight into the database without going through R, so R does not limit it in any way; R only initiates the process. The code to do this is basically just two lines. You don't have to install database software (it's included with the RSQLite package) and you don't have to set up a database at all; sqldf does that for you automatically.

See example 6e on the home page, which creates a database transparently, reads in the data, and extracts random rows from the database into R:

http://sqldf.googlecode.com
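Roughly what example 6e does, sketched from memory; the file name and the exact file.format options are assumptions, and note that this approach wants a delimited (e.g. CSV) copy of the data, since sqldf reads delimited rather than fixed-width files:

library(sqldf)   # loads RSQLite, which bundles SQLite itself

f <- file("big.csv")   # hypothetical delimited copy of the data
# The file is loaded into a temporary on-disk SQLite database (never into
# R); only the 100,000 sampled rows come back as a data frame.
samp <- sqldf("select * from f order by random() limit 100000",
              dbname = tempfile(),
              file.format = list(header = TRUE, row.names = FALSE))

Check the sqldf home page for the exact, current syntax.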