I'm finding that readLines() and read.fwf() take nearly two hours to work through a 3.5 GB file, even when reading in large (100 MB) chunks. The Unix command wc, by contrast, processes the same file in three minutes. Is there a faster way to read files in R? Thanks!
You could try it with sqldf and see if that is any faster. It uses RSQLite/SQLite to read the data into a database without going through R, and from there it reads all or a portion, as specified, into R. It requires two lines of code of the form:

f <- file("myfile.dat")
DF <- sqldf("select * from f", dbname = tempfile())

with appropriate modification to specify the format of your file and possibly to indicate a portion only. See example 6 on the sqldf home page, http://sqldf.googlecode.com, and ?sqldf.
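For concreteness, here is a minimal sketch of that pattern. The file name ("myfile.dat"), the header row, and the comma separator are assumptions for illustration; sqldf's file import is geared to delimited data, so a fixed-width file would first need converting to a delimited layout. The file.format argument specifies the layout, and a limit clause is one way to pull in only a portion:

library(sqldf)

f <- file("myfile.dat")                       # hypothetical file name
DF <- sqldf("select * from f", dbname = tempfile(),
            file.format = list(header = TRUE, sep = ","))

## to read only a portion, restrict the query instead, e.g.
## sqldf("select * from f limit 10000", dbname = tempfile())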
First, 'wc' and readLines() are doing vastly different jobs: 'wc' just reads through the file without having to allocate memory for it, while readLines() actually stores the data in memory. I have a 150 MB file I was trying it on, and here is what 'wc' did on my Windows system:

/cygdrive/c: time wc tempxx.txt
  1055808  13718468 151012320 tempxx.txt

real    0m2.343s
user    0m1.702s
sys     0m0.436s

If I multiply that by 25 to extrapolate to a 3.5 GB file, it should take a little less than one minute on my relatively slow laptop. readLines() on the same file takes:

> system.time(x <- readLines('/tempxx.txt'))
   user  system elapsed
  37.82    0.47   39.23
> object.size(x)
84814016 bytes

If I extrapolate that to 3.5 GB, it would take about 16 minutes. Now, considering that I only have 2 GB on my system, I would not be able to read the whole file in at once. You never did specify what type of system you are running on and how much memory you have. Were you paging due to lack of memory?

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
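If memory is the constraint, one way around it (a minimal sketch, with a hypothetical file name and chunk size) is to read through the file in pieces over an open connection, so the whole file never has to be resident at once:

con <- file("tempxx.txt", open = "r")   # hypothetical file name
nlines <- 0
repeat {
  chunk <- readLines(con, n = 100000)   # roughly 100k lines per pass
  if (length(chunk) == 0) break         # end of file
  nlines <- nlines + length(chunk)      # light per-chunk processing goes here
}
close(con)
nlines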
I use statist to convert the fixed-width data file into a csv file, because read.table() is considerably faster than read.fwf(). For example:

system("statist --na-string NA --xcols collist big.txt big.csv")
bigdf <- read.table(file = "big.csv", header = TRUE, as.is = TRUE)

The file collist is a text file whose lines contain the following information:

variable begin end

where "variable" is the column name, and "begin" and "end" are integer numbers indicating where in big.txt the column begins and ends. Statist can be downloaded from:

http://statist.wald.intevation.org/

--
Jakson Aquino
Social Sciences Department
Federal University of Ceará, Brazil
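For concreteness, a collist describing a hypothetical file whose first three fields are id (columns 1-8), date (columns 9-16), and amount (columns 17-24) would look like:

id      1   8
date    9   16
amount  17  24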
Thanks guys, good suggestions. To clarify, I'm running on a fast multi-core server with 16 GB RAM under 64-bit CentOS 5 and R 2.8.1. Paging shouldn't be an issue since I'm reading in chunks and not trying to store the whole file in memory at once. Thanks again.
readChar() is fast. I use strsplit(..., fixed = TRUE) to separate the input data into lines and then use substr() to separate the lines into fields. I do a little light processing and write the result back out with writeChar(). The whole thing takes thirty minutes, where read.fwf() took nearly two hours just to read the data. Thanks for the help!
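A minimal sketch of that approach, with hypothetical file names, chunk size, and field positions; for brevity it assumes each chunk happens to end on a line boundary, whereas real code would carry a trailing partial line over into the next read:

infile <- file("big.fwf", open = "rb")       # hypothetical input file
out    <- file("big_out.csv", open = "wb")   # hypothetical output file
repeat {
  txt <- readChar(infile, nchars = 100e6, useBytes = TRUE)  # ~100 MB per read
  if (length(txt) == 0 || !nzchar(txt)) break               # end of file
  lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
  ## hypothetical fixed-width layout: id in columns 1-8, value in columns 9-20
  id    <- substr(lines, 1, 8)
  value <- substr(lines, 9, 20)
  ## light processing would go here, then write the reassembled lines back out
  writeChar(paste0(id, ",", value, "\n"), out, eos = NULL)
}
close(infile)
close(out)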