I have a very large tab-delimited file, too big to read into memory via readLines() or read.delim(). It turns out I only need a few hundred of those lines. If the file were not so large, I could read it all in and grep() the lines I need. For such a large file, many calls to read.delim() with incrementing "skip" and "nrows" parameters, followed by grep() calls, is very slow. I am aware of possibilities via SQLite; I would prefer not to use that in this case.

My question is: is there a function for efficiently reading in a file, along the lines of read.delim(), that allows me to specify a filter (via grep or something else) telling the function to read in only those lines that match?

If not, I would *love* to see a "filter" parameter added as an option to read.delim() and/or readLines().

Thanks for any pointers.

--David
Dylan Beaudette
2009-Mar-23 22:09 UTC
[R] read in large data file (tsv) with inline filter?
On Monday 23 March 2009, David Reiss wrote:
> I have a very large tab-delimited file, too big to store in memory via
> readLines() or read.delim(). Turns out I only need a few hundred of those
> lines to be read in. [...]
> If not, I would *love* to see a "filter" parameter added as an option to
> read.delim() and/or readLines().

How about pre-filtering before loading the data into R:

  grep -E 'your pattern here' your_file_here > your_filtered_file

Alternatively, if you need to search within fields, see 'awk' and 'cut'; if you need to delete things, see 'tr'. These tools come with any unix-like OS, and you can probably get them on Windows without much effort.

Cheers,
Dylan

--
Dylan Beaudette
Soil Resource Laboratory
http://casoilresource.lawr.ucdavis.edu/
University of California at Davis
530.754.7341
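A minimal sketch of the same idea done entirely from within R, using a pipe() connection so that grep filters the stream and no intermediate file is needed. The file name and pattern below are placeholders, and this assumes grep is available on the PATH:

  # only lines matching the pattern ever reach R
  con <- pipe("grep -E 'your pattern here' my.tsv")
  DF <- read.delim(con, header = FALSE)

  # note: unless the header line happens to match the pattern, it is
  # filtered out too, so read with header = FALSE and set names() afterwards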
Thomas Lumley
[R] read in large data file (tsv) with inline filter?

On Mon, 23 Mar 2009, David Reiss wrote:
> I have a very large tab-delimited file, too big to store in memory via
> readLines() or read.delim(). [...] For such a large file, many calls to
> read.delim() with incrementing "skip" and "nrows" parameters, followed by
> grep() calls is very slow.

You certainly don't want to use repeated reads from the start of the file with skip=, but if you set up a file connection

  fileconnection <- file("my.tsv", open="r")

you can read from it incrementally with readLines() or read.delim() without going back to the start each time. The speed of this approach should be within a reasonable constant factor of anything else, since reading the file once is unavoidable and should be the bottleneck.

-thomas

Thomas Lumley
Assoc. Professor, Biostatistics
tlumley at u.washington.edu
University of Washington, Seattle
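A minimal sketch of this incremental approach, assuming the wanted lines can be identified by a regular expression; the file name, pattern, and chunk size are placeholders:

  con <- file("my.tsv", open = "r")
  hits <- character(0)
  repeat {
      chunk <- readLines(con, n = 10000)   # read the next block of lines
      if (length(chunk) == 0) break        # end of file reached
      hits <- c(hits, grep("your pattern here", chunk, value = TRUE))
  }
  close(con)

  # parse the few hundred matching lines as tab-delimited fields
  DF <- read.delim(textConnection(hits), header = FALSE)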
Gabor Grothendieck
2009-Mar-24 11:45 UTC
[R] read in large data file (tsv) with inline filter?
On Mon, Mar 23, 2009 at 5:53 PM, David Reiss <dreiss at systemsbiology.org> wrote:
> I have a very large tab-delimited file, too big to store in memory via
> readLines() or read.delim(). [...] I am aware of possibilities via SQLite; I would
> prefer to not use that in this case.

Why the restriction? Using sqldf with SQLite it's only two R statements. The first statement defines the name of the file, and the second defines what you want to extract from it and its format.

First create a sample file using the built-in data frame BOD, with tab delimiters. A header, i.e. col.names = TRUE, is the default for write.table:

  # create tab delimited test file with headers
  write.table(BOD, file = "myfile.dat", sep = "\t")

  # now read in only those records satisfying a condition
  library(sqldf)

  # here are the two statements
  myfile <- file("myfile.dat")
  DF <- sqldf("select * from myfile where demand < 10 or demand > 15",
      file.format = list(sep = "\t"))

See the examples on the home page http://sqldf.googlecode.com and see ?sqldf for info on the file.format argument in case your format differs.
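The design point here is that sqldf hands the file to SQLite for import, so the full table never passes through R as a data frame; only the rows matching the where clause come back into R. If memory is still a concern, ?sqldf describes a dbname argument (e.g. dbname = tempfile()) that keeps the intermediate database on disk rather than in memory.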