I have a very large tab-delimited file, too big to read into memory via readLines() or read.delim(). It turns out I only need a few hundred of those lines. If the file were not so large, I could read it all in and grep() the lines I need. For such a large file, many calls to read.delim() with incrementing "skip" and "nrows" parameters, followed by grep() calls, is very slow. I am aware of possibilities via SQLite; I would prefer not to use that in this case.

My question is: is there a function for efficiently reading in a file, along the lines of read.delim(), that allows me to specify a filter (via grep or something else) telling the function to read in only those lines that match?

If not, I would *love* to see a "filter" parameter added as an option to read.delim() and/or readLines().

Thanks for any pointers.

--David
Dylan Beaudette
2009-Mar-23 22:09 UTC
[R] read in large data file (tsv) with inline filter?
On Monday 23 March 2009, David Reiss wrote:
> I have a very large tab-delimited file, too big to store in memory via
> readLines() or read.delim(). Turns out I only need a few hundred of those
> lines to be read in. [...]
> If not, I would *love* to see a "filter" parameter added as an option to
> read.delim() and/or readLines().

How about pre-filtering before loading the data into R:

  grep -E 'your pattern here' your_file_here > your_filtered_file

Alternatively, if you need to search within fields, see 'awk' and 'cut'; if you need to delete things, see 'tr'. These tools come with any unix-like OS, and you can probably get them on Windows without much effort.

Cheers,
Dylan

--
Dylan Beaudette
Soil Resource Laboratory
http://casoilresource.lawr.ucdavis.edu/
University of California at Davis
530.754.7341
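A minimal sketch of the same idea done entirely from within R, using a pipe() connection so that grep filters the stream and no intermediate file is needed. The file name and pattern below are placeholders, and this assumes grep is available on the PATH:

  # only lines matching the pattern ever reach R
  con <- pipe("grep -E 'your pattern here' my.tsv")
  DF <- read.delim(con, header = FALSE)

  # note: unless the header line happens to match the pattern, it is
  # filtered out too, so read with header = FALSE and set names() afterwards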
Thomas Lumley
[R] read in large data file (tsv) with inline filter?

On Mon, 23 Mar 2009, David Reiss wrote:
> I have a very large tab-delimited file, too big to store in memory via
> readLines() or read.delim(). [...] For such a large file, many calls to
> read.delim() with incrementing "skip" and "nrows" parameters, followed by
> grep() calls is very slow.

You certainly don't want to use repeated reads from the start of the file with skip=, but if you set up a file connection

  fileconnection <- file("my.tsv", open="r")

you can read from it incrementally with readLines() or read.delim() without going back to the start each time. The speed of this approach should be within a reasonable constant factor of anything else, since reading the file once is unavoidable and should be the bottleneck.

-thomas

Thomas Lumley
Assoc. Professor, Biostatistics
tlumley at u.washington.edu
University of Washington, Seattle
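A minimal sketch of this incremental approach, assuming the wanted lines can be identified by a regular expression; the file name, pattern, and chunk size are placeholders:

  con <- file("my.tsv", open = "r")
  hits <- character(0)
  repeat {
      chunk <- readLines(con, n = 10000)   # read the next block of lines
      if (length(chunk) == 0) break        # end of file reached
      hits <- c(hits, grep("your pattern here", chunk, value = TRUE))
  }
  close(con)

  # parse the few hundred matching lines as tab-delimited fields
  DF <- read.delim(textConnection(hits), header = FALSE)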
Gabor Grothendieck
2009-Mar-24 11:45 UTC
[R] read in large data file (tsv) with inline filter?
On Mon, Mar 23, 2009 at 5:53 PM, David Reiss <dreiss at systemsbiology.org> wrote:
> I have a very large tab-delimited file, too big to store in memory via
> readLines() or read.delim(). [...] I am aware of possibilities via SQLite; I would
> prefer to not use that in this case.

Why the restriction? Using sqldf with SQLite it's only two R statements. The first statement defines the name of the file, and the second defines what you want to extract from it and its format.

First create a sample file using the built-in data frame BOD, with tab delimiters. A header, i.e. col.names = TRUE, is the default for write.table:

  # create tab delimited test file with headers
  write.table(BOD, file = "myfile.dat", sep = "\t")

  # now read in only those records satisfying a condition
  library(sqldf)

  # here are the two statements
  myfile <- file("myfile.dat")
  DF <- sqldf("select * from myfile where demand < 10 or demand > 15",
      file.format = list(sep = "\t"))

See the examples on the home page http://sqldf.googlecode.com and see ?sqldf for info on the file.format argument in case your format differs.
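The design point here is that sqldf hands the file to SQLite for import, so the full table never passes through R as a data frame; only the rows matching the where clause come back into R. If memory is still a concern, ?sqldf describes a dbname argument (e.g. dbname = tempfile()) that keeps the intermediate database on disk rather than in memory.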