thr3ads.net - R help - [R] Scanning grep through huge files [Nov 2009]

Johannes Graumann

2009-Nov-03 14:29 UTC

[R] Scanning grep through huge files

Hi,

I'm dealing with huge files that I would like to index. On a Linux system,
"grep -buo <PATTERN> <FILENAME>" hands me the byte offsets for "PATTERN" very
quickly, and I am looking to emulate that speed and ease with native R tools,
for portability and elegance. "gregexpr" should be able to do that, but I
fail to combine it with "scan" or an equivalent to parse the whole file
without having to read it all into memory.

I'd be grateful for any hints on how to do this without a
"pipe("grep -buo <PATTERN> <FILENAME>")".

Thanks, Joh

Duncan Murdoch

2009-Nov-03 14:51 UTC

[R] Scanning grep through huge files

On 11/3/2009 9:29 AM, Johannes Graumann wrote:
> Hi,
>
> I'm dealing with huge files that I would like to index. On a Linux system,
> "grep -buo <PATTERN> <FILENAME>" hands me the byte offsets for "PATTERN" very
> quickly, and I am looking to emulate that speed and ease with native R tools,
> for portability and elegance. "gregexpr" should be able to do that, but I
> fail to combine it with "scan" or an equivalent to parse the whole file
> without having to read it all into memory.

I think you are going to have to write this yourself.  R doesn't have 
very many stream oriented functions:  almost everything is aimed at 
having the whole thing in memory.

You will also have trouble with the byte offsets.  The semantics of the 
-u option to grep are quite strange (at least according to the man page 
on Cygwin).

What I'd do given your problem is use readLines to read the file, then 
post-process the result of gregexpr to give line and byte offset pairs 
for each match; those are more useful in R than the rather bizarre "byte 
offsets" that grep -buo will give.  But for a huge file you'll probably
have to do this in blocks, as the whole file may be too big.
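
A minimal sketch of the block-wise approach described above, offered as an
illustration rather than a tested solution for very large inputs. The function
name grep_offsets and its block_size argument are hypothetical; only
readLines, gregexpr and the usual connection functions come from base R. It
reports the line number and the 1-based byte offset within that line for each
match, not the absolute file offsets that grep -b prints.

```r
# Hypothetical helper: scan a file in blocks of lines and report, for each
# match, its line number and byte offset within that line.
grep_offsets <- function(path, pattern, block_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  results <- list()
  lines_seen <- 0L
  repeat {
    # On an open connection, readLines continues where the last call stopped.
    block <- readLines(con, n = block_size, warn = FALSE)
    if (length(block) == 0L) break
    # useBytes = TRUE makes gregexpr report byte (not character) positions.
    matches <- gregexpr(pattern, block, useBytes = TRUE)
    for (i in seq_along(matches)) {
      m <- matches[[i]]
      if (m[1L] != -1L) {  # -1 means "no match on this line"
        results[[length(results) + 1L]] <- data.frame(
          line   = lines_seen + i,
          byte   = as.integer(m),            # 1-based offset within the line
          length = attr(m, "match.length")   # match length in bytes
        )
      }
    }
    lines_seen <- lines_seen + length(block)
  }
  if (length(results) == 0L) {
    return(data.frame(line = integer(0), byte = integer(0),
                      length = integer(0)))
  }
  do.call(rbind, results)
}
```

Only one block of lines is held in memory at a time, so block_size trades
memory use against the number of readLines calls. The per-line loop is kept
for clarity and will be slow when matches are very frequent.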

Duncan Murdoch

> 
> I'd be grateful for any hints on how to do this without a
> "pipe("grep -buo <PATTERN> <FILENAME>")".
> 
> Thanks, Joh
> 