Dear R-help,

I have a very large ASCII data file, of which I only want to read in selected lines (e.g. one fourth of the lines); which lines to keep depends on the lines' content. So far, I have found two approaches for doing this in R: 1) read the file line by line using a repeat loop and save the result in a temporary file or a variable, or 2) read the entire file and filter/reshape it using *apply methods.

To my understanding, repeat{} loops are quite slow in R, and reading an entire file only to discard three quarters of the data is a bit of overkill, not to mention loading a 650MB text file into memory.

What I am looking for is a function that works like the first approach but avoids explicit loops; I imagine it would be implemented in a lower-level language to be more efficient. Naturally, when calling the function, one would provide a function that determines if/how each line should be appended to a variable. Alternatively, an object working as a generator (in Python terms) could be used with the normal *apply functions. I imagine this working differently from e.g. sapply(readLines("myfile.txt"), FUN=selector): there, readLines is executed first, loading the entire file into memory and supplying it to sapply, whereas the generator object would only read a line when sapply requests the next element.

Are there options for this kind of operation?

Kind regards,

Stefan McKinnon Høj-Edwards     Dept. of Genetics and Biotechnology
PhD student                     Faculty of Agricultural Sciences
stefan.hoj-edwards at agrsci.dk  Aarhus University
Tel.: +45 8999 1291             Blichers Allé 20, Postboks 50
Web: www.iysik.com              DK-8830 Tjele
                                Tel.: +45 8999 1900
                                Web: www.agrsci.au.dk
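P.S. For reference, my current line-by-line approach looks roughly like this (a sketch only; keep_line() is a placeholder for whatever content test is applied):

    # Approach 1: read one line at a time, keeping only matching lines.
    con <- file("myfile.txt", open = "r")
    kept <- character(0)
    repeat {
      line <- readLines(con, n = 1)
      if (length(line) == 0) break      # end of file
      if (keep_line(line))              # placeholder predicate
        kept <- c(kept, line)           # growing a vector like this is itself slow
    }
    close(con)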
What is overkill about reading in a 650MB text file if you have the space? You are going to have to process it one way or another. I would use 'readLines' to read it in, then 'grepl' to determine which lines I want to keep, delete the rest, and write the new file out. At that point I can probably use 'read.table' to process the new file. This works pretty fast if you can apply pattern matching to determine which lines you want to keep (see the sketch below my signature).

If you don't have the memory to read in the whole file, then set up a loop and read in whatever amount makes sense (e.g., 100MB at a time), and do the processing above with the output file opened at the beginning so that you keep appending to it.

You probably need to state what type of criteria you would be applying to the lines to determine if you want to keep them. You can also use perl, sed, awk, ... to do the processing.

2011/9/14 Stefan McKinnon Høj-Edwards <Stefan.Hoj-Edwards at agrsci.dk>:
> [original message snipped]

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
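P.S. Something along these lines (untested; the pattern "^KEEP" is just a placeholder for whatever identifies your lines):

    # Whole-file version: read, filter with a vectorized match, write out.
    lines <- readLines("myfile.txt")
    keep  <- grepl("^KEEP", lines)            # placeholder pattern
    writeLines(lines[keep], "filtered.txt")
    dat <- read.table("filtered.txt")         # process the reduced file

    # Chunked version if memory is tight: append to one open output file.
    infile  <- file("myfile.txt", open = "r")
    outfile <- file("filtered.txt", open = "w")
    repeat {
      chunk <- readLines(infile, n = 1e6)     # about a million lines at a time
      if (length(chunk) == 0) break
      writeLines(chunk[grepl("^KEEP", chunk)], outfile)
    }
    close(infile)
    close(outfile)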
2011/9/14 Stefan McKinnon Høj-Edwards <Stefan.Hoj-Edwards at agrsci.dk>:
> [original message snipped]

read.csv.sql in the sqldf package can read a file and deliver just a subset to R. The portion desired is specified using SQL, and the entire operation can be done in a single line of code. It can handle files too large to read into R, since only the portion desired is ever read into R itself.

See Example 13 on the sqldf home page:
http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
and also read ?read.csv.sql .

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
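P.S. For example, something like this (a sketch; the column name "score", the condition, and the tab separator are all placeholders for your actual file; within the sql argument the input file is referred to as the table "file"):

    library(sqldf)
    # Filtering happens in SQLite, so only the matching rows ever enter R.
    dat <- read.csv.sql("myfile.txt",
                        sql = "select * from file where score > 0",
                        sep = "\t")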
On Sep 14, 2011, at 7:08 AM, Stefan McKinnon Høj-Edwards wrote:

> [... I have found two approaches: 1) read the file line by line with a
> repeat loop, or 2) read the entire file and filter/reshape it using
> *apply methods ...]

Better to use vectorized methods. The *apply functions are really no faster than loops.

> [... repeat{} loops are quite slow in R, and loading a 650MB text file
> into memory seems overkill ...]

People's perception of "large" may vary, and to me that is a medium-size file. It seems quite likely to fit in most modern computers, at least for the purpose of eliminating the undesired rows and then having a reduced dataset to write to a working file.

> [... a lower-level line-reading function, or a generator-like object
> (in Python terms) usable with the normal *apply functions ...]

There are database interfaces to R. You have told us nothing about your OS or hardware, so it's a bit difficult to match recommendations to your specific situation.

> Are there options for this kind of operation?

Many, ... once details are provided. This message arrived with useful guidance:

----------
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
------------

David.
That looks like a perfect job for (g)awk, which is included in every Linux distribution but also available for Windows. It can be called with something like

    system("awk -f script.awk inputfile.txt")

and does its job silently and very fast; 650MB should not be an issue (see the sketch after the quoted message below). I'm not proficient in awk but would offer my help anyway (off-list...).

Rgds,
Rainer

On Wednesday 14 September 2011 13:08:14 Stefan McKinnon Høj-Edwards wrote:
> [original message snipped]
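P.S. A minimal sketch of the awk route (untested; the condition '$3 > 0' is just a placeholder for whatever test identifies the wanted lines):

    # Write a reduced file with awk, then read it with read.table:
    system("awk '$3 > 0' inputfile.txt > filtered.txt")
    dat <- read.table("filtered.txt")

    # Or skip the intermediate file and read awk's output through a pipe:
    dat <- read.table(pipe("awk '$3 > 0' inputfile.txt"))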