Dear R-help,

I have a very large ASCII data file, of which I only want to read in selected lines (e.g. one fourth of the lines); which lines to keep depends on the lines' content. So far, I have found two approaches for doing this in R: 1) read the file line by line using a repeat loop and save the result in a temporary file or a variable, or 2) read the entire file and filter/reshape it using *apply methods.

To my understanding, repeat{} loops are quite slow in R, and reading an entire file only to discard three quarters of the data is a bit of overkill, not to mention loading a 650MB text file into memory.

What I am looking for is a function that works like the first approach but avoids explicit loops; I imagine it would be implemented in a lower-level language to be more efficient. Naturally, when calling the function, one would provide a function that determines if/how each line should be appended to a variable. Alternatively, an object working as a generator (in Python terms) could be used with the normal *apply functions. I imagine this working differently from e.g. sapply(readLines("myfile.txt"), FUN=selector): there, readLines is executed first, loading the entire file into memory and supplying it to sapply, whereas the generator object would only read a line when sapply requests the next element.

Are there options for this kind of operation?

Kind regards,

Stefan McKinnon Høj-Edwards     Dept. of Genetics and Biotechnology
PhD student                     Faculty of Agricultural Sciences
stefan.hoj-edwards at agrsci.dk  Aarhus University
Tel.: +45 8999 1291             Blichers Allé 20, Postboks 50
Web: www.iysik.com              DK-8830 Tjele
                                Tel.: +45 8999 1900
                                Web: www.agrsci.au.dk
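P.S. For reference, my current line-by-line approach looks roughly like this (a sketch only; keep_line() is a placeholder for whatever content test is applied):

    # Approach 1: read one line at a time, keeping only matching lines.
    con <- file("myfile.txt", open = "r")
    kept <- character(0)
    repeat {
      line <- readLines(con, n = 1)
      if (length(line) == 0) break      # end of file
      if (keep_line(line))              # placeholder predicate
        kept <- c(kept, line)           # growing a vector like this is itself slow
    }
    close(con)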
What is overkill about reading in a 650MB text file if you have the space? You are going to have to process it one way or another. I would use 'readLines' to read it in, then 'grepl' to determine which lines I want to keep, delete the rest, and write the new file out. At that point I can probably use 'read.table' to process the new file. This works pretty fast if you can apply pattern matching to determine which lines you want to keep (see the sketch below my signature).

If you don't have the memory to read in the whole file, then set up a loop and read in whatever amount makes sense (e.g., 100MB at a time), and do the processing above with the output file opened at the beginning so that you keep appending to it.

You probably need to state what type of criteria you would be applying to the lines to determine if you want to keep them. You can also use perl, sed, awk, ... to do the processing.

2011/9/14 Stefan McKinnon Høj-Edwards <Stefan.Hoj-Edwards at agrsci.dk>:
> [original message snipped]

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
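P.S. Something along these lines (untested; the pattern "^KEEP" is just a placeholder for whatever identifies your lines):

    # Whole-file version: read, filter with a vectorized match, write out.
    lines <- readLines("myfile.txt")
    keep  <- grepl("^KEEP", lines)            # placeholder pattern
    writeLines(lines[keep], "filtered.txt")
    dat <- read.table("filtered.txt")         # process the reduced file

    # Chunked version if memory is tight: append to one open output file.
    infile  <- file("myfile.txt", open = "r")
    outfile <- file("filtered.txt", open = "w")
    repeat {
      chunk <- readLines(infile, n = 1e6)     # about a million lines at a time
      if (length(chunk) == 0) break
      writeLines(chunk[grepl("^KEEP", chunk)], outfile)
    }
    close(infile)
    close(outfile)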
2011/9/14 Stefan McKinnon Høj-Edwards <Stefan.Hoj-Edwards at agrsci.dk>:
> [original message snipped]

read.csv.sql in the sqldf package can read a file and deliver just a subset to R. The portion desired is specified using SQL, and the entire operation can be done in a single line of code. It can handle files too large to read into R, since only the portion desired is ever read into R itself.

See Example 13 on the sqldf home page:
http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
and also read ?read.csv.sql .

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
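P.S. For example, something like this (a sketch; the column name "score", the condition, and the tab separator are all placeholders for your actual file; within the sql argument the input file is referred to as the table "file"):

    library(sqldf)
    # Filtering happens in SQLite, so only the matching rows ever enter R.
    dat <- read.csv.sql("myfile.txt",
                        sql = "select * from file where score > 0",
                        sep = "\t")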
On Sep 14, 2011, at 7:08 AM, Stefan McKinnon Høj-Edwards wrote:

> [... I have found two approaches: 1) read the file line by line with a
> repeat loop, or 2) read the entire file and filter/reshape it using
> *apply methods ...]

Better to use vectorized methods. The *apply functions are really no faster than loops.

> [... repeat{} loops are quite slow in R, and loading a 650MB text file
> into memory seems overkill ...]

People's perception of "large" may vary, and to me that is a medium-size file. It seems quite likely to fit in most modern computers, at least for the purpose of eliminating the undesired rows and then having a reduced dataset to write to a working file.

> [... a lower-level line-reading function, or a generator-like object
> (in Python terms) usable with the normal *apply functions ...]

There are database interfaces to R. You have told us nothing about your OS or hardware, so it's a bit difficult to match recommendations to your specific situation.

> Are there options for this kind of operation?

Many, ... once details are provided. This message arrived with useful guidance:

----------
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
------------

David.
That looks like a perfect job for (g)awk, which is included in every Linux distribution but also available for Windows. It can be called with something like

    system("awk -f script.awk inputfile.txt")

and does its job silently and very fast; 650MB should not be an issue (see the sketch after the quoted message below). I'm not proficient in awk but would offer my help anyway (off-list...).

Rgds,
Rainer

On Wednesday 14 September 2011 13:08:14 Stefan McKinnon Høj-Edwards wrote:
> [original message snipped]
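P.S. A minimal sketch of the awk route (untested; the condition '$3 > 0' is just a placeholder for whatever test identifies the wanted lines):

    # Write a reduced file with awk, then read it with read.table:
    system("awk '$3 > 0' inputfile.txt > filtered.txt")
    dat <- read.table("filtered.txt")

    # Or skip the intermediate file and read awk's output through a pipe:
    dat <- read.table(pipe("awk '$3 > 0' inputfile.txt"))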