Hi,

I want to perform some analysis on subsets of huge data files. There are
20 of the files and I want to select the same subsets of each one (each
subset is a chunk of 1500 or so consecutive rows from several million). To
save time and processing power, is there a method to tell R to *only* read
in these rows, rather than reading in the entire dataset and then selecting
subsets and deleting the extraneous data? This method takes a rather silly
amount of time and results in memory problems.

I am using R 1.9.0 on SuSE 9.0.

Thanks in advance!

Laura Quinn
Institute of Atmospheric Science
School of Earth and Environment
University of Leeds
Leeds LS2 9JT

tel:  +44 113 343 1596
fax:  +44 113 343 6716
mail: laura at env.leeds.ac.uk
Laura Quinn wrote:
> Hi,
>
> I want to perform some analysis on subsets of huge data files. There are
> 20 of the files and I want to select the same subsets of each one (each
> subset is a chunk of 1500 or so consecutive rows from several million). To
> save time and processing power, is there a method to tell R to *only* read
> in these rows, rather than reading in the entire dataset and then selecting
> subsets and deleting the extraneous data? This method takes a rather silly
> amount of time and results in memory problems.
>
> I am using R 1.9.0 on SuSE 9.0.
>
> Thanks in advance!

Hi Laura,

If you know which row of the file your subset starts at, and how many
lines you want to read in, you can use scan() with the arguments skip
and nlines (see ?scan).

A better way, which gets recommended a lot on this list, is to store your
data in a database and use one of the various R packages and/or tools that
can connect to your database and extract only the rows you need.

See the R Data Import/Export manual for more on scan and on using
relational databases with R.

Hope this helps,

Gav

--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson                     [T] +44 (0)20 7679 5522
ENSIS Research Fellow             [F] +44 (0)20 7679 7565
ENSIS Ltd. & ECRC                 [E] gavin.simpson at ucl.ac.uk
UCL Department of Geography       [W] http://www.ucl.ac.uk/~ucfagls/cv/
26 Bedford Way                    [W] http://www.ucl.ac.uk/~ucfagls/
London.  WC1H 0AP.
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
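As a concrete illustration of the scan() suggestion above, here is a
minimal sketch. The file name "bigdata.txt" and the assumption of one
header line and five whitespace-separated numeric columns are purely
illustrative:

  ## Read only rows 500001-501500: skip the header line plus the first
  ## 500000 data rows, then read the next 1500 lines.
  chunk <- scan("bigdata.txt",
                what = as.list(rep(0, 5)),  # template: 5 numeric columns
                skip = 500001,              # 1 header line + 500000 rows
                nlines = 1500)
  names(chunk) <- paste("V", 1:5, sep = "")
  chunk <- as.data.frame(chunk)

Note that scan() still has to read past the skipped lines to find the
newlines, so this mainly saves memory rather than raw I/O time.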
On Wed, 6 Oct 2004, Laura Quinn wrote:
> Hi,
>
> I want to perform some analysis on subsets of huge data files. There are
> 20 of the files and I want to select the same subsets of each one (each
> subset is a chunk of 1500 or so consecutive rows from several million). To
> save time and processing power, is there a method to tell R to *only* read
> in these rows, rather than reading in the entire dataset and then selecting
> subsets and deleting the extraneous data? This method takes a rather silly
> amount of time and results in memory problems.

It depends on the data format. If, for example, you have free-format text
files, it isn't possible to locate a specific chunk without reading all the
earlier entries. You can still save time and space by having some other
program (Perl?) read the file and spit out a file with just the 1500 rows
you want.

A better strategy would be for the data to be either in a database or in a
format such as netCDF designed for random access.

-thomas
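To make the preprocessing idea concrete, here is a hedged sketch that uses
sed in place of Perl and drives it from within R via system(); the file
names and line numbers are invented for illustration:

  ## Extract lines 500001-501500 from each of the 20 big files into a
  ## small file, then read only the small files into R.  The trailing
  ## "501501q" makes sed quit without scanning the rest of the file.
  for (i in 1:20) {
    big   <- paste("bigdata",   i, ".txt", sep = "")
    small <- paste("smalldata", i, ".txt", sep = "")
    system(paste("sed -n '500001,501500p;501501q'", big, ">", small))
  }
  chunk <- read.table("smalldata1.txt")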
1) Use the skip= and nrows= arguments to read.table().

2) Open a connection, read and discard rows, read the block you want, then
   close the connection. (Which is essentially how 1 works.)

3) Use perl, awk or some such to extract the rows you want -- this is
   probably rather faster.

On Wed, 6 Oct 2004, Laura Quinn wrote:

> I want to perform some analysis on subsets of huge data files. There are
> 20 of the files and I want to select the same subsets of each one (each
> subset is a chunk of 1500 or so consecutive rows from several million). To
> save time and processing power, is there a method to tell R to *only* read
> in these rows, rather than reading in the entire dataset and then selecting
> subsets and deleting the extraneous data? This method takes a rather silly
> amount of time and results in memory problems.
>
> I am using R 1.9.0 on SuSE 9.0.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
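Minimal sketches of options 1 and 2 above (file name and offsets again
invented; both assume the block starts at data row 500001 of a file with
one header line):

  ## Option 1: let read.table() do the skipping.  skip= counts physical
  ## lines, so the header line is included in the count and the columns
  ## get default names V1, V2, ...
  chunk <- read.table("bigdata.txt", skip = 500001, nrows = 1500)

  ## Option 2: the same by hand on a connection -- read and discard the
  ## leading lines, read the block, then close the connection.
  con <- file("bigdata.txt", open = "r")
  invisible(readLines(con, n = 500001))   # read and throw away
  chunk <- read.table(con, nrows = 1500)
  close(con)

Option 2 becomes worthwhile when several chunks are pulled from the same
file, since the connection stays open and reading continues from the
current position.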
On 06-Oct-04 Laura Quinn wrote:

> I want to perform some analysis on subsets of huge data files.
> There are 20 of the files and I want to select the same subsets
> of each one (each subset is a chunk of 1500 or so consecutive
> rows from several million).
> To save time and processing power, is there a method to tell R
> to *only* read in these rows, rather than reading in the entire
> dataset and then selecting subsets and deleting the extraneous
> data? This method takes a rather silly amount of time and
> results in memory problems.
>
> I am using R 1.9.0 on SuSE 9.0.

Hi Laura,

If there is a neat time- and memory-efficient R solution then I'm sure
someone will tell you! But since you're using Linux, I can suggest an
alternative, which is to use some combination of the Unix file utilities
you will already have in your SuSE, entering them as a command line at
the system prompt, or executing a shell script file which contains the
command.

For example, to read just lines (say) 500001-501500 you could use

  cat bigdata | head -501500 | tail -1500 > smalldata

which reads the first 501500 lines of bigdata, then takes the last 1500
lines of these, and directs the result into the file smalldata.

That's OK for a single chunk of 1500, but suppose (as seems might be the
case) you want, say, the first line of the file (for names) and 5 chunks
of 1500 starting at lines 100001, 200001, 300001, 400001 and 500001
respectively. Then awk will do what you want, along the lines of

  cat bigdata | awk '
    { nr = NR
      if ( (nr == 1) ||
           ((nr >= 100001) && (nr <= 101500)) ||
           ((nr >= 200001) && (nr <= 201500)) ||
           ((nr >= 300001) && (nr <= 301500)) ||
           ((nr >= 400001) && (nr <= 401500)) ||
           ((nr >= 500001) && (nr <= 501500)) ) { print $0 }
      else { next }
    }' > smalldata

(The above can be typed in as shown, and will be a single command.)

Having done this, you can then use smalldata as the dataset to read into
R, instead of bigdata.

These are just examples of what can be done externally using such
utilities. (Now, whether or not there's a simple R workaround, I shall
undoubtedly be trumped by some perl freak.)

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 06-Oct-04   Time: 19:58:32
------------------------------ XFMail ------------------------------
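The same pipeline can also be driven from inside R through a pipe()
connection, so no intermediate file is needed at all (a sketch; the file
name is again illustrative, and pipe() requires a Unix-alike, which covers
the SuSE system in question):

  ## Read lines 500001-501500 straight from the shell pipeline.
  chunk <- read.table(pipe("head -501500 bigdata | tail -1500"))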