Hi,

I want to perform some analysis on subsets of huge data files. There are
20 of the files and I want to select the same subsets of each one (each
subset is a chunk of 1500 or so consecutive rows from several million). To
save time and processing power, is there a method to tell R to *only* read
in these rows, rather than reading in the entire dataset and then selecting
subsets and deleting the extraneous data? This method takes a rather silly
amount of time and results in memory problems.

I am using R 1.9.0 on SuSE 9.0.

Thanks in advance!

Laura Quinn
Institute of Atmospheric Science
School of Earth and Environment
University of Leeds
Leeds LS2 9JT

tel:  +44 113 343 1596
fax:  +44 113 343 6716
mail: laura at env.leeds.ac.uk
Laura Quinn wrote:
> Hi,
>
> I want to perform some analysis on subsets of huge data files. There are
> 20 of the files and I want to select the same subsets of each one (each
> subset is a chunk of 1500 or so consecutive rows from several million). To
> save time and processing power, is there a method to tell R to *only* read
> in these rows, rather than reading in the entire dataset and then selecting
> subsets and deleting the extraneous data? This method takes a rather silly
> amount of time and results in memory problems.
>
> I am using R 1.9.0 on SuSE 9.0.
>
> Thanks in advance!

Hi Laura,

If you know which row of the file your subset starts at, and how many
lines you want to read in, you can use scan() with the arguments skip
and nlines (see ?scan).

A better way, which gets recommended a lot on this list, is to store your
data in a database and use one of the various R packages and/or tools that
can connect to your database and extract only the rows you need.

See the R Data Import/Export manual for more on scan and on using
relational databases with R.

Hope this helps,

Gav

--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson                     [T] +44 (0)20 7679 5522
ENSIS Research Fellow             [F] +44 (0)20 7679 7565
ENSIS Ltd. & ECRC                 [E] gavin.simpson at ucl.ac.uk
UCL Department of Geography       [W] http://www.ucl.ac.uk/~ucfagls/cv/
26 Bedford Way                    [W] http://www.ucl.ac.uk/~ucfagls/
London.  WC1H 0AP.
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
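As a concrete illustration of the scan() suggestion above, here is a
minimal sketch. The file name "bigdata.txt" and the assumption of one
header line and five whitespace-separated numeric columns are purely
illustrative:

  ## Read only rows 500001-501500: skip the header line plus the first
  ## 500000 data rows, then read the next 1500 lines.
  chunk <- scan("bigdata.txt",
                what = as.list(rep(0, 5)),  # template: 5 numeric columns
                skip = 500001,              # 1 header line + 500000 rows
                nlines = 1500)
  names(chunk) <- paste("V", 1:5, sep = "")
  chunk <- as.data.frame(chunk)

Note that scan() still has to read past the skipped lines to find the
newlines, so this mainly saves memory rather than raw I/O time.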
On Wed, 6 Oct 2004, Laura Quinn wrote:
> Hi,
>
> I want to perform some analysis on subsets of huge data files. There are
> 20 of the files and I want to select the same subsets of each one (each
> subset is a chunk of 1500 or so consecutive rows from several million). To
> save time and processing power, is there a method to tell R to *only* read
> in these rows, rather than reading in the entire dataset and then selecting
> subsets and deleting the extraneous data? This method takes a rather silly
> amount of time and results in memory problems.

It depends on the data format. If, for example, you have free-format text
files, it isn't possible to locate a specific chunk without reading all the
earlier entries. You can still save time and space by having some other
program (Perl?) read the file and spit out a file with just the 1500 rows
you want.

A better strategy would be for the data to be either in a database or in a
format such as netCDF designed for random access.

-thomas
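To make the preprocessing idea concrete, here is a hedged sketch that uses
sed in place of Perl and drives it from within R via system(); the file
names and line numbers are invented for illustration:

  ## Extract lines 500001-501500 from each of the 20 big files into a
  ## small file, then read only the small files into R.  The trailing
  ## "501501q" makes sed quit without scanning the rest of the file.
  for (i in 1:20) {
    big   <- paste("bigdata",   i, ".txt", sep = "")
    small <- paste("smalldata", i, ".txt", sep = "")
    system(paste("sed -n '500001,501500p;501501q'", big, ">", small))
  }
  chunk <- read.table("smalldata1.txt")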
1) Use the skip= and nrows= arguments to read.table().

2) Open a connection, read and discard rows, read the block you want, then
   close the connection. (Which is essentially how 1 works.)

3) Use perl, awk or some such to extract the rows you want -- this is
   probably rather faster.

On Wed, 6 Oct 2004, Laura Quinn wrote:

> I want to perform some analysis on subsets of huge data files. There are
> 20 of the files and I want to select the same subsets of each one (each
> subset is a chunk of 1500 or so consecutive rows from several million). To
> save time and processing power, is there a method to tell R to *only* read
> in these rows, rather than reading in the entire dataset and then selecting
> subsets and deleting the extraneous data? This method takes a rather silly
> amount of time and results in memory problems.
>
> I am using R 1.9.0 on SuSE 9.0.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
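Minimal sketches of options 1 and 2 above (file name and offsets again
invented; both assume the block starts at data row 500001 of a file with
one header line):

  ## Option 1: let read.table() do the skipping.  skip= counts physical
  ## lines, so the header line is included in the count and the columns
  ## get default names V1, V2, ...
  chunk <- read.table("bigdata.txt", skip = 500001, nrows = 1500)

  ## Option 2: the same by hand on a connection -- read and discard the
  ## leading lines, read the block, then close the connection.
  con <- file("bigdata.txt", open = "r")
  invisible(readLines(con, n = 500001))   # read and throw away
  chunk <- read.table(con, nrows = 1500)
  close(con)

Option 2 becomes worthwhile when several chunks are pulled from the same
file, since the connection stays open and reading continues from the
current position.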
On 06-Oct-04 Laura Quinn wrote:

> I want to perform some analysis on subsets of huge data files.
> There are 20 of the files and I want to select the same subsets
> of each one (each subset is a chunk of 1500 or so consecutive
> rows from several million).
> To save time and processing power, is there a method to tell R
> to *only* read in these rows, rather than reading in the entire
> dataset and then selecting subsets and deleting the extraneous
> data? This method takes a rather silly amount of time and
> results in memory problems.
>
> I am using R 1.9.0 on SuSE 9.0.

Hi Laura,

If there is a neat time- and memory-efficient R solution then I'm sure
someone will tell you! But since you're using Linux, I can suggest an
alternative, which is to use some combination of the Unix file utilities
you will already have in your SuSE, entering them as a command line at
the system prompt, or executing a shell script file which contains the
command.

For example, to read just lines (say) 500001-501500 you could use

  cat bigdata | head -501500 | tail -1500 > smalldata

which reads the first 501500 lines of bigdata, then takes the last 1500
lines of these, and directs the result into the file smalldata.

That's OK for a single chunk of 1500, but suppose (as seems might be the
case) you want, say, the first line of the file (for names) and 5 chunks
of 1500 starting at lines 100001, 200001, 300001, 400001 and 500001
respectively. Then awk will do what you want, along the lines of

  cat bigdata | awk '
    { nr = NR
      if ( (nr == 1) ||
           ((nr >= 100001) && (nr <= 101500)) ||
           ((nr >= 200001) && (nr <= 201500)) ||
           ((nr >= 300001) && (nr <= 301500)) ||
           ((nr >= 400001) && (nr <= 401500)) ||
           ((nr >= 500001) && (nr <= 501500)) ) { print $0 }
      else { next }
    }' > smalldata

(The above can be typed in as shown, and will be a single command.)

Having done this, you can then use smalldata as the dataset to read into
R, instead of bigdata.

These are just examples of what can be done externally using such
utilities. (Now, whether or not there's a simple R workaround, I shall
undoubtedly be trumped by some perl freak.)

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 06-Oct-04   Time: 19:58:32
------------------------------ XFMail ------------------------------
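The same pipeline can also be driven from inside R through a pipe()
connection, so no intermediate file is needed at all (a sketch; the file
name is again illustrative, and pipe() requires a Unix-alike, which covers
the SuSE system in question):

  ## Read lines 500001-501500 straight from the shell pipeline.
  chunk <- read.table(pipe("head -501500 bigdata | tail -1500"))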