I have some code that can potentially produce a huge number of large-ish R data frames, each with a different number of rows. All the data frames together will be way too big to keep in R's memory, but we'll assume a single one is manageable. It's just when there are a million of them that the machine might start to burn up.

However, I might, for example, want to compute some averages over the elements in the data frames. Or I might want to sample ten of them at random and do some plots. What I need is rapid random access to data stored in external files.

Here are some ideas I've had:

* Store all the data in an HDF5 file. The problem here is that the current HDF package for R reads the whole file in at once.

* Store the data in some other custom binary format with an index for rapid access to the N-th element. Problems: it feels like reinventing HDF, cross-platform issues, etc.

* Store the data in a number of .RData files in a directory. To get the N-th element, just attach(paste("foo/A-", n, ".RData", sep="")), give or take a parameter or two (a rough sketch is below).

* Use a database. Seems a bit heavyweight, but maybe RSQLite could work in order to keep it local.

What I'm currently doing is keeping it OO enough that I can in theory implement all of the above. At the moment I have an implementation that does keep them all in R's memory as a list of data frames, which is fine for small test cases, but things are going to get big shortly. Any other ideas or hints are welcome.

thanks

Barry
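P.S. For the .RData-files-in-a-directory option, a minimal sketch of what I mean might look like the following (the store_df/fetch_df names and the "foo" directory are just made up for illustration; loading into a fresh environment avoids cluttering the search path the way repeated attach() calls would):

store_df <- function(df, n, dir = "foo") {
    # one .RData file per data frame, named by its index
    if (!file.exists(dir)) dir.create(dir)
    save(df, file = file.path(dir, paste("A-", n, ".RData", sep = "")))
}

fetch_df <- function(n, dir = "foo") {
    # load the n-th file into a throwaway environment and return the frame
    e <- new.env()
    load(file.path(dir, paste("A-", n, ".RData", sep = "")), envir = e)
    e$df    # save() stored the object under the name 'df'
}

# e.g.  store_df(data.frame(x = rnorm(10)), 1);  d <- fetch_df(1)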
On Dec 14, 2007 1:01 PM, Barry Rowlingson <b.rowlingson@lancaster.ac.uk> wrote:

> I have some code that can potentially produce a huge number of large-ish
> R data frames, each with a different number of rows. [...] What I need is
> rapid random access to data stored in external files.

Unless you really need this to be a general solution, I would suggest using a database. And if you use one that allows you to create functions within it, you can even keep some of the calculations on the server side (which may be a performance advantage). If you are doing a lot of this, you might consider Postgres and pl/R, which embeds R in the database.

Sean
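If you want to stay local, something along the lines of the following RSQLite sketch would give you both random access and in-database aggregation (the table and column names, "frames" and "frame_id", are invented for illustration):

library(RSQLite)

con <- dbConnect(dbDriver("SQLite"), dbname = "frames.sqlite")

# pretend these are the generated frames; tag each with its index and
# append them all to one table
for (n in 1:3) {
    df <- data.frame(x = rnorm(5), y = rnorm(5))
    df$frame_id <- n
    dbWriteTable(con, "frames", df, append = (n > 1), row.names = FALSE)
}

# rapid random access to the n-th frame
one <- dbGetQuery(con, "SELECT * FROM frames WHERE frame_id = 2")

# per-frame averages computed inside the database rather than in R
avgs <- dbGetQuery(con,
    "SELECT frame_id, AVG(x) AS mean_x FROM frames GROUP BY frame_id")

dbDisconnect(con)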
Barry Rowlingson wrote:

> I have some code that can potentially produce a huge number of large-ish
> R data frames, each with a different number of rows. All the data frames
> together will be way too big to keep in R's memory, but we'll assume a
> single one is manageable. [...]

This is exactly the type of situation that the trackObjs package is designed for. It will automatically (and invisibly) store each object in its own .RData file, so that objects can be accessed as ordinary R objects but are not kept in memory (there are options to control whether or not objects are cached in memory). It also caches some characteristics of objects, so that a brief summary of them can be provided without having to read each one.

The g.data package and the filehash package do similar things with respect to providing automatic access to objects in .RData files (and were part of the inspiration for the trackObjs package).

-- Tony Plate
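To give a rough flavour of this family of packages, a minimal filehash sketch might look like the following (the key names are made up, and trackObjs and g.data have their own, somewhat different interfaces):

library(filehash)

# create a file-backed database and store each frame under its own key;
# only the frames you fetch are read back into memory
dbCreate("framesDB")
db <- dbInit("framesDB")

dbInsert(db, "frame_1", data.frame(x = rnorm(5)))
dbInsert(db, "frame_2", data.frame(x = rnorm(7)))

one <- dbFetch(db, "frame_2")
mean(one$x)

dbList(db)    # keys currently stored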