I've been wondering how to work with more data than can fit in memory, in a way that allows it to be worked with conveniently and quickly. Of course, a database can be used for this purpose, but extracting data from a database is much slower and somewhat less convenient than extracting data from a native object (at least in our setup).

One idea I was thinking about was to have a new class of object that referred to data in a file on disk, and which had all the standard methods of matrices and arrays, i.e., subsetting ("["), dim, dimnames, etc. The object in memory would store only the array attributes, while the actual array data (the elements) would reside in a file. When an extraction method was called, it would access the data in the file and return the appropriate part. With sensible use of seek operations, the data access could probably be quite fast. The file format of the object on disk could possibly be the standard serialized binary format used in .RData files. Of course, if the object were larger than would fit in memory, then trying to extract too large a subarray would exhaust memory, but it should be possible to efficiently extract reasonably sized subarrays. To be more useful, one would want apply() to work with such arrays. That would be doable, either by creating a new method for apply, or possibly just for aperm.

Some difficulties might arise with functions like "typeof", "mode", and "storage.mode" -- what should these return for such an object? Such a class would probably break common relationships such as x[1] having the same storage mode as x. I don't know whether difficulties like these would ultimately make such a "virtual array" class unworkable.

Does anyone have any opinions as to the merits of this idea? Would there be any interest in seeing such a class in R?

-- Tony Plate
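For concreteness, here is a minimal sketch of the idea (the name varray(), the flat column-major file of doubles written with writeBin(), and the column-only extraction are all illustrative assumptions, not a worked-out design):

## A read-only "virtual matrix" of doubles: dim lives in memory, the
## elements live in a flat binary file, and "[" fetches a column with
## seek() + readBin().
varray <- function(file, dim) {
  structure(list(file = file, dim = dim), class = "varray")
}

dim.varray <- function(x) x$dim

"[.varray" <- function(x, i, j, ...) {
  con <- file(x$file, "rb")
  on.exit(close(con))
  nr <- x$dim[1]
  seek(con, where = (j - 1) * nr * 8)           # 8 bytes per double
  col <- readBin(con, what = "double", n = nr)  # read one whole column
  if (missing(i)) col else col[i]
}

## Round-trip check: write a 1000 x 50 matrix to disk, read part of column 7 back.
m <- matrix(rnorm(1000 * 50), 1000, 50)
f <- tempfile()
writeBin(as.vector(m), f)
va <- varray(f, c(1000, 50))
all.equal(va[, 7], m[, 7])         # TRUE
all.equal(va[1:5, 7], m[1:5, 7])   # TRUE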
> Does anyone have any opinions as to the merits of this idea? Would there
> be any interest in seeing such a class in R?

Have you looked at the 'externalVector' package in Bioconductor? I'm admittedly not super familiar with it, although my understanding of how it works and what it does seems pretty similar to what you are describing.

-J
On Mon, 23 Aug 2004, Tony Plate wrote:

> One idea I was thinking about was to have a new class of object that
> referred to data in a file on disk, and which had all the standard methods
> of matrices and arrays, i.e., subsetting ("["), dim, dimnames, etc. The
> object in memory would only store the array attributes, while the actual
> array data (the elements) would reside in a file. When some extraction
> method was called, it would access data in the file and return the
> appropriate data. With sensible use of seek operations, the data access
> could probably be quite fast. The file format of the object on disk could
> possibly be the standard serialized binary format as used in .RData
> files. Of course, if the object was larger than would fit in memory, then
> trying to extract too large a subarray would exhaust memory, but it should
> be possible to efficiently extract reasonably sized subarrays. To be more
> useful, one would want apply() to work with such arrays. That would
> be doable, either by creating a new method for apply, or possibly just for
> aperm.

This is what RPgSQL does with proxy data frames and what I did (read-only) for netCDF access. It's a good idea if you have a data format for which random access is fairly fast. I'm not sure that the standard serialized binary format satisfies this. Fixed-format text files would work, but free-format ones wouldn't -- seek() only helps when you can work out where to seek without reading all the data.

-thomas
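To illustrate the fixed-format point (the file layout and the read_record() helper are assumptions for illustration, not anything from the packages mentioned): when every record occupies the same number of bytes, the offset of record n is simple arithmetic, so seek() can jump straight to it without scanning the file.

## Each record is exactly 11 bytes: a 10-character fixed-width number plus a
## one-byte newline (the file is written via a binary-mode connection so the
## newline really is one byte on every platform).
f <- tempfile()
con <- file(f, "wb")
writeLines(sprintf("%10.4f", rnorm(1e4)), con)
close(con)

read_record <- function(file, n, rec = 11L) {
  con <- file(file, "rb")
  on.exit(close(con))
  seek(con, where = (n - 1) * rec)   # offset computable without reading anything
  as.numeric(readChar(con, rec - 1L))
}

read_record(f, 5000)   # the 5000th value, without reading the 4999 before it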
We've seen something very similar before. A file is just a database writ small, and RPgSQL did essentially this ca. 2001. Fei Chen's research system (see his talk at DSC 2003) again does similar things. Both of those were for virtual data frames as I recall.

It's easier for a matrix/array, but I am not sure it is necessary. At least on Unix-alikes one could arrange to memory-map large vectors (logical, integer, double or raw). Character vectors would be a little harder, but a char ** version could be memory-mapped. I presume you don't mean matrices/arrays of lists?

I think you need to show examples where an array/matrix is the right structure and the sort of access required. Accessing rows in a matrix would cause severe page-faulting, for example (and that's the sort of reason we recommend DBMSes).

On Mon, 23 Aug 2004, Tony Plate wrote:

> I've been wondering how to work with more data than can fit in memory, in a
> way that allows it to be worked with conveniently and quickly. Of course,
> a database can be used for this purpose, but extracting data from a
> database is much slower and somewhat less convenient than extracting data
> from a native object (at least in our setup).
>
> One idea I was thinking about was to have a new class of object that
> referred to data in a file on disk, and which had all the standard methods
> of matrices and arrays, i.e., subsetting ("["), dim, dimnames, etc. The
> object in memory would only store the array attributes, while the actual
> array data (the elements) would reside in a file. When some extraction
> method was called, it would access data in the file and return the
> appropriate data. With sensible use of seek operations, the data access
> could probably be quite fast. The file format of the object on disk could
> possibly be the standard serialized binary format as used in .RData
> files. Of course, if the object was larger than would fit in memory, then
> trying to extract too large a subarray would exhaust memory, but it should
> be possible to efficiently extract reasonably sized subarrays. To be more
> useful, one would want apply() to work with such arrays. That would
> be doable, either by creating a new method for apply, or possibly just for
> aperm.
>
> Some difficulties that might arise could have to do with functions like
> "typeof" and "mode" and "storage.mode" -- what should these return for such
> an object? Such a class would probably break the common relationships such
> as x[1] having the same storage mode as x. I don't know if difficulties
> like these would ultimately make such a "virtual array" class unworkable.
>
> Does anyone have any opinions as to the merits of this idea? Would there
> be any interest in seeing such a class in R?

--
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
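A back-of-the-envelope illustration of the row-access point (the matrix size is hypothetical): R stores matrices column-major, so in a memory-mapped N x M matrix of doubles the adjacent elements of one row are 8 * N bytes apart -- far more than one page -- and reading a single row touches a separate page in every column.

## Byte offset of x[i, j] in a column-major array of doubles.
offset <- function(i, j, N) 8 * ((j - 1) * N + (i - 1))

N <- 1e6; M <- 100                 # a 1e6 x 100 double matrix, ~800 MB
diff(offset(1, 1:3, N))            # consecutive row-1 elements: 8e6 bytes apart
(8 * N) / 4096                     # ~1953 4 KB pages between them, so one row
                                   # visits M = 100 widely separated pages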