Maciej Radziejewski
2007-Mar-09 23:47 UTC
[R] Using large datasets: can I overload the subscript operator?
Hello,

I do some computations on datasets that come from climate models. These data are huge arrays, significantly larger than typically available RAM, so they have to be accessed row by row, or rather slice by slice, depending on the task. I would like to make an R package to easily access such datasets within R. The C++ backend is ready and being used under Windows/.NET/Visual Basic, but I have yet to learn the specifics of R programming to make a good R interface.

I think it should be possible to make a package (call it "slice") that could be used like this:

    library(slice)
    dataset <- load.virtualarray("dataset_definition.xml")
    ordinaryvector <- dataset[, 2, 3]  # Load a portion of the data from disk and extract it

In the above, "dataset" is an object that holds a definition of a large 3-dimensional dataset, and "ordinaryvector" is an ordinary R vector. The subscripting operator fetches the necessary data from disk and extracts the required slice, taking care of caching and other technical details. So, my questions are:

Has anyone ever made a similar extension, with virtual (lazy) arrays?

Can the subscript operator be overloaded like that in R? (I know it can be in S, at least for vectors.)

And a tough one: is it possible to make an expression like "[1]" (without quotes) meaningful in R? At the moment it results in a syntax error. I would like to make it return an object of a special class that gets interpreted, when subscripting my virtual array, as "drop this dimension", like this:

    dataset[, 2, 3, drop = F]      # Return a 3-dimensional array
    dataset[, [2], 3, drop = F]    # Return a 2-dimensional array
    dataset[, [2], [3], drop = F]  # Return a 1-dimensional array, like dataset[, 2, 3]

Thanks in advance for any help,

Maciej.
Duncan Murdoch
2007-Mar-10 02:54 UTC
[R] Using large datasets: can I overload the subscript operator?
On 3/9/2007 6:47 PM, Maciej Radziejewski wrote:
> Has anyone ever made a similar extension, with virtual (lazy) arrays?

Yes, e.g. the SQLiteDF package.

> Can the subscript operator be overloaded like that in R? (I know it can be in
> S, at least for vectors.)

Yes.

> And a tough one: is it possible to make an expression like "[1]" (without
> quotes) meaningful in R? At the moment it results in a syntax error. I
> would like to make it return an object of a special class that gets
> interpreted when subscripting my virtual array as "drop this dimension",
> like this:
>
> dataset[, 2, 3, drop = F]      # Return a 3-dimensional array
> dataset[, [2], 3, drop = F]    # Return a 2-dimensional array
> dataset[, [2], [3], drop = F]  # Return a 1-dimensional array, like dataset[, 2, 3]

No, that's not legal S or R syntax. However, you might be able to define a special object D and use syntax like

    dataset[, D[2], 3, drop = F]

Duncan Murdoch
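A minimal sketch of the marker-object idea suggested above, written against an ordinary in-memory array rather than the virtual array from the question. The class names `dropper` and `dropindex` and the helper `extract_marked` are hypothetical, made up for illustration; only the `D[2]` syntax comes from the thread.

```r
# Hypothetical marker object: subscripting D wraps an index in a small
# tagged object meaning "select this index and drop the dimension".
D <- structure(list(), class = "dropper")

"[.dropper" <- function(x, i) {
  structure(list(index = i), class = "dropindex")
}

is.dropindex <- function(x) inherits(x, "dropindex")

# Extract from an ordinary array `a`. `idx` holds one entry per dimension;
# TRUE selects the whole dimension. Only dimensions whose index was wrapped
# with D[...] (and that have extent 1 in the result) are dropped.
extract_marked <- function(a, idx) {
  marked <- vapply(idx, is.dropindex, logical(1))
  plain  <- lapply(idx, function(v) if (is.dropindex(v)) v$index else v)
  res    <- do.call(`[`, c(list(a), plain, list(drop = FALSE)))
  keep   <- !(marked & dim(res) == 1L)
  if (!any(keep)) return(as.vector(res))  # every dimension was dropped
  array(res, dim = dim(res)[keep])
}

a <- array(1:24, dim = c(2, 3, 4))
dim(extract_marked(a, list(TRUE, 2, 3)))        # 2 1 1
dim(extract_marked(a, list(TRUE, D[2], 3)))     # 2 1
dim(extract_marked(a, list(TRUE, D[2], D[3])))  # 2
```

In a real package the same check would presumably live inside the `[` method of the virtual array class, so that `dataset[, D[2], 3, drop = F]` could be written directly.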
Roger Bivand
2007-Mar-10 07:43 UTC
[R] Using large datasets: can I overload the subscript operator?
On Sat, 10 Mar 2007, Maciej Radziejewski wrote:
> Hello,

The http://www.met.rdg.ac.uk/cag/rclim/ site may have some useful leads. In addition, you'll find ideas in two packages created by Tim Keitt, rgdal, and Rdbi+RdbiPgSQL (now on Bioconductor).

> The C++ backend is ready and being used under Windows/.Net/Visual
> Basic, but I have yet to learn the specifics of R programming to make a good
> R interface.

Look at the Matrix package for examples. You may need finalizers to tidy up memory allocation; see examples in rgdal. The key thing will be thinking through how to implement the R objects as classes, probably not simply reflecting the C++ classes. Classes are covered in the Green Book (Chambers 1998) and in Venables & Ripley (2000), S Programming.

> Can the subscript operator be overloaded like that in R? (I know it can be in
> S, at least for vectors.)

Yes, there are many examples; see the Matrix package for some that use new-style classes (in language issues like this, R is S; the differences are in scoping).

> And a tough one: is it possible to make an expression like "[1]" (without
> quotes) meaningful in R? At the moment it results in a syntax error. I
> would like to make it return an object of a special class that gets
> interpreted when subscripting my virtual array as "drop this dimension",
> like this:

Most likely not in this context, because "[" in this context will not be what you want. But if your "[.dataset" method is careful about examining its arguments, you ought to be able to get the result you want. You'll likely learn a good deal from looking, for example, at the code in the Matrix package.

-- 
Roger Bivand
Economic Geography Section, Department of Economics,
Norwegian School of Economics and Business Administration,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no
Gabor Grothendieck
2007-Mar-10 13:21 UTC
[R] Using large datasets: can I overload the subscript operator?
On 3/9/07, Maciej Radziejewski <maciej.rhelp at gmail.com> wrote:
> Has anyone ever made a similar extension, with virtual (lazy) arrays?

Not quite the same, but you might look at the g.data delayed data package in case it's good enough for your needs. Note the dot: gdata without a dot is a different package.

> Can the subscript operator be overloaded like that in R? (I know it can be in
> S, at least for vectors.)

Yes. You make your objects a class, myclass, and then define

    "[.myclass" <- function...

for myclass in the S3 class system, and similarly in S4. S3 is easier to develop for and has higher performance, so you probably want that rather than S4. A few example packages are XML (see "[.XMLNode"), fame and zoo for S3, and 'its' for S4.
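As a rough illustration of the S3 approach described above, here is a sketch of a `"[.myclass"` method for a three-dimensional lazy array. The constructor `lazy_array` and its `fetch` argument are hypothetical stand-ins for the real disk-reading backend, which the thread does not show; here `fetch` simply reads from an in-memory array.

```r
# Hypothetical constructor: the object stores only the dimensions and a
# `fetch` function that stands in for the real backend.
lazy_array <- function(dims, fetch) {
  structure(list(dims = dims, fetch = fetch), class = "myclass")
}

# S3 method for `[`. A missing index selects the whole dimension, as in
# dataset[, 2, 3]. Data is requested from `fetch` only at this point.
"[.myclass" <- function(x, i, j, k, drop = TRUE) {
  d <- x$dims
  if (missing(i)) i <- seq_len(d[1])
  if (missing(j)) j <- seq_len(d[2])
  if (missing(k)) k <- seq_len(d[3])
  out <- x$fetch(i, j, k)            # expected to return a 3-d array
  if (drop) base::drop(out) else out
}

# Stand-in backend: pretend `backing` lives on disk.
backing <- array(seq_len(24), dim = c(2, 3, 4))
dataset <- lazy_array(dim(backing),
                      function(i, j, k) backing[i, j, k, drop = FALSE])

ordinaryvector <- dataset[, 2, 3]           # plain integer vector: 15 16
slab <- dataset[, 2, 3, drop = FALSE]       # 2 x 1 x 1 array
```

Because the method never subsets `x` itself with `[`, there is no risk of recursive dispatch; a method that needs the default behaviour on the underlying object can use `unclass()` or `.subset()` first.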
Be sure to check out ?.subset. See this post for context: http://tolstoy.newcastle.edu.au/R/devel/05/05/0853.html

> And a tough one: is it possible to make an expression like "[1]" (without
> quotes) meaningful in R? At the moment it results in a syntax error. I
> would like to make it return an object of a special class that gets
> interpreted when subscripting my virtual array as "drop this dimension",
> like this:
>
> dataset[, 2, 3, drop = F]      # Return a 3-dimensional array
> dataset[, [2], 3, drop = F]    # Return a 2-dimensional array
> dataset[, [2], [3], drop = F]  # Return a 1-dimensional array, like dataset[, 2, 3]

No, but one idea is to define the single letter . (i.e. a dot) to be of a special class, dot say, and define "[.dot" to produce objects of a special class (maybe also "dot"). Then you could write

    dataset[, .[2], .[3], drop = FALSE]

if you define "[.myclass" to look for such objects. Another possibility is to use formula notation:

    dataset[, ~2, ~3, drop = FALSE]

and have "[.myclass" handle formula arguments specially, or perhaps forget about that notation and just extend drop:

    dataset[drop = 2:3]

BTW, it's better to use FALSE rather than F, since F can be a variable name.
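The formula-notation variant can be sketched as follows, again on an ordinary in-memory array. The class name `fslice` and the helper `resolve_index` are hypothetical, and the rule "a one-sided formula index means drop this dimension" is one possible reading of the suggestion above, not an established convention.

```r
# Resolve one index. A one-sided formula such as ~2 is read as "select 2 and
# drop this dimension"; any other index keeps its dimension.
resolve_index <- function(ix) {
  if (inherits(ix, "formula")) {
    stopifnot(length(ix) == 2L)  # one-sided formulas only
    list(index = eval(ix[[2]], environment(ix)), drop = TRUE)
  } else {
    list(index = ix, drop = FALSE)
  }
}

# Hypothetical class wrapping a 3-d array; its `[` method drops only the
# dimensions that were indexed with a formula.
"[.fslice" <- function(x, i, j, k) {
  a <- unclass(x)  # plain array, so the default `[` is used below
  d <- dim(a)
  if (missing(i)) i <- seq_len(d[1])
  if (missing(j)) j <- seq_len(d[2])
  if (missing(k)) k <- seq_len(d[3])
  r <- lapply(list(i, j, k), resolve_index)
  res <- do.call(`[`, c(list(a), lapply(r, `[[`, "index"), list(drop = FALSE)))
  dropme <- vapply(r, `[[`, logical(1), "drop") & dim(res) == 1L
  kept <- dim(res)[!dropme]
  if (length(kept) == 0L) as.vector(res) else array(res, dim = kept)
}

x <- structure(array(1:24, dim = c(2, 3, 4)), class = "fslice")
dim(x[, 2, 3])    # 2 1 1
dim(x[, ~2, 3])   # 2 1
dim(x[, ~2, ~3])  # 2
```

The `drop = 2:3` alternative mentioned above would need no marker at all: the method would take an integer `drop` argument and remove exactly those dimensions from the result.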
rdporto1
2007-Mar-10 18:59 UTC
[R] Using large datasets: can I overload the subscript operator?
Maciej,

> I think it should be possible to make a package (call it "slice") that could
> be used like this:
> ...
> Has anyone ever made a similar extension, with virtual (lazy) arrays?

Take a look at the filehash package at http://cran.r-project.org/doc/Rnews/Rnews_2006-4.pdf

Regards,

Rogerio