Hi,

I asked this question 2 years ago, and would like to know if the answer has changed.

In S-Plus, I build databases of many large objects.  In any given analysis,
I only need a few of those objects, but attach'ing the whole database is fine
since objects are only read as needed.  How can I do the same thing in R,
without reading the entire database?

One possibility is to treat the database as a package, devoid of code but
containing many .RData files under /data, then load() each object I'll need.
Perhaps autoload() can be used to avoid having to anticipate which objects
I'll need?

Another is to use dput() and dget().  Again I need to know ahead of time which
objects I'll want.

On July 20, 1999, Ross Ihaka [mailto:ihaka at stat.auckland.ac.nz] wrote:
> We are building the infrastructure for adding external databases which can
> be attached in the S fashion.  One of the classes of external database will
> be that of S .Data directories.

I'm not sure where that ended up -- could you clarify, Ross?  Thanks!

-- David Brahm (a215020 at agate.fmr.com)
Hi,

Probably not what you wanted to hear, but....  It sounds like you need a
relational database management system -- assuming that these 'objects' are
data frames.  If that is the case, I would migrate to an RDBMS and establish
ODBC (or RPgSQL, RMySQL, etc.) connections from S-Plus or R.  You can then
select only the particular variables you need with SQL statements (a minimal
sketch is appended below), thereby avoiding the problem of reading in the
entire database.

The downside to this approach is that someone must be capable of managing the
DBMS and be available to help others.  If the databases are large and used by
many people, though, it might be worth the time and effort.

Jason

-- Jason Martinez
Sociology Graduate Student
University of California, Riverside
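To make the ODBC route concrete, here is a minimal sketch using the RODBC
package.  The data source name "mydb", the table "big_table", and the column
names are all made up for illustration:

library(RODBC)

## Connect to a (hypothetical) ODBC data source registered as "mydb"
ch <- odbcConnect("mydb")

## Pull only the variables needed for this analysis, not the whole table
dat <- sqlQuery(ch, "SELECT id, x1, x2 FROM big_table WHERE year = 2000")

## Close the connection when done
odbcClose(ch)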
Probably the suggestion by Jason ("use a relational database management
system") would be the best, although complicated.  As alternatives:

1. Save each object as a separate binary file.  Create another object
   (e.g. a 2-column matrix) that indexes each object to its file.  Attach the
   index object and, in your function, load only the required object's file.
   Unfortunately, this implies that you must know which part (file) of the
   whole database you need for a given operation (i.e., if you need individual
   3456 you must know in which file you have the data for it).  A short sketch
   of this idea is appended at the end of this message.

2. You can use delay().  I'm almost done with a short document on "Using R
   with large objects", for which I've got interesting input from the list.
   In particular, I got the following message from Ray Brownrigg
   (Ray.Brownrigg at mcs.vuw.ac.nz):

   "" A. To set up an object so that it is available at all times, but only
   loaded into memory when first referenced, consider the following:

   test.x <- delay({attach(system.file("data", "test.rda", package="test")); test.x})

   The object test.x has been created and saved as a .rda file using
   save(test.x, file="test.rda"), and the resulting file test.rda has been
   stored in the data directory of the (installed) package test.  Normally the
   command above will be executed as part of loading the package test, i.e.
   when library(test) is entered by the user at the R prompt.  Further,
   because the object test.x is part of package test, it is not saved as part
   of a new .RData when an R session is terminated (as long as nothing new is
   assigned to test.x during the session). ""

   You could set up this delayed attachment for all the objects of your
   database, and a given object would only actually be loaded if your
   processing really uses it.

3. As I point out in the document "Getting your stuff organized in R", I've
   not found any way to list the objects within a binary R file, nor to select
   particular objects from the binary file and attach only the selected ones
   (which would be the best solution in so many cases).  I wonder if future R
   versions could consider this feature.

Dr. Agustin Lobo
Instituto de Ciencias de la Tierra (CSIC)
Lluis Sole Sabaris s/n
08028 Barcelona SPAIN
tel 34 93409 5410
fax 34 93411 0012
alobo at ija.csic.es
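Here is a minimal sketch of the indexing idea in alternative 1.  The directory
"db", the index matrix obj.index, and the object names are all hypothetical:

## Build the index once, when the database is written out: each row maps an
## object name to the binary file that holds it.
obj.index <- matrix(c("x3456", "db/x3456.RData",
                      "x3457", "db/x3457.RData"),
                    ncol=2, byrow=TRUE,
                    dimnames=list(NULL, c("object", "file")))
save(obj.index, file="db/index.RData")

## Later, consult the index and load only the file holding the object needed:
load("db/index.RData")
needed <- "x3456"
load(obj.index[obj.index[, "object"] == needed, "file"])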
[snipped...]
> 3. As I point out in the document "Getting your stuff organized in R", I've
> not found any way to list the objects within a binary R file, nor to select
> particular objects from the binary file and attach only the selected ones
> (which would be the best solution in so many cases).  I wonder if future R
> versions could consider this feature.

I'd like to strongly second this feature request; I think it would be very
helpful for project management.  Also nice would be something similar to the
Unix "ls -l": giving the modes, lengths (or dims), etc. of the objects in a
given file or in the workspace.

Regards,
Andy
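In the meantime, a partial workaround is to load the binary file into a
scratch environment and inspect it there, instead of loading it into the
workspace.  This still reads the whole file, but at least shows what it
contains.  A sketch, using the hypothetical file name "mydata.RData":

## Load the binary file into its own environment, leaving the workspace alone
e <- new.env()
load("mydata.RData", envir=e)

## Poor man's "ls -l": name, class, dimensions and size of each object
ls(e)
sapply(ls(e), function(nm) class(get(nm, envir=e)))
lapply(ls(e), function(nm) dim(get(nm, envir=e)))
sapply(ls(e), function(nm) object.size(get(nm, envir=e)))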
I asked:
> In S-Plus, I build databases of many large objects.  In any given analysis,
> I only need a few of those objects, but attach'ing the whole database is fine
> since objects are only read as needed.  How can I do the same thing in R,
> without reading the entire database?

Responses fell into two general categories:
1) Use an external SQL database (fine for data frames, but not flexible enough);
2) Use autoload() or delay() (thanks to Agustin Lobo & Ray Brownrigg!).

Here's what I came up with from the second approach.

##### Code: #####
"%&%" <- function(a, b) paste(a, b, sep="")

## Save every object in position "pos" of the search path as its own .RData
## file under dir/data, and write an R source file of delay()ed data() calls
## so each object is only read from disk when first used.
g.save.data <- function(dir, pos=2)
{
  for (i in dir %&% c("", "/data", "/R")) if (!file.exists(i)) dir.create(i)
  obj  <- objects(pos, all.names=T)
  for (i in obj) save(list=i, file=dir %&% "/data/" %&% i %&% ".RData")
  code <- obj %&% " <- delay({data(\"" %&% obj %&% "\"); " %&% obj %&% "})"
  cat(code, file=dir %&% "/R/" %&% basename(dir), sep="\n")
}

## Attach the resulting directory as a package:
g.attach <- function(dir) library(basename(dir), lib.loc=dirname(dir), char=T)

##### Example: #####
attach(NULL, name="newdata")   # Create some data in a new environment
assign("x1", 1:10, 2)
assign("x2", 11:20, 2)
g.save.data("/tmp/newdata")    # Save that environment's contents to a pkg
detach(2)
g.attach("/tmp/newdata")       # Open the pkg and see the data!

-- David Brahm (a215020 at agate.fmr.com)
On 9/27 I asked:
> In S-Plus, I build databases of many large objects.  In any given analysis,
> I only need a few of those objects, but attach'ing the whole database is fine
> since objects are only read as needed.  How can I do the same thing in R,
> without reading the entire database?

Here are the latest versions of my functions to accomplish this.  Note that I
now use system-independent file paths (as suggested by Martin Maechler), I no
longer need to resolve name conflicts first (I eval() in the proper
environment), and I now store the loaded objects in the package (overwriting
the promise objects) rather than in the global environment.

##### Code: #####
# Save objects in position "pos" to a delayed-evaluation data package:
g.data.save <- function(dir, obj=z, pos=2)
{
  z <- objects(pos, all.names=T)
  if (is.character(pos)) pos <- match(pos, search())
  pkg <- basename(dir)
  for (i in file.path(dir, c("", "data", "R"))) if (!file.exists(i)) dir.create(i)
  for (i in obj) {
    file <- file.path(dir, "data", paste(i, "RData", sep="."))
    expr <- parse(text=paste("save(list=\"", i, "\", file=\"", file, "\")", sep=""))
    eval(expr, pos.to.env(pos))
  }
  code <- paste(z, " <- delay(g.data.load(\"", z, "\", \"", pkg, "\"))", sep="")
  cat(code, file=file.path(dir, "R", pkg), sep="\n")
}

# Routine used in data packages, e.g. x <- delay(g.data.load("x", "newdata")):
g.data.load <- function(i, pkg)
{
  load(system.file("data", paste(i, "RData", sep="."), package=pkg),
       pos.to.env(match(paste("package", pkg, sep=":"), search())))
  get(i)
}

# Attach a delayed-evaluation data package:
g.data.attach <- function(dir) library(basename(dir), lib.loc=dirname(dir), char=T)

# Get data from an unattached package (like get(item, dir) in S-Plus):
g.data.get <- function(item, dir)
{
  env <- new.env()
  load(file.path(dir, "data", paste(item, "RData", sep=".")), env)
  get(item, envir=env)
}

##### Example: #####
attach(NULL, name="newdata")
assign("x1", matrix(1, 1000, 1000), 2)
assign("x2", matrix(2, 1000, 1000), 2)
g.data.save("/tmp/newdata")
detach(2)

g.data.attach("/tmp/newdata")
objects(2)                            # These are promise objects
system.time(print(dim(x1)))           # Takes time to load up
system.time(print(dim(x1)))           # Second time is faster!
objects(2)                            # Now x1 is a real object
find("x1")                            # It's in package:newdata

detach(2)
unlink("/tmp/newdata", recursive=T)   # Clean up
#####################

-- David Brahm (a215020 at agate.fmr.com)
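For completeness, a usage sketch of g.data.get(), run against the same
/tmp/newdata package from the example above (assuming it is run before the
final unlink(), while the package directory still exists):

## Pull a single object straight out of the package directory,
## without attaching the package at all:
x2.copy <- g.data.get("x2", "/tmp/newdata")
dim(x2.copy)   # 1000 x 1000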