Emmanuel Charpentier
2001-Nov-27 15:11 UTC
[R] R/S and large datasets - Database access (also Re: SAS and S/R)
A consensus seems to be emerging: R excels at exploratory work on small and medium-sized datasets, while SAS can munch through much larger ones.

However, I see the "size" problem as a red herring. The objects that have to stay "in core" are usually much smaller than the dataset. For example, for problems involving fixed-effects linear models, you need only some matrices whose size is proportional to the square of the number of *variables*, plus the (admittedly large) vector of residuals (whose size is equal to the number of observations). Other cases (nonlinear mixed-effects models come to mind) are not as easily tamed, since any iterative process (such as ML estimation) has to go back to the original data; but at least the time penalty involved in using such an interface pays off by letting you treat problems that would otherwise be intractable.

I am aware of at least one database access package that gives access to data without dragging a whole table into memory: the RPgSQL package offers what it calls a "proxy variable", an object that behaves, for all practical purposes, as a data frame, but is in fact an interface to database tables. I see this kind of interface as a way to avoid overloading core memory with rarely used data.

Unfortunately, that package is now officially orphaned by its developer, who says he is now focusing on the next database access standard: the Rdbi interface, which is currently under development, and which I don't know a thing about.

So the question is: does the Rdbi interface offer such a proxy to data still residing in databases?

Or am I barking up the wrong tree and trying to (re-)invent an oversophisticated virtual memory manager? Should a sufficiently large swapfile be enough for these "large dataset" problems?
-- Emmanuel Charpentier -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
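The memory argument above for fixed-effects linear models can be illustrated with a minimal sketch. It is written in Python rather than R, uses the standard-library sqlite3 module as a stand-in for a real DBMS, and every name in it (the `obs` table, `accumulate_normal_equations`, `fit_from_database`, and so on) is hypothetical. It accumulates X'X and X'y chunk by chunk, so only matrices of size proportional to the square of the number of variables stay in memory. Solving the normal equations this way is generally less numerically stable than a QR decomposition, and the residuals are not kept; computing them would need a second pass over the data.

```python
import sqlite3


def accumulate_normal_equations(cursor, n_predictors, chunk_size=1000):
    """Stream rows (x1, ..., xp, y) and accumulate X'X and X'y.

    An intercept column of ones is added, so the matrices have
    dimension p = n_predictors + 1 regardless of the number of rows.
    """
    p = n_predictors + 1
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    while True:
        rows = cursor.fetchmany(chunk_size)  # only one chunk in memory
        if not rows:
            break
        for row in rows:
            x = [1.0] + [float(v) for v in row[:n_predictors]]
            y = float(row[n_predictors])
            for i in range(p):
                xty[i] += x[i] * y
                for j in range(p):
                    xtx[i][j] += x[i] * x[j]
    return xtx, xty


def solve_linear_system(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta


def fit_from_database(conn, query, n_predictors, chunk_size=1000):
    """Least-squares coefficients (intercept first) for the rows of `query`."""
    cur = conn.cursor()
    cur.execute(query)
    xtx, xty = accumulate_normal_equations(cur, n_predictors, chunk_size)
    return solve_linear_system(xtx, xty)


def demo_chunked_fit():
    """Fit y = 1 + 2*x1 - 3*x2 on synthetic, noise-free data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obs (x1 REAL, x2 REAL, y REAL)")
    rows = [(i % 7, i % 5, 1.0 + 2.0 * (i % 7) - 3.0 * (i % 5))
            for i in range(500)]
    conn.executemany("INSERT INTO obs VALUES (?, ?, ?)", rows)
    beta = fit_from_database(conn, "SELECT x1, x2, y FROM obs", 2,
                             chunk_size=64)
    conn.close()
    return beta
```

Calling `demo_chunked_fit()` should recover coefficients close to 1, 2 and -3, because the synthetic data are generated without noise; the chunk size changes only how many rows are held at once, not the result.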
David James
2001-Nov-27 18:05 UTC
[R] R/S and large datasets - Database access (also Re: SAS and S/R)
The Rdbi (or perhaps simply DBI, for database interface, since it is meant to cover both R and Splus) is a simple interface to any database management system, or DBMS (so far only *relational* databases have been considered). It is very similar in spirit to Java's Database Connectivity (JDBC), Perl's Database Independent (DBI) interface, and Python's Database API. It deals primarily with a common set of functions for interfacing R and Splus to databases (PostgreSQL, Oracle, Access, MySQL, mSQL, etc.).

But we should think of this DBI only as a first step, or as the infrastructure on which we can build more sophisticated tools. The proxy table/variable is a good example of such a tool. But if it's good for PostgreSQL tables, why not for Microsoft SQL tables? Or MySQL tables? By having a common interface, we hope to be able to build these sorts of advanced tools independently of the underlying DBMS.

Other applications may include the ability to attach() any database to the search() path (together with the idea of proxy objects, this could be helpful in some cases), and the possibility of doing a "database apply", where we apply R functions to chunks of remote tables. (Roger Koenker and his colleague have an LM example, see http://www.econ.uiuc.edu/~roger/research/rq/LM.html). There has also been some interest in approximating quantiles, fitting GLMs, etc., on very large datasets, but techniques like these will most likely require new algorithms that work sequentially. And of course, some have also pointed out (Brian Ripley, among others) that sampling has been used quite successfully before by statisticians :-) and thus could be quite useful in some of these cases.

I'm not aware of any tools available yet to do this on remote DBMSes, but one would hope that if such a tool were developed, it would be built on top of the DBI so that it could be used with any DBMS. Obviously, there's a lot to be done...
Regards,

-- 
David A. James
Statistics Research, Room 2C-253
Bell Labs, Lucent Technologies
Murray Hill, NJ 09794-0636
Phone: (908) 582-3082
Fax: (908) 582-3340
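The "database apply" idea mentioned above can be sketched roughly as follows. This is only an illustration in Python, built on the Python Database API that the message cites as a model for the DBI; it is not the DBI proposal itself. The function name `db_apply`, the `measurements` table and the demo helper are all made up for the example, and sqlite3 stands in for a remote DBMS. The only driver feature relied on is the cursor's `fetchmany()` method.

```python
import sqlite3


def db_apply(conn, query, func, chunk_size=1000, params=()):
    """Apply `func` to successive chunks of a query result.

    `conn` is assumed to be a Python DB-API connection. Only
    `chunk_size` rows are fetched at a time; the list of per-chunk
    results is returned.
    """
    cur = conn.cursor()
    cur.execute(query, params)
    results = []
    while True:
        chunk = cur.fetchmany(chunk_size)
        if not chunk:
            break
        results.append(func(chunk))
    cur.close()
    return results


def demo_db_apply():
    """Compute a mean from per-chunk (sum, count) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (value REAL)")
    conn.executemany("INSERT INTO measurements VALUES (?)",
                     [(float(i),) for i in range(1, 101)])
    partials = db_apply(
        conn,
        "SELECT value FROM measurements",
        lambda chunk: (sum(row[0] for row in chunk), len(chunk)),
        chunk_size=30,
    )
    conn.close()
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return len(partials), total / count
```

The mean works here because it decomposes into per-chunk sums and counts. Statistics that do not decompose this way (quantiles, for instance) are the ones that would need the new sequential algorithms the message refers to.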
Timothy H. Keitt
2001-Nov-28 18:27 UTC
[R] R/S and large datasets - Database access (also Re: SAS and S/R)
The problem with proxy data frames is that you can't pass them to functions like 'lm' (at least when I tried it long ago), because the functions that make the proxy object look like a data frame exist only at the R level. When you drop down to internal C code, you call a different set of (non-overloadable) functions, so the proxy just appears as a scalar object. Duncan's news about the generic "attach" interface may soon make this possible, however.

Actually, having learned some SQL, I now find it indispensable. As you say, you generally work with only a small subset of your data, and SQL queries are the best way I've found to do the subsetting.

Also, there has been some recent discussion of a proposed generic DBI interface for R/S. Rdbi was my attempt (actually what I originally set out to do with RPgSQL, but some necessary internal functions were not yet documented or in some cases not yet implemented). We more or less settled on David James' proposal, but I do not know if anyone is actually implementing it. It would be nice to have a reference implementation so we can try it out and see what we do or don't like. I hope to see all of this resolved soon, as I have less and less time to put into it and my interests are moving elsewhere (e.g., more GIS capabilities).

T.

-- 
Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/
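To make the proxy idea and the SQL subsetting concrete, here is a small sketch in Python. It is not how RPgSQL works internally; the class `ProxyTable`, its `subset` method and the `trees` table are hypothetical names invented for the illustration, with sqlite3 standing in for PostgreSQL. The object holds only a connection and a table name, and data are fetched when a column or a subset is requested. As with the R-level proxy described above, it only helps code that goes through its accessors; anything that bypasses them and expects the data to already be in memory would not see a table at all.

```python
import sqlite3


def _quote_identifier(name):
    """Quote a table or column name for use inside SQL text."""
    return '"' + name.replace('"', '""') + '"'


class ProxyTable:
    """Hypothetical data-frame-like view of a database table."""

    def __init__(self, conn, table):
        self._conn = conn
        self._table = _quote_identifier(table)
        # Fetch no rows, only the column names.
        cur = conn.execute(f"SELECT * FROM {self._table} LIMIT 0")
        self.columns = [d[0] for d in cur.description]

    def __len__(self):
        cur = self._conn.execute(f"SELECT COUNT(*) FROM {self._table}")
        return cur.fetchone()[0]

    def __getitem__(self, column):
        """Fetch a single column on demand, e.g. trees["height"]."""
        if column not in self.columns:
            raise KeyError(column)
        cur = self._conn.execute(
            f"SELECT {_quote_identifier(column)} FROM {self._table}")
        return [row[0] for row in cur]

    def subset(self, where, params=()):
        """Let the database do the subsetting.

        `where` is trusted SQL text written by the analyst; values
        should be passed through `params`, not pasted into `where`.
        """
        cur = self._conn.execute(
            f"SELECT * FROM {self._table} WHERE {where}", params)
        return cur.fetchall()


def demo_proxy():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trees (species TEXT, height REAL)")
    conn.executemany("INSERT INTO trees VALUES (?, ?)",
                     [("oak", 20.0), ("pine", 30.0), ("oak", 25.0)])
    trees = ProxyTable(conn, "trees")
    result = (trees.columns, len(trees), trees["height"],
              trees.subset("species = ?", ("oak",)))
    conn.close()
    return result
```

In `demo_proxy()`, only the rows matching `species = ?` ever leave the database for the subset, which is the practical benefit of doing the subsetting in SQL.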