thr3ads.net - R help - [R] R/S and large datasets - Database access (also Re: SAS and S/R) [Nov 2001]

Emmanuel Charpentier

2001-Nov-27 15:11 UTC

[R] R/S and large datasets - Database access (also Re: SAS and S/R)

A consensus seems to be emerging: R excels at exploratory work on
small to medium-sized datasets, while SAS can munch much larger
datasets.

However, I see the "size" problem as a red herring. The objects that
have to stay "in core" are usually much smaller than the dataset. For
example, for problems involving fixed-effects linear models, you need
only some matrices whose size is proportional to the square of the
number of *variables* and the (admittedly large) vector of residuals
(whose size is equal to the number of observations). Other cases
(nonlinear mixed-effects models come to mind) are not as easily tamed,
since any iterative process (such as ML estimation) has to go back to
the original data. But at least the time penalty incurred by going
through a database interface pays off by letting you treat problems
that would otherwise be intractable.
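
A minimal sketch of the fixed-effects case, under stated assumptions:
Python's standard sqlite3 module stands in for a database interface,
and the table `obs`, its columns, and the helper names `solve` and
`streaming_ols` are hypothetical. Only the small cross-product matrices
X'X and X'y stay in memory while the rows are read in chunks; residuals
are not computed here and would need a second pass over the table.

```python
import sqlite3


def solve(a, b):
    """Solve the small system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def streaming_ols(cursor, n_predictors, chunk_size=1000):
    """Fit y ~ 1 + x1 + ... + xp from rows (x1, ..., xp, y) read in chunks."""
    p = n_predictors + 1  # one extra column for the intercept
    xtx = [[0.0] * p for _ in range(p)]  # p x p, independent of row count
    xty = [0.0] * p
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        for row in rows:
            x = [1.0] + [float(v) for v in row[:-1]]
            y = float(row[-1])
            for i in range(p):
                xty[i] += x[i] * y
                for j in range(p):
                    xtx[i][j] += x[i] * x[j]
    return solve(xtx, xty)


# Illustrative data: y = 2 + 3*x1 - 1.5*x2, with no noise.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x1 REAL, x2 REAL, y REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?, ?)",
    [(i % 7, i % 11, 2.0 + 3.0 * (i % 7) - 1.5 * (i % 11)) for i in range(5000)],
)
coef = streaming_ols(
    conn.execute("SELECT x1, x2, y FROM obs"), n_predictors=2, chunk_size=500
)
print(coef)  # intercept first, then one slope per predictor
```

Solving the normal equations directly is the simplest way to show the
memory argument; a numerically sturdier approach (for instance an
incrementally updated QR decomposition) would follow the same chunked
pattern.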

I am aware of at least one database access package that lets you access
data without dragging a whole table into memory: the RPgSQL package
offers what it calls a "proxy variable", an object that behaves, for
all practical purposes, as a data frame, but is actually an interface
to database tables. I see this kind of interface as a way to avoid
overloading core memory with rarely used data.
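
To make the proxy idea concrete, here is a rough sketch in Python. It
is not RPgSQL's implementation: the class `TableProxy`, the table
`patients`, and its columns are all hypothetical, and sqlite3 is used
only because it ships with Python. The object holds a connection and a
table name, and each access is translated into a query, so no rows are
held in memory until they are asked for.

```python
import sqlite3


class TableProxy:
    """Hypothetical data-frame-like view of a table that stays in the database."""

    def __init__(self, conn, table):
        self.conn = conn
        self.table = table
        # Read the column names only; LIMIT 0 transfers no rows.
        cur = conn.execute(f'SELECT * FROM "{table}" LIMIT 0')
        self.columns = [d[0] for d in cur.description]

    def __len__(self):
        cur = self.conn.execute(f'SELECT COUNT(*) FROM "{self.table}"')
        return cur.fetchone()[0]

    def __getitem__(self, key):
        if isinstance(key, str):
            # Column access: values are streamed lazily from the cursor.
            if key not in self.columns:
                raise KeyError(key)
            cur = self.conn.execute(f'SELECT "{key}" FROM "{self.table}"')
            return (row[0] for row in cur)
        if isinstance(key, slice):
            # Row access: only the requested window is fetched.
            start, stop, step = key.indices(len(self))
            if step != 1:
                raise ValueError("only contiguous slices are supported")
            cur = self.conn.execute(
                f'SELECT * FROM "{self.table}" ORDER BY rowid LIMIT ? OFFSET ?',
                (max(stop - start, 0), start),
            )
            return cur.fetchall()
        raise TypeError("index with a column name or a slice")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, weight REAL)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [(i, 50.0 + i % 40) for i in range(1000)],
)

proxy = TableProxy(conn, "patients")
n_rows = len(proxy)                  # answered by COUNT(*), no rows fetched
first_rows = proxy[0:3]              # fetches three rows only
mean_weight = sum(proxy["weight"]) / n_rows
print(proxy.columns, n_rows, first_rows, mean_weight)
```

The `ORDER BY rowid` clause is SQLite-specific; another DBMS would need
an explicit ordering column for row windows to be reproducible.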

Unfortunately, the package has now been officially orphaned by its
developer, who states that he is now focusing on the next database
access standard: the Rdbi interface, which is currently under
development and which I know nothing about.

So the question is: does the Rdbi interface offer such a proxy to data
still residing in databases?

Or am I barking up the wrong tree and trying to (re-)invent an
oversophisticated virtual memory manager? Should a sufficiently large
swap file be enough for these "large dataset" problems?

--
                                        Emmanuel Charpentier


-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

David James

2001-Nov-27 18:05 UTC

[R] R/S and large datasets - Database access (also Re: SAS and S/R)

The Rdbi (or perhaps simply DBI, for database interface, since it is
meant to cover both R and Splus) is a simple interface to any database
management system, or DBMS (so far only *relational* databases have been
considered). It is very similar in spirit to Java's Database
Connectivity (JDBC), Perl's Database Independent interface (DBI), and
Python's Database API.  It deals primarily with a common set of
functions for interfacing R and Splus with databases (PostgreSQL,
Oracle, Access, MySQL, mSQL, etc.).  But we should think of this DBI
only as a first step, the infrastructure on which we can build more
sophisticated tools.  The proxy table/variable is a good example of
such a tool.  But if it's good for PostgreSQL tables, why not for
Microsoft SQL tables? Or MySQL tables?  By having a common interface,
we hope to be able to build this sort of advanced tool independently of
the underlying DBMS.

Other applications may include the ability to attach() any database
to the search() path (together with the idea of proxy objects,
it could be helpful in some cases), and also the possibility of doing a
"database apply", where we apply R functions to chunks of remote
tables.  (Roger Koenker and his colleague have an LM example, see
http://www.econ.uiuc.edu/~roger/research/rq/LM.html).  There has also
been some interest in approximating quantiles, fitting GLMs, etc., on
very large datasets, but techniques like these will most likely require
new algorithms that work sequentially.
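
One way to picture a "database apply" is the sketch below, written in
Python with the standard sqlite3 module standing in for a DBI-style
connection. The function `db_apply`, the table `measurements`, and the
helpers `summarise` and `merge` are hypothetical names, not part of any
proposed interface. A function is applied to each chunk of rows and the
partial results are merged, so only one chunk is in memory at a time.

```python
import sqlite3


def db_apply(conn, query, func, combine, chunk_size=1000):
    """Apply func to successive chunks of a query result and merge the parts."""
    cur = conn.execute(query)
    result = None
    while True:
        rows = cur.fetchmany(chunk_size)
        if not rows:
            break
        part = func(rows)
        result = part if result is None else combine(result, part)
    return result


def summarise(rows):
    """Per-chunk summary: (number of values, sum of values)."""
    values = [r[0] for r in rows]
    return (len(values), sum(values))


def merge(a, b):
    """Combine two chunk summaries into one."""
    return (a[0] + b[0], a[1] + b[1])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?)", [(float(i),) for i in range(10000)]
)

n, total = db_apply(conn, "SELECT value FROM measurements", summarise, merge)
mean = total / n
print(n, mean)
```

This works directly for statistics that decompose into per-chunk
summaries (counts, sums, cross-products). Quantiles and GLMs do not
decompose this way, which is why they call for sequential algorithms.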

And of course, some have also pointed out (Brian Ripley, among others)
that sampling has been used quite successfully before by statisticians :-)
and thus could be quite useful in some of these cases.  I'm not aware
of any tools available yet to do this on remote DBMSes, but one would
hope that if such a tool were to be developed, it would be done on top
of the DBI so that it could be used with any DBMS.
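
As a sketch of what DBMS-independent sampling could look like, the
Python code below draws a uniform random sample in a single pass over a
query result using reservoir sampling (Algorithm R). It relies only on
a cursor that can fetch rows in chunks, not on any vendor-specific SQL.
The function name `reservoir_sample` and the table `events` are
hypothetical, and sqlite3 is used only as a convenient stand-in.

```python
import random
import sqlite3


def reservoir_sample(cursor, k, rng=None, chunk_size=1000):
    """Return k rows drawn uniformly without replacement from a cursor."""
    rng = rng or random.Random()
    sample = []
    seen = 0
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        for row in rows:
            seen += 1
            if len(sample) < k:
                sample.append(row)  # fill the reservoir first
            else:
                # Keep the new row with probability k / seen.
                j = rng.randrange(seen)
                if j < k:
                    sample[j] = row
    return sample


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(i,) for i in range(10000)])

sample = reservoir_sample(
    conn.execute("SELECT id FROM events"), k=100, rng=random.Random(42)
)
print(len(sample))
```

The whole table still has to be scanned once, but memory use is bounded
by the sample size. If the query returns fewer than k rows, all of them
are returned.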

Obviously, there's a lot to be done...

Regards,

-- 
David A. James
Statistics Research, Room 2C-253            Phone:  (908) 582-3082       
Bell Labs, Lucent Technologies              Fax:    (908) 582-3340
Murray Hill, NJ 09794-0636

Timothy H. Keitt

2001-Nov-28 18:27 UTC

[R] R/S and large datasets - Database access (also Re: SAS and S/R)

The problem with proxy data frames is that you can't pass them to
functions like 'lm' (at least when I tried it long ago), because the
functions that make the proxy object look like a data frame only exist
at the R level. When you drop down to internal C code, a different set
of (non-overloadable) functions is called, so the proxy just appears as
a scalar object. Duncan's news about the generic "attach" interface may
soon make this possible, however. Actually, having learned some SQL, I
now find it indispensable. As you say, you generally work with only a
small subset of your data, and SQL queries are the best way I've found
to do the subsetting.
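
A small illustration of subsetting in the database rather than in
memory, sketched in Python with sqlite3. The table `plots`, its
columns, and the function `fetch_subset` are made up for the example.
Only the rows and columns named in the query are transferred; the rest
of the table never leaves the DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plots (site TEXT, year INTEGER, biomass REAL)")
conn.executemany(
    "INSERT INTO plots VALUES (?, ?, ?)",
    [
        (site, year, 10.0 * k + (year - 1990))
        for k, site in enumerate(["A", "B", "C"])
        for year in range(1990, 2002)
    ],
)


def fetch_subset(conn, site, first_year):
    """Return (year, biomass) for one site from first_year onwards."""
    query = (
        "SELECT year, biomass FROM plots "
        "WHERE site = ? AND year >= ? ORDER BY year"
    )
    # Placeholders keep the values out of the SQL text itself.
    return conn.execute(query, (site, first_year)).fetchall()


subset = fetch_subset(conn, "B", 1999)
print(subset)
```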

Also, there has been some recent discussion of a proposed generic DBI
interface for R/S. Rdbi was my attempt at one (it is actually what I
originally set out to do with RPgSQL, but some necessary internal
functions were not yet documented or, in some cases, not yet
implemented). We more or less settled on David James' proposal, but I
do not know if anyone is actually implementing it. It would be nice to
have a reference implementation so we can try it out and see what we do
or don't like. I hope to see all of this resolved soon, as I have less
and less time to put into it and my interests are moving elsewhere
(e.g., more GIS capabilities).

T.

-- 
Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/


