Thanks for doing this, Thomas. I have been thinking about what it would
take to do this, but if it had been left to me, it would have taken a lot
longer.
Back in the '80s there was a statistical package called RUMMAGE that did
all of its computations from sufficient statistics and did not keep the
actual data in memory. Memory for computers became cheap before
datasets turned huge, so there wasn't much demand for the program (and it
never had a nice GUI to help make it popular). It looks like things are
switching back to that model now, though.
Here are a couple of thoughts I had that may help with some future
development:
Another function that could be helpful is bigplot, which I imagine would
best be based on the hexbin package, accumulating the hexagon counts in
chunks the way your biglm function accumulates the regression (a rough
sketch of the chunked counting is below). Once I see the code for biglm I
may be able to contribute this piece. I expect bigbarplot and bigboxplot
would also be useful. Accumulating counts for the barplot will be easy,
but does anyone have ideas on an efficient way to get quantiles for the
boxplots? The best approach I can think of so far is to have the database
sort the variables, but sorting tends to be slow.
Another general approach I thought of would be to read the data in
chunks, compute the statistic(s) of interest on each chunk (e.g. the
vector of coefficients for a regression model), and then average the
estimates across chunks. Each chunk could be treated as a cluster in a
cluster sample for the averaging and for estimating the variances of the
estimates (if only we could get the author of the survey package involved
:-). This would probably be less accurate than your biglm function for
regression, but it would have the flavor of the bootstrapping routines in
that it would work for many cases that don't have their own big methods
written yet (logistic and other glm models, correlations, ...).
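As a very rough illustration of that chunk-and-average idea (the function
name here is made up, and this is no substitute for proper survey
machinery), something along these lines would fit, say, a logistic
regression on each chunk and use the between-chunk spread of the
coefficients for a variance estimate, treating the chunks like clusters:

  ## Hypothetical sketch: fit the model on each chunk, average the
  ## coefficient vectors, and estimate variance from between-chunk spread.
  chunk_average_glm <- function(chunks, formula, family = binomial()) {
    fits <- lapply(chunks, function(chunk)
      coef(glm(formula, family = family, data = chunk)))
    B <- do.call(rbind, fits)        # one row of coefficients per chunk
    k <- nrow(B)
    est <- colMeans(B)               # averaged estimate across chunks
    vcv <- var(B) / k                # variance of the mean of k estimates
    list(coefficients = est, vcov = vcv,
         se = sqrt(diag(vcv)), nchunks = k)
  }

Whether simple averaging gets close enough to the full-data fit would of
course depend on the chunk sizes and the model.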
Any other thoughts anyone?
--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111
-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Thomas Lumley
Sent: Tuesday, May 16, 2006 3:40 PM
To: roger koenker
Cc: r-help list; Robert Citek
Subject: Re: [R] Re : Large database help
On Tue, 16 May 2006, roger koenker wrote:
> In ancient times, 1999 or so, Alvaro Novo and I experimented with an
> interface to mysql that brought chunks of data into R and accumulated
> results.
> This is still described and available on the web in its original form
> at
>
> http://www.econ.uiuc.edu/~roger/research/rq/LM.html
>
> Despite claims of "future developments" nothing emerged, so anyone
> considering further explorations with it may need training in
> Rchaeology.
A few hours ago I submitted to CRAN a package "biglm" that fits large
linear regression models using a similar strategy (it uses an incremental
QR decomposition rather than accumulating the crossproduct matrix). It
also computes the Huber/White sandwich variance estimate in the same
single pass over the data.
Assuming I haven't messed up the package checking, it will appear on CRAN
in the next couple of days. The syntax looks like
a <- biglm(log(Volume) ~ log(Girth) + log(Height), chunk1)
a <- update(a, chunk2)
a <- update(a, chunk3)
summary(a)
where chunk1, chunk2, chunk3 are chunks of the data.
-thomas