Hi All: I'm new to R and have a few questions about getting R to run efficiently with large datasets.

I'm running R on Windows XP with 1Gb RAM (so about 600-700Mb after the usual Windows overhead). I have a dataset with 4 million observations and about 20 variables. I want to run probit regressions on this data, but can't do this with more than about 500,000 observations before I start running out of RAM (you could argue that I'm getting sufficient precision with <500,000 obs, but let's pretend otherwise). Loading 500,000 observations into a data frame only takes about 100Mb of RAM, so that isn't the problem. Instead it seems R uses a huge amount of memory when running the glm methods. I called the Fortran routines that lm and glm use directly, but even they create a large number of extraneous variables in the output (e.g. the Xs, ys, residuals etc.) and during processing. For instance (sample code):

x = runif(1000000)
y = 3*x + rnorm(1000000)  # I notice this step chews up a lot more than the
                          # 7Mb of RAM required to store y during processing,
                          # but cleans up OK afterwards with a gc() call
X = cbind(x)
p = ncol(X)
n = NROW(y)
ny = NCOL(y)
tol = 1e-7
# this is the Fortran routine called by lm - regressing y on X here
z <- .Fortran("dqrls", qr = X, n = n, p = p, y = y, ny = ny,
              tol = as.double(tol), coefficients = mat.or.vec(p, ny),
              residuals = y, effects = y, rank = integer(1), pivot = 1:p,
              qraux = double(p), work = double(2 * p), PACKAGE = "base")

This code runs very quickly - suggesting that in principle R should have no problem at all handling very large data sets - but it uses >100Mb during processing and z is about a 20Mb object. Scaling this up to a much larger dataset with many variables, it's easy to see I'm going to run into problems.

My questions:
1. Are there any memory-efficient alternatives to lm/glm in R?
2. Is there any way to prevent the Fortran routine "dqrls" from producing so much output? (I suspect not, since its output has to be compatible with the summary method, which seems to rely on having a copy of all variables instead of just references to the relevant variables - correct me if I'm wrong on this.)
3. Failing 1 & 2, how easy would it be to create new versions of lm and glm that don't use so much memory? (Not that I'm volunteering or anything ;) ). There is no need to hold individual residuals in memory or make copies of the variables (at least for my purposes). How well documented is the source code?

cheers
Damien Moore
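A quick illustration of why the least-squares part, at least, need not use much memory: the cross-products X'X and X'y can be accumulated one chunk of rows at a time, so the full design matrix never has to be held at once. A minimal sketch using the x and y above (the chunk size is arbitrary, and the normal equations are less numerically stable than the QR factorization that dqrls performs):

# accumulate X'X and X'y over chunks of rows, then solve once at the end
chunk.size <- 100000
xtx <- matrix(0, 2, 2)                      # intercept + x
xty <- numeric(2)
for (start in seq(1, length(y), by = chunk.size)) {
  idx <- start:min(start + chunk.size - 1, length(y))
  Xc  <- cbind(1, x[idx])                   # design-matrix chunk
  xtx <- xtx + crossprod(Xc)
  xty <- xty + drop(crossprod(Xc, y[idx]))
  rm(Xc)
}
beta <- solve(xtx, xty)                     # agrees with coef(lm(y ~ x)) up to rounding

Accumulating sufficient statistics chunk by chunk is essentially what the biglm package mentioned below does, except that it maintains an incremental QR decomposition rather than the raw cross-products.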
For very large regression problems there is the biglm package (put your data into a database, read in 500,000 rows at a time, and keep updating the fit).

This has not been extended to glm yet.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111
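A rough sketch of the chunked workflow Greg describes, reading from a flat file rather than a database; the file name "big.csv", the column names, and the chunk size are all made up, and the pattern is simply to fit biglm() on the first chunk and fold in each later chunk with update():

library(biglm)

cols <- c("y", "x1", "x2")
con  <- file("big.csv", open = "r")      # hypothetical file with a header row
invisible(readLines(con, n = 1))         # skip the header line

chunk <- read.csv(con, nrows = 500000, header = FALSE, col.names = cols)
fit   <- biglm(y ~ x1 + x2, data = chunk)        # fit on the first chunk

repeat {
  chunk <- try(read.csv(con, nrows = 500000, header = FALSE, col.names = cols),
               silent = TRUE)                    # the open connection resumes where it left off
  if (inherits(chunk, "try-error") || nrow(chunk) == 0) break
  fit <- update(fit, chunk)                      # fold the next chunk into the fit
}
close(con)
summary(fit)

The working memory is then roughly one chunk plus the (small) biglm object, regardless of how many rows the file has.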
Greg Snow wrote:

> For very large regression problems there is the biglm package (put your
> data into a database, read in 500,000 rows at a time, and keep updating
> the fit).

Thanks. I took a look at biglm and it seems pretty easy to use and, looking at the source, it avoids much of the redundancy of lm. Correct me if I'm wrong, but I think it would be virtually impossible to extend to glm, because of the non-linearity in glm models.

I might hack around at the source code for glm.fit -- I think I can avoid some of the redundancy involved in that routine pretty easily, but it will mean rewriting the summary output code...

cheers
Damien
On Mon, 21 Aug 2006, Damien Moore wrote:

> Thanks. I took a look at biglm and it seems pretty easy to use and,
> looking at the source, it avoids much of the redundancy of lm. Correct me
> if I'm wrong, but I think it would be virtually impossible to extend to
> glm, because of the non-linearity in glm models.
>
> I might hack around at the source code for glm.fit -- I think I can
> avoid some of the redundancy involved in that routine pretty easily, but
> it will mean rewriting the summary output code...

Damien,

If you know what is 'under the hood' of glm, you can use the biglm approach to perform a one-step update of the coefficients of a glm model. There is plenty of theory for one-step estimators that use consistent estimates as starting values. You can probably get a good starting value by averaging all of the results returned by slicing the data set into smaller pieces and running glm.fit on each of them.

Chuck
Charles C. Berry                        (858) 534-2098
                                        Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu           UC San Diego
http://biostat.ucsd.edu/~cberry/        La Jolla, San Diego 92093-0717
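A minimal sketch of the starting-value half of Chuck's suggestion (not his code), using the probit family from the original question; the design matrix X with an intercept column, the 0/1 response y, and the choice of 8 slices are all assumptions:

fam    <- binomial(link = "probit")
pieces <- split(seq_len(nrow(X)), cut(seq_len(nrow(X)), 8, labels = FALSE))

# average the coefficients from glm.fit run on each slice separately
beta.start <- rowMeans(sapply(pieces, function(i)
  glm.fit(X[i, , drop = FALSE], y[i], family = fam)$coefficients))

# one Fisher-scoring (weighted least squares) pass over the full data,
# started from beta.start, then gives the one-step estimator

Each slice only needs to be small enough for glm.fit to handle comfortably in memory.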
Thomas Lumley wrote:

> No, it is quite straightforward if you are willing to make multiple passes
> through the data. It is hard with a single pass and may not be possible
> unless the data are in random order.
>
> Fisher scoring for glms is just an iterative weighted least squares
> calculation using a set of 'working' weights and 'working' response. These
> can be defined chunk by chunk and fed to biglm. Three iterations should
> be sufficient.

(NB: although not stated clearly, I was referring to a single pass when I wrote "impossible".) Doing as you suggest with multiple passes would entail either sticking the database input calls into the main iterative loop of a lookalike glm.fit, or lumping the user with a very unattractive sequence of calls:

big_glm.init
iterate:
    load_data_chunk
    big_glm.newiter
    iterate:  # could use a subset of the chunks on the first few go-rounds
        load_data_chunk
        update.big_glm
    big_glm.check_convergence  # would also need to do coefficient adjustments if convergence is failing

Because most (if not all) of my data can fit into memory anyway, I propose simply doing the calculations in a modified glm.fit in chunks (i.e. by subsetting the X and y data matrices within the loops) with a user-defined chunk length. I can always add database input calls later to handle exceptionally large datasets. If one of you has a better suggestion, I'm willing to hear it.

So far, I have hacked out a lot of the (in my view) extraneous stuff from glm and halved its memory usage. I can now run a 12-variable, 1-million-observation data set using "only" 200Mb of working memory (excluding the memory required to store the data). Previously glm.fit was using 500Mb to do the same. Getting convergence took 9 iterations either way. To reiterate: the inefficiency is in calculating the estimates, not in storing the data.
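A minimal sketch of what such a chunked glm.fit lookalike might compute, again for the probit case; X (a design matrix including an intercept column), y, the chunk size, the zero starting value and the convergence rule are all assumptions, and nothing per-observation (residuals, fitted values, model frame) is retained:

chunked.glm <- function(X, y, family = binomial("probit"),
                        chunk.size = 100000, maxit = 25, tol = 1e-8) {
  n <- nrow(X); p <- ncol(X)
  chunks <- split(seq_len(n), ceiling(seq_len(n) / chunk.size))
  beta <- rep(0, p); dev.old <- Inf
  for (it in seq_len(maxit)) {
    XtWX <- matrix(0, p, p); XtWz <- numeric(p); dev <- 0
    for (i in chunks) {
      Xi  <- X[i, , drop = FALSE]
      eta <- drop(Xi %*% beta)
      mu  <- family$linkinv(eta)
      m.e <- family$mu.eta(eta)
      w   <- m.e^2 / family$variance(mu)        # working weights
      z   <- eta + (y[i] - mu) / m.e            # working response
      XtWX <- XtWX + crossprod(Xi * w, Xi)      # accumulate X'WX
      XtWz <- XtWz + crossprod(Xi * w, z)       # accumulate X'Wz
      dev  <- dev + sum(family$dev.resids(y[i], mu, rep(1, length(i))))
    }
    beta <- drop(solve(XtWX, XtWz))             # Fisher scoring update
    if (abs(dev - dev.old) < tol * (abs(dev) + 0.1)) break  # deviance lags one step behind
    dev.old <- dev
  }
  list(coefficients = beta, deviance = dev, iterations = it)
}

Only the p x p and length-p accumulators survive between chunks, so working memory is set by the chunk size rather than by the number of observations.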
On Wed, 23 Aug 2006, Thomas Lumley < tlumley at u.washington.edu > wrote:

> I have written most of a bigglm() function where the data= argument is a
> function with a single argument 'reset'. When called with reset=FALSE the
> function should return another chunk of data, or NULL if no data are
> available, and when called with reset=TRUE it should go back to the
> beginning of the data. I don't think this is too inelegant.
>
> In general I don't think a one-pass algorithm is possible. If the data are
> in random order then you could read one chunk, fit a glm, and set up a grid
> of coefficient values around the estimate. You then read the rest of the
> data, computing the loglikelihood and score function at each point in the
> grid. After reading all the data you can then fit a suitable smooth surface
> to the loglikelihood. I don't know whether this will give sufficient
> accuracy, though.
>
> For really big data sets you are probably better off with the approach that
> Brian Ripley and Fei Chen used -- they have shown that it works, and there
> is unlikely to be anything much simpler that also works that they missed.
>
> -thomas
>
> Thomas Lumley                     Assoc. Professor, Biostatistics
> tlumley at u.washington.edu         University of Washington, Seattle

Yes, that does sound like a pretty elegant solution. It would be even more so if you could offer a default implementation of the data function that simply passes chunks of large X and y matrices held in memory. (Ideally you would just initialize the data function to reference the X and y data to avoid duplicating it; I don't know whether that's possible in R.) How long before it's ready? :)
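A minimal sketch of the in-memory default Damien asks for: a closure over a data frame that hands out one chunk per call and rewinds when reset=TRUE. The function and argument names are made up; the reset protocol is the one Thomas describes above, and because R only copies on modification the closure does not duplicate the data frame it captures:

make.data.fn <- function(df, chunk.size = 100000) {
  pos <- 0                                    # rows handed out so far
  function(reset = FALSE) {
    if (reset) { pos <<- 0; return(NULL) }    # go back to the beginning of the data
    if (pos >= nrow(df)) return(NULL)         # no more data in this pass
    idx <- (pos + 1):min(pos + chunk.size, nrow(df))
    pos <<- max(idx)
    df[idx, , drop = FALSE]                   # return the next chunk
  }
}

# hypothetical use once bigglm() is released:
#   fit <- bigglm(y ~ x1 + x2, data = make.data.fn(mydata),
#                 family = binomial(link = "probit"))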