Hi All: I'm new to R and have a few questions about getting R to run efficiently with large datasets.

I'm running R on Windows XP with 1Gb RAM (so about 600-700Mb after the usual Windows overhead). I have a dataset with 4 million observations and about 20 variables. I want to run probit regressions on this data, but can't do this with more than about 500,000 observations before I start running out of RAM (you could argue that I'm getting sufficient precision with <500,000 obs, but let's pretend otherwise). Loading 500,000 observations into a data frame only takes about 100Mb of RAM, so that isn't the problem. Instead it seems R uses a huge amount of memory when running the glm methods. I called the Fortran routines that lm and glm use directly, but even they create a large number of extraneous variables in the output (e.g. the Xs, ys, residuals etc.) and during processing. For instance (sample code):

x = runif(1000000)
y = 3*x + rnorm(1000000)  # I notice this step chews up a lot more than the
                          # 7Mb of RAM required to store y during processing,
                          # but cleans up OK afterwards with a gc() call
X = cbind(x)
p = ncol(X)
n = NROW(y)
ny = NCOL(y)
tol = 1e-7
# this is the Fortran routine called by lm - regressing y on X here
z <- .Fortran("dqrls", qr = X, n = n, p = p, y = y, ny = ny,
              tol = as.double(tol), coefficients = mat.or.vec(p, ny),
              residuals = y, effects = y, rank = integer(1), pivot = 1:p,
              qraux = double(p), work = double(2 * p), PACKAGE = "base")

This code runs very quickly - suggesting that in principle R should have no problem at all handling very large data sets - but it uses >100Mb during processing and z is about a 20Mb object. Scaling this up to a much larger dataset with many variables, it's easy to see I'm going to run into problems.

My questions:
1. Are there any memory-efficient alternatives to lm/glm in R?
2. Is there any way to prevent the Fortran routine "dqrls" from producing so much output? (I suspect not, since its output has to be compatible with the summary method, which seems to rely on having a copy of all variables instead of just references to the relevant variables - correct me if I'm wrong on this.)
3. Failing 1 & 2, how easy would it be to create new versions of lm and glm that don't use so much memory? (Not that I'm volunteering or anything ;) ). There is no need to hold individual residuals in memory or make copies of the variables (at least for my purposes). How well documented is the source code?

cheers
Damien Moore
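A quick illustration of why the least-squares part, at least, need not use much memory: the cross-products X'X and X'y can be accumulated one chunk of rows at a time, so the full design matrix never has to be held at once. A minimal sketch using the x and y above (the chunk size is arbitrary, and the normal equations are less numerically stable than the QR factorization that dqrls performs):

# accumulate X'X and X'y over chunks of rows, then solve once at the end
chunk.size <- 100000
xtx <- matrix(0, 2, 2)                      # intercept + x
xty <- numeric(2)
for (start in seq(1, length(y), by = chunk.size)) {
  idx <- start:min(start + chunk.size - 1, length(y))
  Xc  <- cbind(1, x[idx])                   # design-matrix chunk
  xtx <- xtx + crossprod(Xc)
  xty <- xty + drop(crossprod(Xc, y[idx]))
  rm(Xc)
}
beta <- solve(xtx, xty)                     # agrees with coef(lm(y ~ x)) up to rounding

Accumulating sufficient statistics chunk by chunk is essentially what the biglm package mentioned below does, except that it maintains an incremental QR decomposition rather than the raw cross-products.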
For very large regression problems there is the biglm package (put your data into a database, read in 500,000 rows at a time, and keep updating the fit).

This has not been extended to glm yet.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111
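A rough sketch of the chunked workflow Greg describes, reading from a flat file rather than a database; the file name "big.csv", the column names, and the chunk size are all made up, and the pattern is simply to fit biglm() on the first chunk and fold in each later chunk with update():

library(biglm)

cols <- c("y", "x1", "x2")
con  <- file("big.csv", open = "r")      # hypothetical file with a header row
invisible(readLines(con, n = 1))         # skip the header line

chunk <- read.csv(con, nrows = 500000, header = FALSE, col.names = cols)
fit   <- biglm(y ~ x1 + x2, data = chunk)        # fit on the first chunk

repeat {
  chunk <- try(read.csv(con, nrows = 500000, header = FALSE, col.names = cols),
               silent = TRUE)                    # the open connection resumes where it left off
  if (inherits(chunk, "try-error") || nrow(chunk) == 0) break
  fit <- update(fit, chunk)                      # fold the next chunk into the fit
}
close(con)
summary(fit)

The working memory is then roughly one chunk plus the (small) biglm object, regardless of how many rows the file has.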
Greg Snow wrote:

> For very large regression problems there is the biglm package (put your
> data into a database, read in 500,000 rows at a time, and keep updating
> the fit).

Thanks. I took a look at biglm and it seems pretty easy to use and, looking at the source, it avoids much of the redundancy of lm. Correct me if I'm wrong, but I think it would be virtually impossible to extend to glm, because of the non-linearity in glm models.

I might hack around at the source code for glm.fit -- I think I can avoid some of the redundancy involved in that routine pretty easily, but it will mean rewriting the summary output code...

cheers
Damien
On Mon, 21 Aug 2006, Damien Moore wrote:

> Thanks. I took a look at biglm and it seems pretty easy to use and,
> looking at the source, it avoids much of the redundancy of lm. Correct me
> if I'm wrong, but I think it would be virtually impossible to extend to
> glm, because of the non-linearity in glm models.
>
> I might hack around at the source code for glm.fit -- I think I can
> avoid some of the redundancy involved in that routine pretty easily, but
> it will mean rewriting the summary output code...

Damien,

If you know what is 'under the hood' of glm, you can use the biglm approach to perform a one-step update of the coefficients of a glm model. There is plenty of theory for one-step estimators that use consistent estimates as starting values. You can probably get a good starting value by averaging all of the results returned by slicing the data set into smaller pieces and running glm.fit on each of them.

Chuck
Charles C. Berry                        (858) 534-2098
                                        Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu           UC San Diego
http://biostat.ucsd.edu/~cberry/        La Jolla, San Diego 92093-0717
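A minimal sketch of the starting-value half of Chuck's suggestion (not his code), using the probit family from the original question; the design matrix X with an intercept column, the 0/1 response y, and the choice of 8 slices are all assumptions:

fam    <- binomial(link = "probit")
pieces <- split(seq_len(nrow(X)), cut(seq_len(nrow(X)), 8, labels = FALSE))

# average the coefficients from glm.fit run on each slice separately
beta.start <- rowMeans(sapply(pieces, function(i)
  glm.fit(X[i, , drop = FALSE], y[i], family = fam)$coefficients))

# one Fisher-scoring (weighted least squares) pass over the full data,
# started from beta.start, then gives the one-step estimator

Each slice only needs to be small enough for glm.fit to handle comfortably in memory.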
Thomas Lumley wrote:

> No, it is quite straightforward if you are willing to make multiple passes
> through the data. It is hard with a single pass and may not be possible
> unless the data are in random order.
>
> Fisher scoring for glms is just an iterative weighted least squares
> calculation using a set of 'working' weights and 'working' response. These
> can be defined chunk by chunk and fed to biglm. Three iterations should
> be sufficient.

(NB: although not stated clearly, I was referring to a single pass when I wrote "impossible".) Doing as you suggest with multiple passes would entail either sticking the database input calls into the main iterative loop of a lookalike glm.fit, or lumping the user with a very unattractive sequence of calls:

big_glm.init
iterate:
    load_data_chunk
    big_glm.newiter
    iterate:  # could use a subset of the chunks on the first few go-rounds
        load_data_chunk
        update.big_glm
    big_glm.check_convergence  # would also need to do coefficient adjustments if convergence is failing

Because most (if not all) of my data can fit into memory anyway, I propose simply doing the calculations in a modified glm.fit in chunks (i.e. by subsetting the X and y data matrices within the loops) with a user-defined chunk length. I can always add database input calls later to handle exceptionally large datasets. If one of you has a better suggestion, I'm willing to hear it.

So far, I have hacked out a lot of the (in my view) extraneous stuff from glm and halved its memory usage. I can now run a 12-variable, 1-million-observation data set using "only" 200Mb of working memory (excluding the memory required to store the data). Previously glm.fit was using 500Mb to do the same. Getting convergence took 9 iterations either way. To reiterate: the inefficiency is in calculating the estimates, not in storing the data.
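A minimal sketch of what such a chunked glm.fit lookalike might compute, again for the probit case; X (a design matrix including an intercept column), y, the chunk size, the zero starting value and the convergence rule are all assumptions, and nothing per-observation (residuals, fitted values, model frame) is retained:

chunked.glm <- function(X, y, family = binomial("probit"),
                        chunk.size = 100000, maxit = 25, tol = 1e-8) {
  n <- nrow(X); p <- ncol(X)
  chunks <- split(seq_len(n), ceiling(seq_len(n) / chunk.size))
  beta <- rep(0, p); dev.old <- Inf
  for (it in seq_len(maxit)) {
    XtWX <- matrix(0, p, p); XtWz <- numeric(p); dev <- 0
    for (i in chunks) {
      Xi  <- X[i, , drop = FALSE]
      eta <- drop(Xi %*% beta)
      mu  <- family$linkinv(eta)
      m.e <- family$mu.eta(eta)
      w   <- m.e^2 / family$variance(mu)        # working weights
      z   <- eta + (y[i] - mu) / m.e            # working response
      XtWX <- XtWX + crossprod(Xi * w, Xi)      # accumulate X'WX
      XtWz <- XtWz + crossprod(Xi * w, z)       # accumulate X'Wz
      dev  <- dev + sum(family$dev.resids(y[i], mu, rep(1, length(i))))
    }
    beta <- drop(solve(XtWX, XtWz))             # Fisher scoring update
    if (abs(dev - dev.old) < tol * (abs(dev) + 0.1)) break  # deviance lags one step behind
    dev.old <- dev
  }
  list(coefficients = beta, deviance = dev, iterations = it)
}

Only the p x p and length-p accumulators survive between chunks, so working memory is set by the chunk size rather than by the number of observations.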
On Wed, 23 Aug 2006, Thomas Lumley < tlumley at u.washington.edu > wrote:

> I have written most of a bigglm() function where the data= argument is a
> function with a single argument 'reset'. When called with reset=FALSE the
> function should return another chunk of data, or NULL if no data are
> available, and when called with reset=TRUE it should go back to the
> beginning of the data. I don't think this is too inelegant.
>
> In general I don't think a one-pass algorithm is possible. If the data are
> in random order then you could read one chunk, fit a glm, and set up a grid
> of coefficient values around the estimate. You then read the rest of the
> data, computing the loglikelihood and score function at each point in the
> grid. After reading all the data you can then fit a suitable smooth surface
> to the loglikelihood. I don't know whether this will give sufficient
> accuracy, though.
>
> For really big data sets you are probably better off with the approach that
> Brian Ripley and Fei Chen used -- they have shown that it works, and there
> is unlikely to be anything much simpler that also works that they missed.
>
> -thomas
>
> Thomas Lumley                     Assoc. Professor, Biostatistics
> tlumley at u.washington.edu         University of Washington, Seattle

Yes, that does sound like a pretty elegant solution. It would be even more so if you could offer a default implementation of the data function that simply passes chunks of large X and y matrices held in memory. (Ideally you would just initialize the data function to reference the X and y data to avoid duplicating it; I don't know whether that's possible in R.) How long before it's ready? :)
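A minimal sketch of the in-memory default Damien asks for: a closure over a data frame that hands out one chunk per call and rewinds when reset=TRUE. The function and argument names are made up; the reset protocol is the one Thomas describes above, and because R only copies on modification the closure does not duplicate the data frame it captures:

make.data.fn <- function(df, chunk.size = 100000) {
  pos <- 0                                    # rows handed out so far
  function(reset = FALSE) {
    if (reset) { pos <<- 0; return(NULL) }    # go back to the beginning of the data
    if (pos >= nrow(df)) return(NULL)         # no more data in this pass
    idx <- (pos + 1):min(pos + chunk.size, nrow(df))
    pos <<- max(idx)
    df[idx, , drop = FALSE]                   # return the next chunk
  }
}

# hypothetical use once bigglm() is released:
#   fit <- bigglm(y ~ x1 + x2, data = make.data.fn(mydata),
#                 family = binomial(link = "probit"))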