I'd like to fit linear models on very large datasets. My data frames are about 2,000,000 rows x 200 columns of doubles, and I am using a 64-bit build of R. I've googled this extensively and gone over the "R Data Import/Export" guide. My primary issue is that although my data in ASCII form is about 4 GB in size (and therefore much smaller in binary), R consumes about 12 GB of virtual memory.

What exactly are my options to improve this? I looked into the biglm package, but the problem with it is that it relies on the update() function and is therefore not transparent (I am using a sophisticated script which is hard to modify). I really liked the concept behind the LM package here: http://www.econ.uiuc.edu/~roger/research/rq/RMySQL.html
But it is no longer available.

How could one fit linear models to very large datasets without loading the entire set into memory, reading instead from a file/database (possibly through a connection), using a relatively simple modification of standard lm()? Alternatively, how could one improve the memory usage of R given a large dataset (by changing some default parameters of R, or even using on-the-fly compression)? I don't mind much higher CPU time.

Thank you in advance for your help.
Here are a couple of options that you could look at:

The biglm package also has the bigglm() function, which you only call once (no update()); you just need to give it a function that reads the data in chunks for you. Using bigglm() with a gaussian family is equivalent to lm(). You could also write your own function that calls biglm() and the necessary update()s on it, then just call your function.

The SQLiteDF package has an sdflm() function that uses the same internal code as biglm, but it works from data stored in an SQLite database, so you don't need to call update() with this function.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111
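A minimal sketch of the chunk-reading approach Greg describes: bigglm() accepts a data function that returns the next chunk of rows when called with reset = FALSE and rewinds to the start when called with reset = TRUE. The file name "bigdata.txt", the chunk size, and the variables y, x1, x2 in the formula are placeholders, not details from the original post.

library(biglm)

## Chunk reader for bigglm(): returns the next chunk_rows rows as a data
## frame, or NULL when the current pass over the file is finished.
make_chunk_reader <- function(path, chunk_rows = 50000) {
  con <- NULL
  header <- NULL
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(con)) close(con)
      con <<- file(path, open = "r")
      header <<- strsplit(readLines(con, n = 1), "[ \t]+")[[1]]  # skip header row
      return(NULL)
    }
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) return(NULL)      # end of this pass
    read.table(text = lines, col.names = header)
  }
}

## With a gaussian family this is the lm() fit, computed chunk by chunk.
fit <- bigglm(y ~ x1 + x2, data = make_chunk_reader("bigdata.txt"),
              family = gaussian())
summary(fit)

Note that bigglm() makes several passes over the data, so the reader must be able to rewind; that is what the reset = TRUE branch does.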
On Thu, Aug 16, 2007 at 03:24:08PM -0500, Alp ATICI wrote:
> [...]

One option is simply to buy more memory, which might work for you in this case but is not scalable to larger problems. I'm not sure how to make R happier with handling large datasets, but you may be able to use the power of random sampling to help you.

Read the data from MySQL, selecting a random 10% subset; that should use about 1.2 GB. Fit the model to this subset, then repeat the procedure 100 times with independent samples. You have now bootstrapped the coefficients of your model: use the average value and standard deviation of the coefficients as your coefficient estimates and standard errors?

Since swapping is typically a thousand times slower or more than working in RAM, this process might take a tenth of the time or less compared to letting the R process thrash its disk. It's a thought; I'm not sure how well it works.

--
Daniel Lakeland
dlakelan at street-artists.org
http://www.street-artists.org/~dlakelan
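A rough sketch of the repeated-subsampling idea in R, assuming the data sit in a MySQL table; the connection details, the table name bigtable, and the response column y are placeholders. The condition RAND() < 0.1 asks the server for an approximately 10% random subset on each query.

library(RMySQL)

con <- dbConnect(MySQL(), dbname = "mydb", user = "me", password = "secret")

n_reps <- 100
coefs <- replicate(n_reps, {
  ## draw an independent ~10% subset server-side and fit an ordinary lm() to it
  sub <- dbGetQuery(con, "SELECT * FROM bigtable WHERE RAND() < 0.1")
  coef(lm(y ~ ., data = sub))
})

dbDisconnect(con)

## Combine the replicated fits: mean as the point estimate, spread as a
## rough standard error, as suggested above.
cbind(estimate = rowMeans(coefs), std.error = apply(coefs, 1, sd))

One caveat: estimates from 10% subsets are more variable than a fit to the full data, so the spread reported here overstates the standard error of a full-data fit.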
It's actually only a few lines of code to do this from first principles. The coefficients depend only on the cross products X'X and X'y, and you can build those up easily by extending this example to read from files or a database holding x and y instead of getting them from the args. Here we process incr rows of the built-in matrix state.x77 at a time, building up the two cross products, xtx and xty, and regressing Income (variable 2) on the other variables:

mylm <- function(x, y, incr = 25) {
  start <- xtx <- xty <- 0
  while (start < nrow(x)) {
    idx <- seq(start + 1, min(start + incr, nrow(x)))
    x1 <- cbind(1, x[idx, ])              # add the intercept column
    xtx <- xtx + crossprod(x1)            # accumulate X'X
    xty <- xty + crossprod(x1, y[idx])    # accumulate X'y
    start <- start + incr
  }
  solve(xtx, xty)
}

mylm(state.x77[, -2], state.x77[, 2])

On 8/16/07, Alp ATICI <alpatici at gmail.com> wrote:
> [...]
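A hedged sketch of the extension Gabor mentions, reading the data from a file in chunks rather than taking x and y as arguments. It assumes a whitespace-delimited file "bigdata.txt" with a header row and the response in a column named "y"; the file name, column name, and chunk size are placeholders.

mylm_file <- function(path, incr = 50000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- strsplit(readLines(con, n = 1), "[ \t]+")[[1]]
  ycol <- match("y", header)
  xtx <- xty <- 0
  repeat {
    lines <- readLines(con, n = incr)
    if (length(lines) == 0) break
    chunk <- as.matrix(read.table(text = lines, col.names = header))
    x1 <- cbind(1, chunk[, -ycol, drop = FALSE])  # intercept plus predictors
    xtx <- xtx + crossprod(x1)                    # accumulate X'X
    xty <- xty + crossprod(x1, chunk[, ycol])     # accumulate X'y
  }
  solve(xtx, xty)
}

Only the small (roughly 200 x 200) cross-product matrices are ever held in memory, regardless of how many rows the file has.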
> It's actually only a few lines of code to do this from first principles.
> The coefficients depend only on the cross products X'X and X'y, and you
> can build those up easily by extending this example to read from files
> or a database holding x and y instead of getting them from the args.
> [...]

If your design matrix X is very well behaved this approach may work for you, but often just doing solve(X'X, X'y) will fail for numerical reasons. The right way to do it is to factor the matrix X as X = A * B, where B is 200 x 200 in your case and A is 2000000 x 200 with A'*A = I (that is, the columns of A are orthonormal). Then X'*X = B'*B, and you solve B'*B * beta = X'*y (equivalently B * beta = A'*y).

To find A and B you can use modified Gram-Schmidt, which is very easy to program and works well when you wish to store the columns of X on a hard disk and just read in a bit at a time. Some people claim that modified Gram-Schmidt is unstable, but it has always worked well for me. In any event you can check that A'*A = I and X = A*B.

Cheers,

Dave

--
David A. Fournier
P.O. Box 2040, Sidney, B.C. V8L 3S3 Canada
Phone/FAX 250-655-3364
http://otter-rsch.com
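A small in-memory illustration of the factorization David describes, using modified Gram-Schmidt to compute X = A * B with orthonormal A and upper-triangular B, then solving B * beta = A'y by back-substitution. For the out-of-core case he has in mind, the columns of X would live on disk and be read in blocks, but the arithmetic is the same. The function and argument names here are my own.

mgs_lm <- function(x, y) {
  x <- cbind(Intercept = 1, as.matrix(x))    # design matrix with intercept
  p <- ncol(x)
  A <- x
  B <- matrix(0, p, p)
  for (j in seq_len(p)) {
    B[j, j] <- sqrt(sum(A[, j]^2))           # norm of the j-th column
    A[, j]  <- A[, j] / B[j, j]              # normalise it
    if (j < p) {
      for (k in (j + 1):p) {
        B[j, k] <- sum(A[, j] * A[, k])      # project remaining columns on it
        A[, k]  <- A[, k] - A[, j] * B[j, k] # ...and remove that component
      }
    }
  }
  ## B is upper triangular, so back-substitution solves B %*% beta = A'y
  setNames(drop(backsolve(B, crossprod(A, y))), colnames(x))
}

## Same example as above: should agree with mylm() and with
## coef(lm(Income ~ ., data = as.data.frame(state.x77)))
mgs_lm(state.x77[, -2], state.x77[, 2])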