Here's a skeletal example.  Embellish as needed:

p <- 5
n <- 300
set.seed(1)
dat <- cbind(rnorm(n), matrix(runif(n * p), n, p))
write.table(dat, file="c:/temp/big.txt", row.names=FALSE, col.names=FALSE)

xtx <- matrix(0, p + 1, p + 1)
xty <- numeric(p + 1)
f <- file("c:/temp/big.txt", open="r")
for (i in 1:3) {
    x <- matrix(scan(f, nlines=100), 100, p + 1, byrow=TRUE)
    xtx <- xtx + crossprod(cbind(1, x[, -1]))
    xty <- xty + crossprod(cbind(1, x[, -1]), x[, 1])
}
close(f)
solve(xtx, xty)
coef(lm.fit(cbind(1, dat[, -1]), dat[, 1]))  ## check result

unlink("c:/temp/big.txt")  ## clean up

Andy

-----Original Message-----
From: Sachin J [mailto:sachinj.2006@yahoo.com]
Sent: Monday, April 24, 2006 5:09 PM
To: Liaw, Andy; R-help@stat.math.ethz.ch
Subject: RE: [R] Handling large dataset & dataframe [Broadcast]

Hi Andy:

I searched through the R archives to find out how to handle a large data set
using readLines and other related R functions, but I couldn't find any single
post that elaborates the process.  Can you provide me with an example or any
pointers to postings that explain it?

Thanx in advance
Sachin

"Liaw, Andy" <andy_liaw@merck.com> wrote:

Instead of reading the entire data set in at once, you read a chunk at a
time, compute X'X and X'y on that chunk, and accumulate (i.e., add) them.
There are examples in "S Programming", taken from independent replies by the
two authors to a post on S-news, if I remember correctly.

Andy

From: Sachin J
>
> Gabor:
>
> Can you elaborate more.
>
> Thanx
> Sachin
>
> Gabor Grothendieck wrote:
> You just need the much smaller cross-product matrix X'X and
> vector X'y, so you can build those up as you read the data in
> in chunks.
>
> On 4/24/06, Sachin J wrote:
> > Hi,
> >
> > I have a dataset consisting of 350,000 rows and 266 columns.  Out of the
> > 266 columns, 250 are dummy-variable columns.  I am trying to read this
> > data set into an R data frame object but am unable to do so due to
> > memory size limitations (the object created is too large to handle in R).
> > Is there a way to handle such a large dataset in R?
> >
> > My PC has 1 GB of RAM and 55 GB of hard disk space, running Windows XP.
> >
> > Any pointers would be of great help.
> >
> > TIA
> > Sachin
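A minimal sketch of the same chunked approach, generalized to a file of unknown
length: read a fixed number of rows per pass until the connection is exhausted,
accumulate X'X and X'y as described above, and then solve the normal equations
b = (X'X)^{-1} X'y.  The helper name chunk_lm, the chunk size, and the
assumption that the data are whitespace-delimited with no header and the
response in the first column are illustrative choices, not taken from the
original posts.

## Sketch only, under the assumptions stated above (not from the original thread).
chunk_lm <- function(filename, p, chunk = 10000) {
    xtx <- matrix(0, p + 1, p + 1)   # accumulator for X'X (intercept included)
    xty <- numeric(p + 1)            # accumulator for X'y
    con <- file(filename, open = "r")
    on.exit(close(con))
    repeat {
        ## read.table() errors once the connection is drained; treat that as EOF
        dat <- tryCatch(read.table(con, nrows = chunk, colClasses = "numeric"),
                        error = function(e) NULL)
        if (is.null(dat) || nrow(dat) == 0) break
        m <- as.matrix(dat)
        X <- cbind(1, m[, -1, drop = FALSE])   # intercept + predictors (cols 2..p+1)
        y <- m[, 1]                            # response assumed in column 1
        xtx <- xtx + crossprod(X)
        xty <- xty + crossprod(X, y)
    }
    drop(solve(xtx, xty))            # b = (X'X)^{-1} X'y
}

## For example, against the toy file above (before it is unlink()ed):
## chunk_lm("c:/temp/big.txt", p = 5, chunk = 100)

Solving the normal equations this way only ever holds one chunk plus two small
(p+1)-sized accumulators in memory, which is why it scales to files that cannot
be read whole.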
Gabor Grothendieck
2006-Apr-25 05:17 UTC
[R] Handling large dataset & dataframe [Broadcast]
The other thing you could try, after doing this, is to sample some rows from
your data and see whether the subset gives nearly the same answer as the
entire data set.
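A minimal sketch of this sampling check, assuming the poster's 350,000-row file
is whitespace-delimited with no header and the response in its first column (as
in the toy example above); the file name, the seed, and the 10,000-row sample
size are placeholders, not from the original post.

## Sketch only, under the assumptions stated above (not from the original thread).
set.seed(2)
n <- 350000                        # total rows, from the poster's description
keep <- sort(sample(n, 10000))     # row numbers to retain (without replacement)

con <- file("c:/temp/bigdata.txt", open = "r")   # hypothetical path to the full file
rows <- vector("list", length(keep))
prev <- 0
for (i in seq_along(keep)) {
    skip <- keep[i] - prev - 1
    if (skip > 0) invisible(readLines(con, n = skip))   # skip unsampled rows
    rows[[i]] <- scan(text = readLines(con, n = 1), quiet = TRUE)
    prev <- keep[i]
}
close(con)
samp <- do.call(rbind, rows)

## Coefficients from the subset; compare with the full-data chunked fit,
## e.g. chunk_lm("c:/temp/bigdata.txt", p = ncol(samp) - 1)
coef(lm.fit(cbind(1, samp[, -1]), samp[, 1]))

If the subset coefficients come out close to the full-data ones, fitting on a
sample may be all that is needed.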