Here's a skeletal example.  Embellish as needed:

p <- 5
n <- 300
set.seed(1)
dat <- cbind(rnorm(n), matrix(runif(n * p), n, p))
write.table(dat, file="c:/temp/big.txt", row.names=FALSE, col.names=FALSE)

xtx <- matrix(0, p + 1, p + 1)
xty <- numeric(p + 1)
f <- file("c:/temp/big.txt", open="r")
for (i in 1:3) {
    x <- matrix(scan(f, nlines=100), 100, p + 1, byrow=TRUE)
    xtx <- xtx + crossprod(cbind(1, x[, -1]))
    xty <- xty + crossprod(cbind(1, x[, -1]), x[, 1])
}
close(f)
solve(xtx, xty)
coef(lm.fit(cbind(1, dat[, -1]), dat[, 1]))  ## check result

unlink("c:/temp/big.txt")  ## clean up

Andy

-----Original Message-----
From: Sachin J [mailto:sachinj.2006@yahoo.com]
Sent: Monday, April 24, 2006 5:09 PM
To: Liaw, Andy; R-help@stat.math.ethz.ch
Subject: RE: [R] Handling large dataset & dataframe [Broadcast]

Hi Andy:

I searched through the R archives to find out how to handle a large data set
using readLines and other related R functions, but I couldn't find any single
post that elaborates the process.  Can you provide me with an example or any
pointers to postings that explain it?

Thanx in advance
Sachin

"Liaw, Andy" <andy_liaw@merck.com> wrote:

Instead of reading the entire data set in at once, you read a chunk at a
time, compute X'X and X'y on that chunk, and accumulate (i.e., add) them.
There are examples in "S Programming", taken from independent replies by the
two authors to a post on S-news, if I remember correctly.

Andy

From: Sachin J
>
> Gabor:
>
> Can you elaborate more.
>
> Thanx
> Sachin
>
> Gabor Grothendieck wrote:
> You just need the much smaller cross-product matrix X'X and
> vector X'y, so you can build those up as you read the data in
> in chunks.
>
> On 4/24/06, Sachin J wrote:
> > Hi,
> >
> > I have a dataset consisting of 350,000 rows and 266 columns.  Out of the
> > 266 columns, 250 are dummy-variable columns.  I am trying to read this
> > data set into an R data frame object but am unable to do so due to
> > memory size limitations (the object created is too large to handle in R).
> > Is there a way to handle such a large dataset in R?
> >
> > My PC has 1 GB of RAM and 55 GB of hard disk space, running Windows XP.
> >
> > Any pointers would be of great help.
> >
> > TIA
> > Sachin
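A minimal sketch of the same chunked approach, generalized to a file of unknown
length: read a fixed number of rows per pass until the connection is exhausted,
accumulate X'X and X'y as described above, and then solve the normal equations
b = (X'X)^{-1} X'y.  The helper name chunk_lm, the chunk size, and the
assumption that the data are whitespace-delimited with no header and the
response in the first column are illustrative choices, not taken from the
original posts.

## Sketch only, under the assumptions stated above (not from the original thread).
chunk_lm <- function(filename, p, chunk = 10000) {
    xtx <- matrix(0, p + 1, p + 1)   # accumulator for X'X (intercept included)
    xty <- numeric(p + 1)            # accumulator for X'y
    con <- file(filename, open = "r")
    on.exit(close(con))
    repeat {
        ## read.table() errors once the connection is drained; treat that as EOF
        dat <- tryCatch(read.table(con, nrows = chunk, colClasses = "numeric"),
                        error = function(e) NULL)
        if (is.null(dat) || nrow(dat) == 0) break
        m <- as.matrix(dat)
        X <- cbind(1, m[, -1, drop = FALSE])   # intercept + predictors (cols 2..p+1)
        y <- m[, 1]                            # response assumed in column 1
        xtx <- xtx + crossprod(X)
        xty <- xty + crossprod(X, y)
    }
    drop(solve(xtx, xty))            # b = (X'X)^{-1} X'y
}

## For example, against the toy file above (before it is unlink()ed):
## chunk_lm("c:/temp/big.txt", p = 5, chunk = 100)

Solving the normal equations this way only ever holds one chunk plus two small
(p+1)-sized accumulators in memory, which is why it scales to files that cannot
be read whole.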
Gabor Grothendieck
2006-Apr-25 05:17 UTC
[R] Handling large dataset & dataframe [Broadcast]
The other thing you could try, after doing this, is to sample some rows from
your data and see whether the subset gives nearly the same answer as the
entire data set.
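A minimal sketch of this sampling check, assuming the poster's 350,000-row file
is whitespace-delimited with no header and the response in its first column (as
in the toy example above); the file name, the seed, and the 10,000-row sample
size are placeholders, not from the original post.

## Sketch only, under the assumptions stated above (not from the original thread).
set.seed(2)
n <- 350000                        # total rows, from the poster's description
keep <- sort(sample(n, 10000))     # row numbers to retain (without replacement)

con <- file("c:/temp/bigdata.txt", open = "r")   # hypothetical path to the full file
rows <- vector("list", length(keep))
prev <- 0
for (i in seq_along(keep)) {
    skip <- keep[i] - prev - 1
    if (skip > 0) invisible(readLines(con, n = skip))   # skip unsampled rows
    rows[[i]] <- scan(text = readLines(con, n = 1), quiet = TRUE)
    prev <- keep[i]
}
close(con)
samp <- do.call(rbind, rows)

## Coefficients from the subset; compare with the full-data chunked fit,
## e.g. chunk_lm("c:/temp/bigdata.txt", p = ncol(samp) - 1)
coef(lm.fit(cbind(1, samp[, -1]), samp[, 1]))

If the subset coefficients come out close to the full-data ones, fitting on a
sample may be all that is needed.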