On Tue, 24 Mar 2009, Guillaume Filteau wrote:
> Hello all,
>
> I'm trying to take a huge dataset (1.5 GB) and separate it into smaller
> chunks with R.
>
> So far I have had nothing but problems.
>
> I cannot load the whole dataset in R due to memory problems. So, I instead
> try to load a few (100000) lines at a time (with read.table).
>
> However, R kept crashing (with no error message) at about line 6800000.
> This is extremely frustrating.
>
> To try to fix this, I used connections with read.table. However, I now get
> a cryptic error telling me "no lines available in input".
>
> Is there any way to make this work?
>
There might be an error in line 42 of your script. Or somewhere else. The error
message is cryptically saying that there were no lines of text available in the
input connection, so presumably the connection wasn't pointed at your file
correctly.
It's hard to guess without seeing what you are doing, but
conn <- file("mybigfile", open = "r")
## read the first chunk, using the header line to get the column names
chunk <- read.table(conn, header = TRUE, nrows = 10000)
nms <- names(chunk)
repeat {
  ## do something to the chunk
  if (nrow(chunk) < 10000) break    # a short chunk means the file is exhausted
  chunk <- read.table(conn, header = FALSE, nrows = 10000, col.names = nms)
}
close(conn)
should work. This sort of thing certainly does work routinely. (One caveat: if the
number of data rows happens to be an exact multiple of the chunk size, the last
read.table call will find nothing left to read and will raise that same "no lines
available in input" error; wrapping that call in tryCatch and treating the error as
end-of-file takes care of it.)
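Since the aim is just to split the big file into smaller pieces, the "do something"
step can simply write each chunk back out with write.table. A sketch along those
lines (the chunk_%03d.txt output names, the tab separator, and the 10,000-row chunk
size are only placeholders to adapt):
conn <- file("mybigfile", open = "r")
chunk <- read.table(conn, header = TRUE, nrows = 10000)
nms <- names(chunk)
i <- 0
repeat {
  i <- i + 1
  ## write this chunk out as its own smaller file: chunk_001.txt, chunk_002.txt, ...
  ## (each piece gets its own header row, since col.names defaults to TRUE)
  write.table(chunk, file = sprintf("chunk_%03d.txt", i),
              sep = "\t", quote = FALSE, row.names = FALSE)
  if (nrow(chunk) < 10000) break    # short chunk: end of file
  chunk <- read.table(conn, header = FALSE, nrows = 10000, col.names = nms)
}
close(conn)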
It's probably not worth reading 100,000 lines at a time unless your computer
has a lot of memory. Reducing the chunk size to 10,000 shouldn't introduce
much extra overhead and may well increase the speed by reducing memory use.
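If memory is still tight, telling read.table the column types up front with
colClasses should also help, since it avoids the type-guessing pass over every
chunk. For example, the read inside the loop could become something like this
(the types shown are purely illustrative and have to match the actual columns
of the file):
## colClasses must describe the real columns of "mybigfile";
## the types below are only placeholders
chunk <- read.table(conn, header = FALSE, nrows = 10000, col.names = nms,
                    colClasses = c("numeric", "numeric", "character"))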
-thomas
Thomas Lumley
Assoc. Professor, Biostatistics
tlumley at u.washington.edu
University of Washington, Seattle