I know I could figure it out empirically - but maybe based on your experience you can tell me if it's doable in a reasonable amount of time:

I have a table (in .txt) with 17,000,000 rows (and 30 columns). I can't read it all in at once (there are many strings). So I thought I could read it in parts (e.g., 1 million rows at a time) using nrows= and skip=.

I was able to read in the first 1,000,000 rows with no problem in 45 sec. But then I tried to skip 16,999,999 rows and read in the rest, and R crashed. Should I try again - or is that too many rows for R to skip?

Thank you!

--
Dimitri Liakhovitski
Ninah Consulting
www.ninah.com
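Worth noting about skip=: read.table still has to scan through every skipped line from the start of the file, so each successive chunk takes longer to reach. A minimal sketch of an alternative is to read successive chunks from one open connection, so each call resumes where the previous one stopped (the file name "bigfile.txt", the tab delimiter, and the column classes here are placeholders, not from the thread):

con <- file("bigfile.txt", open = "r")       # placeholder file name
header <- readLines(con, n = 1)              # consume the header line
classes <- rep("character", 30)              # replace with the real column classes
repeat {
  chunk <- tryCatch(
    read.table(con, nrows = 1e6, sep = "\t", header = FALSE,
               colClasses = classes),
    error = function(e) NULL)                # read.table errors once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  ## ... process `chunk` here ...
}
close(con)

Supplying colClasses should also help with a file this size, since read.table otherwise has to work out the column types itself.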
> Date: Fri, 22 Oct 2010 17:17:58 -0400
> From: dimitri.liakhovitski at gmail.com
> To: r-help at r-project.org
> Subject: [R] How long does skipping in read.table take
>
> I know I could figure it out empirically - but maybe based on your
> experience you can tell me if it's doable in a reasonable amount of
> time:
> I have a table (in .txt) with 17,000,000 rows (and 30 columns).
> I can't read it all in (there are many strings). So I thought I could
> read it in in parts (e.g., 1 million) using nrows= and skip.
> I was able to read in the first 1,000,000 rows no problem in 45 sec.
> But then I tried to skip 16,999,999 rows and then read in things. Then
> R crashed. Should I try again - or is it too many rows to skip for R?

I've seen this come up a few times already in my brief time on the list. A quick Google search turns up things like this for dealing with large data sets in R:

http://yusung.blogspot.com/2007/09/dealing-with-large-data-set-in-r.html

With most OO languages and use of accessors, you can hide a lot of things, and the data handler is free to return a value or values to you however makes sense - memory, disk, or even a socket is hidden. I'm amazed that R is general enough to allow package creators this freedom.

Just generally, memory management is a big problem even among computer people using "computer" languages (hard-core programming rather than something like R). People assume "well, gee, I made an array, it must be all in memory." Often, however, the OS gives you virtual memory, which is probably worse than having a file in terms of performance. One rule that is good to consider is to "act locally" - that is, try to operate only on adjacent data and do something like stream or block your input data. An R streaming IO class could potentially be very fast and give implementors a reason to think globally but act locally. As is probably apparent, it is easy for even stats and math tasks to become IO limited rather than CPU bound.

As an aside, you can use external utilities to split the file, if split files are OK to use with your R code; head and tail, for example, can isolate line ranges. In the past I've created indexes of line offsets and then used perl for random access, but I'm not sure how that would work with R.

> Thank you!

Thank google.

Mike Marchywka | V.P. Technology
415-264-8477
marchywka at phluant.com
Online Advertising and Analytics for Mobile
http://www.phluant.com
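For the head/tail idea, one way to try it without leaving R is to hand the shell pipeline to a pipe() connection; a rough sketch, assuming a Unix-like system with head and tail available and a tab-delimited "bigfile.txt" (both assumptions, not from the post):

## Start output at line 17,000,000 (i.e., skip 16,999,999 lines; adjust
## by one if the first line is a header) and keep the next 1,000,000 lines.
last_chunk <- read.table(
  pipe("tail -n +17000000 bigfile.txt | head -n 1000000"),
  sep = "\t", header = FALSE, stringsAsFactors = FALSE)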
On Fri, Oct 22, 2010 at 5:17 PM, Dimitri Liakhovitski
<dimitri.liakhovitski at gmail.com> wrote:
> I know I could figure it out empirically - but maybe based on your
> experience you can tell me if it's doable in a reasonable amount of
> time:
> I have a table (in .txt) with 17,000,000 rows (and 30 columns).
> I can't read it all in (there are many strings). So I thought I could
> read it in in parts (e.g., 1 million) using nrows= and skip.
> I was able to read in the first 1,000,000 rows no problem in 45 sec.
> But then I tried to skip 16,999,999 rows and then read in things. Then
> R crashed. Should I try again - or is it too many rows to skip for R?

You could try read.csv.sql in sqldf:

library(sqldf)
read.csv.sql("myfile.csv", skip = 1000, header = FALSE)

or

read.csv.sql("myfile.csv", sql = "select * from file limit 2000 offset 1000")

The first skips the first 1000 lines, including the header, and the second one skips 1000 rows (but still reads in the header) and then reads 2000 rows. You may or may not need to specify other arguments as well. For example, you may need to specify eol = "\n" or similar, depending on your line endings.

Unlike read.csv, read.csv.sql reads the data directly into an sqlite database (which it creates on the fly for you). The data does not go through R during this operation. From there it reads only the data you ask for into R, so R never sees the skipped-over data. After all that it automatically deletes the database.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
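If the goal is to walk the whole file chunk by chunk, the same idea extends to a loop over SQLite's limit/offset; a rough sketch (the chunk size and file name are placeholders, and note that each read.csv.sql call re-imports the file into a fresh temporary database, as described above):

library(sqldf)
chunk_size <- 1e6
offset <- 0
repeat {
  qry <- sprintf("select * from file limit %d offset %d", chunk_size, offset)
  chunk <- read.csv.sql("myfile.csv", sql = qry)
  if (nrow(chunk) == 0) break                # no rows left past this offset
  ## ... process `chunk` here ...
  offset <- offset + chunk_size
}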