I have a 100MB comma-separated file, and R takes several minutes to read it (via read.table()). This is R 1.9.0 on a Linux box with a couple of gigabytes of RAM. I am conjecturing that R is gc-ing, so maybe there is some command-line arg I can give it to convince it that I have a lot of space, or?!

Thanks!
Igor
There are hints in the R Data Import/Export Manual. Just checking: you _have_ read it?

-- Brian D. Ripley, ripley at stats.ox.ac.uk
I did read the Import/Export document. It is true that replacing read.table by read.csv and setting comment.char="" speeds things up somewhat (a factor of two?), but this is still very far from acceptable performance: some two orders of magnitude worse than SAS, whose IO is in turn much worse than that of the Unix utilities (awk, sort, and so on). Setting colClasses is suggested (and has been suggested by some in response to my question), but for a frame with some 60 columns this is a major nuisance.
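[For what it is worth, the colClasses bookkeeping can be kept to a couple of lines. A minimal sketch, assuming a hypothetical file big.csv in which most of the 60 columns are numeric and only a couple are character; the file name and the column positions here are made up:]

    # hypothetical layout: 60 columns, mostly numeric, a few character
    cc <- rep("numeric", 60)
    cc[c(1, 5)] <- "character"   # assumed positions of the non-numeric columns
    dat <- read.csv("big.csv", colClasses = cc, comment.char = "")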
Please don't make _your_ nuisance into others'. Do read the posting guide as suggested above. You have not provided any info for anyone to give you any useful advice beyond those you said you received. R is not all things to all people. If you are so annoyed, why not use SAS/awk/sort and so on?

[For my own education: How do you read the file into SAS without specifying column names and types?]

Andy
R's IO is indeed 20 - 50 times slower than that of equivalent C code no matter what you do, which has been a pain for some of us. It does, however, help to read the Import/Export tips, as without them the ratio gets much worse. As Gabor G. suggested in another mail, if you use the file repeatedly you can convert it into R's internal format: read.table once into R and save the result using save(); loading that is much faster.

In my experience R is not so good at large data sets, where "large" means roughly 10% of your RAM.
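[A minimal sketch of that read-once-then-save() pattern; the file names here are made up:]

    # pay the text-parsing cost once and cache the result in R's binary format
    big <- read.table("big.csv", sep = ",", header = TRUE, comment.char = "")
    save(big, file = "big.RData")

    # subsequent sessions reload the cached object instead of re-parsing the text
    load("big.RData")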
I was not particularly annoyed, just disappointed, since R seems like a much better thing than SAS in general, and doing everything with a combination of hand-rolled tools is too much work. However, I do need to work with very large data sets, and if it takes 20 minutes to read them in, I have to explore other options (one of which might be S-PLUS, which claims scalability as a major, er, PLUS over R).
I am working with data sets that have 2 matrices of 300 columns by 19,000 rows, and I manage to get the data loaded in a reasonable amount of time. Once it's in, I save the workspace and load from there. Once I start doing some work on the data, I am taking up about 600 MB of RAM out of the 1 GB I have in the computer. I will soon upgrade to 2 GB because I will have to work with an even larger data matrix soon. I must say that the speed of R, given what I have been doing, is acceptable.

Peter
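[The save-the-workspace variant of the same caching trick is just as short; a sketch:]

    # at the end of the session in which the large matrices were read and prepared
    save.image()      # writes every object in the workspace to .RData

    # a later session started in the same directory picks up .RData automatically,
    # or it can be restored explicitly:
    load(".RData")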
We need more details about your problem to provide any useful help. Are all the variables numeric? Are they all completely different? Is it possible to use `colClasses'? Also, having "a couple of gigabytes of RAM" is not necessarily useful if you're on a 32-bit OS, since the total process size is usually limited to be less than ~3GB. Believe it or not, complaints like these are not that common. 1998 was a long time ago!

-roger
It is amazing the amount of time that has been spent on this issue. In most cases, if you do some timing studies using 'scan', you will find that you can read some quite large data structures in a reasonable time. If your initial concern was having to wait 10 minutes to have your data read in, you could have read in quite a few data sets by now.

When comparing the speeds/feeds of processors, you also have to consider what is being done on them. Back in the "dark ages" we had a 1 MIP computer with 4M of memory handling input from 200 users on a transaction system. Today I need a 1GHz computer with 512M just to handle me. Now, true, I am doing a lot of different processing on it.

With respect to I/O, you have to consider what is being read in and how it is converted. Each system/program has different requirements. I have some applications (running on a laptop) that can read in approximately 100K rows of data per second (of course they are already binary). On the other hand, I can easily slow that down to 1K rows per second if I do not specify the correct parameters to 'read.table'.

So go back and take a look at what you are doing, and instrument your code to see where time is being spent. The nice thing about R is that there are a number of ways of approaching a solution, and if you don't like the timing of one way, try another. That is half the fun of using R.

James Holtman  "What is the problem you are trying to solve?"

From: rivin at euclid.math.temple.edu
To: p.dalgaard at biostat.ku.dk
cc: r-help at stat.math.ethz.ch, tplate at blackmesacapital.com
Subject: Re: [R] naive question, 06/30/2004 16:25

> <rivin at euclid.math.temple.edu> writes:
>
>> I did not use R ten years ago, but "reasonable" RAM amounts have
>> multiplied by roughly a factor of 10 (from 128MB to 1GB), CPU speeds
>> have gone up by a factor of 30 (from 90MHz to 3GHz), and disk space
>> availability has gone up probably by a factor of 10. So, unless the I/O
>> performance scales nonlinearly with size (a bit strange but not
>> inconsistent with my R experiments), I would think that things should
>> have gotten faster (by the wall clock, not slower). Of course, it is
>> possible that the other components of the R system have been worked on
>> more -- I am not equipped to comment...
>
> I think your RAM calculation is a bit off. In late 1993, 4MB systems
> were the standard PC, with 16 or 32 MB on high-end workstations.

I beg to differ. In 1989 the Mac II came standard with 8MB, and the NeXT came standard with 16MB. By 1994, 16MB was pretty much standard on good quality PCs (= Pentium, of which the 90MHz was the first example), with 32MB pretty common (though I suspect that most R/S-PLUS users were on Suns, which were somewhat more plushly equipped).

> Comparable figures today are probably 256MB for the entry-level PC and
> a couple GB in the high end. So that's more like a factor of 64. On the
> other hand, CPUs have changed by more than the clock speed; in
> particular, the number of clock cycles per FP calculation has
> decreased considerably and is currently less than one in some
> circumstances.

I think that FP performance has increased more than integer performance, which has pretty much kept pace with the clock speed. The compilers have also improved a bit...
Igor
As part of a continuing thread on the cost of loading large amounts of data into R, "Vadim Ogranovich" <vograno at evafunds.com> wrote:

    R's IO is indeed 20 - 50 times slower than that of equivalent C code
    no matter what you do, which has been a pain for some of us.

I wondered to myself just how bad R is at reading, when it is given a fair chance. So I performed an experiment.

My machine (according to "Workstation Info") is a SunBlade 100 with 640MB of physical memory running SunOS 5.9 Generic; according to fpversion this is an Ultra2e with the CPU clock running at 500MHz and the main memory clock running at 84MHz (wow, slow memory). R.version is

    platform sparc-sun-solaris2.9
    arch     sparc
    os       solaris2.9
    system   sparc, solaris2.9
    status
    major    1
    minor    9.0
    year     2004
    month    04
    day      12
    language R

and although this is a 64-bit machine, it's a 32-bit installation of R.

The experiment was this:

(1) I wrote a C program that generated 12500 rows of 800 columns; the numbers were integers 0..999,999,999 generated using drand48(). These numbers were written using printf(). It is possible to do quite a bit better by avoiding printf(), but that would ruin the spirit of the comparison, which is to see what can be done with *straightforward* code using *existing* library functions.

    21.7 user + 0.9 system = 22.6 cpu seconds; 109 real seconds.

    The sizes were chosen to get 100MB; the actual size was
    12500 (lines) 10000000 (words) 100012500 (bytes).

(2) I wrote a C program that read these numbers using scanf("%d"); it "knew" there were 800 numbers per row and 12500 rows in all. Again, it is possible to do better by avoiding scanf(), but the point is to look at *straightforward* code.

    18.4 user + 0.6 system = 19.0 cpu seconds; 100 real seconds.

(3) I started R, played around a bit doing other things, then issued this command:

    > system.time(xx <- read.table("/tmp/big.dat", header=FALSE, quote="",
    +     row.names=NULL, colClasses=rep("numeric",800), nrows=12500,
    +     comment.char=""))

    So how long _did_ it take to read 100MB on this machine?

    71.4 user + 2.2 system = 73.5 cpu seconds; 353 real seconds.

The result: the R/C ratio was less than 4, whether you measure cpu time or real time. It certainly wasn't anywhere near 20-50 times slower.

Of course, *binary* I/O in C *would* be quite a bit faster:

(1') generate the same integers but write a row at a time using fwrite():
    5 seconds cpu, 25 seconds real; 40 MB.

(2') read the same integers a row at a time using fread():
    0.26 seconds cpu, 1 second real.

This would appear to more than justify "20-50 times slower", but reading binary data and reading data in a textual representation are different things; "less than 4 times slower" is the fairer measure. However, it does emphasise the usefulness of problem-specific bulk reading techniques.

I thought I'd give you another R measurement:

    > system.time(xx <- read.table("/tmp/big.dat", header=FALSE))

But I got sick of waiting for it, and killed it after 843 cpu seconds, 3075 real seconds. Without knowing how far it had got, one can say no more than that this is at least 10 times slower than the more informed call to read.table.

What this tells me is that if you know something about the data that you _could_ tell read.table about, you do yourself no favour by keeping read.table in the dark. All those options are there for a reason, and it *will* pay to use them.
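[For completeness: R also has bulk binary transfer functions, so a rough R-side counterpart of the fwrite()/fread() comparison might look like the sketch below. It uses doubles rather than the integers above, and the file name is made up; it is an illustration of the idea only, not part of the timings reported here.]

    # write 12500 x 800 doubles as raw binary
    m <- matrix(runif(12500 * 800), nrow = 12500)
    con <- file("/tmp/big.bin", "wb")
    writeBin(as.vector(m), con)
    close(con)

    # read them back in a single call, then reshape; no text parsing involved
    con <- file("/tmp/big.bin", "rb")
    v <- readBin(con, what = "double", n = 12500 * 800)
    close(con)
    m2 <- matrix(v, nrow = 12500)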
Richard,

Thank you for the analysis. I don't think there is an inconsistency between the factor of 4 you've found in your example and the 20 - 50 I found in my data. I guess the major cause of the difference lies with the structure of your data set. Specifically, your test data set differs from mine in two respects:

* you have fewer lines, but each line contains many more fields (12500 * 800 in your case, 3.8M * 10 in mine);
* all of your data fields are doubles, not strings. I have a mixture of doubles and strings.

I posted a more technical message to r-devel where I discussed possible reasons for the IO slowness. One of them is that R is slow at making strings. So if you try to read your data as strings, colClasses=rep("character", 800), I'd guess you will see a very different timing. Even simple reshaping of your matrix, say making it (12500*80) rows by 10 columns, will worsen it considerably.

Please let me know the results if you try any of the above. In my message to r-devel you may also find some timing that supports my estimates.

Thanks,
Vadim
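[Should anyone want to run that comparison, a minimal sketch against Richard's /tmp/big.dat from the experiment above; the plain scan() line is thrown in as the bare-bones baseline James suggested earlier.]

    # same file: numeric vs. character colClasses, plus a plain scan() baseline
    t.num <- system.time(xx <- read.table("/tmp/big.dat", header = FALSE,
                         quote = "", row.names = NULL, comment.char = "",
                         colClasses = rep("numeric", 800), nrows = 12500))
    t.chr <- system.time(yy <- read.table("/tmp/big.dat", header = FALSE,
                         quote = "", row.names = NULL, comment.char = "",
                         colClasses = rep("character", 800), nrows = 12500))
    t.scan <- system.time(zz <- scan("/tmp/big.dat"))   # reads everything as doubles
    rbind(t.num, t.chr, t.scan)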