I have a 100MB comma-separated file, and R takes several minutes to read it (via read.table()). This is R 1.9.0 on a Linux box with a couple of gigabytes of RAM. I am conjecturing that R is gc-ing, so maybe there is some command-line arg I can give it to convince it that I have a lot of space, or?!

Thanks!
Igor
There are hints in the R Data Import/Export Manual. Just checking: you _have_ read it?

-- Brian D. Ripley, ripley at stats.ox.ac.uk
I did read the Import/Export document. It is true that replacing read.table by read.csv and setting comment.char="" speeds things up somewhat (a factor of two?), but this is still very far from acceptable performance: some two orders of magnitude worse than SAS, whose IO is in turn much worse than that of the Unix utilities (awk, sort, and so on). Setting colClasses is suggested (and has been suggested by some in response to my question), but for a frame with some 60 columns this is a major nuisance.
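[For what it is worth, the colClasses bookkeeping can be kept to a couple of lines. A minimal sketch, assuming a hypothetical file big.csv in which most of the 60 columns are numeric and only a couple are character; the file name and the column positions here are made up:]

    # hypothetical layout: 60 columns, mostly numeric, a few character
    cc <- rep("numeric", 60)
    cc[c(1, 5)] <- "character"   # assumed positions of the non-numeric columns
    dat <- read.csv("big.csv", colClasses = cc, comment.char = "")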
Please don't make _your_ nuisance into others'. Do read the posting guide as suggested above. You have not provided any info for anyone to give you any useful advice beyond those you said you received. R is not all things to all people. If you are so annoyed, why not use SAS/awk/sort and so on?

[For my own education: How do you read the file into SAS without specifying column names and types?]

Andy
R's IO is indeed 20 - 50 times slower than that of equivalent C code no matter what you do, which has been a pain for some of us. It does, however, help to read the Import/Export tips, as without them the ratio gets much worse. As Gabor G. suggested in another mail, if you use the file repeatedly you can convert it into R's internal format: read.table once into R and save the result using save(); loading that is much faster.

In my experience R is not so good at large data sets, where "large" means roughly 10% of your RAM.
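[A minimal sketch of that read-once-then-save() pattern; the file names here are made up:]

    # pay the text-parsing cost once and cache the result in R's binary format
    big <- read.table("big.csv", sep = ",", header = TRUE, comment.char = "")
    save(big, file = "big.RData")

    # subsequent sessions reload the cached object instead of re-parsing the text
    load("big.RData")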
I was not particularly annoyed, just disappointed, since R seems like a much better thing than SAS in general, and doing everything with a combination of hand-rolled tools is too much work. However, I do need to work with very large data sets, and if it takes 20 minutes to read them in, I have to explore other options (one of which might be S-PLUS, which claims scalability as a major, er, PLUS over R).
I am working with data sets that have 2 matrices of 300 columns by 19,000 rows, and I manage to get the data loaded in a reasonable amount of time. Once it's in, I save the workspace and load from there. Once I start doing some work on the data, I am taking up about 600 MB of RAM out of the 1 GB I have in the computer. I will soon upgrade to 2 GB because I will have to work with an even larger data matrix soon. I must say that the speed of R, given what I have been doing, is acceptable.

Peter
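[The save-the-workspace variant of the same caching trick is just as short; a sketch:]

    # at the end of the session in which the large matrices were read and prepared
    save.image()      # writes every object in the workspace to .RData

    # a later session started in the same directory picks up .RData automatically,
    # or it can be restored explicitly:
    load(".RData")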
We need more details about your problem to provide any useful help. Are all the variables numeric? Are they all completely different? Is it possible to use `colClasses'? Also, having "a couple of gigabytes of RAM" is not necessarily useful if you're on a 32-bit OS, since the total process size is usually limited to be less than ~3GB. Believe it or not, complaints like these are not that common. 1998 was a long time ago!

-roger
It is amazing the amount of time that has been spent on this issue. In most cases, if you do some timing studies using 'scan', you will find that you can read some quite large data structures in a reasonable time. If your initial concern was having to wait 10 minutes to have your data read in, you could have read in quite a few data sets by now.

When comparing the speeds/feeds of processors, you also have to consider what is being done on them. Back in the "dark ages" we had a 1 MIP computer with 4M of memory handling input from 200 users on a transaction system. Today I need a 1GHz computer with 512M just to handle me. Now, true, I am doing a lot of different processing on it.

With respect to I/O, you have to consider what is being read in and how it is converted. Each system/program has different requirements. I have some applications (running on a laptop) that can read in approximately 100K rows of data per second (of course they are already binary). On the other hand, I can easily slow that down to 1K rows per second if I do not specify the correct parameters to 'read.table'.

So go back and take a look at what you are doing, and instrument your code to see where time is being spent. The nice thing about R is that there are a number of ways of approaching a solution, and if you don't like the timing of one way, try another. That is half the fun of using R.

James Holtman  "What is the problem you are trying to solve?"

From: rivin at euclid.math.temple.edu
To: p.dalgaard at biostat.ku.dk
cc: r-help at stat.math.ethz.ch, tplate at blackmesacapital.com
Subject: Re: [R] naive question, 06/30/2004 16:25

> <rivin at euclid.math.temple.edu> writes:
>
>> I did not use R ten years ago, but "reasonable" RAM amounts have
>> multiplied by roughly a factor of 10 (from 128MB to 1GB), CPU speeds
>> have gone up by a factor of 30 (from 90MHz to 3GHz), and disk space
>> availability has gone up probably by a factor of 10. So, unless the I/O
>> performance scales nonlinearly with size (a bit strange but not
>> inconsistent with my R experiments), I would think that things should
>> have gotten faster (by the wall clock, not slower). Of course, it is
>> possible that the other components of the R system have been worked on
>> more -- I am not equipped to comment...
>
> I think your RAM calculation is a bit off. In late 1993, 4MB systems
> were the standard PC, with 16 or 32 MB on high-end workstations.

I beg to differ. In 1989 the Mac II came standard with 8MB, and the NeXT came standard with 16MB. By 1994, 16MB was pretty much standard on good quality PCs (= Pentium, of which the 90MHz was the first example), with 32MB pretty common (though I suspect that most R/S-PLUS users were on Suns, which were somewhat more plushly equipped).

> Comparable figures today are probably 256MB for the entry-level PC and
> a couple GB in the high end. So that's more like a factor of 64. On the
> other hand, CPUs have changed by more than the clock speed; in
> particular, the number of clock cycles per FP calculation has
> decreased considerably and is currently less than one in some
> circumstances.

I think that FP performance has increased more than integer performance, which has pretty much kept pace with the clock speed. The compilers have also improved a bit...
Igor
As part of a continuing thread on the cost of loading large amounts of data into R, "Vadim Ogranovich" <vograno at evafunds.com> wrote:

    R's IO is indeed 20 - 50 times slower than that of equivalent C code
    no matter what you do, which has been a pain for some of us.

I wondered to myself just how bad R is at reading, when it is given a fair chance. So I performed an experiment.

My machine (according to "Workstation Info") is a SunBlade 100 with 640MB of physical memory running SunOS 5.9 Generic; according to fpversion this is an Ultra2e with the CPU clock running at 500MHz and the main memory clock running at 84MHz (wow, slow memory). R.version is

    platform sparc-sun-solaris2.9
    arch     sparc
    os       solaris2.9
    system   sparc, solaris2.9
    status
    major    1
    minor    9.0
    year     2004
    month    04
    day      12
    language R

and although this is a 64-bit machine, it's a 32-bit installation of R.

The experiment was this:

(1) I wrote a C program that generated 12500 rows of 800 columns; the numbers were integers 0..999,999,999 generated using drand48(). These numbers were written using printf(). It is possible to do quite a bit better by avoiding printf(), but that would ruin the spirit of the comparison, which is to see what can be done with *straightforward* code using *existing* library functions.

    21.7 user + 0.9 system = 22.6 cpu seconds; 109 real seconds.

    The sizes were chosen to get 100MB; the actual size was
    12500 (lines) 10000000 (words) 100012500 (bytes).

(2) I wrote a C program that read these numbers using scanf("%d"); it "knew" there were 800 numbers per row and 12500 rows in all. Again, it is possible to do better by avoiding scanf(), but the point is to look at *straightforward* code.

    18.4 user + 0.6 system = 19.0 cpu seconds; 100 real seconds.

(3) I started R, played around a bit doing other things, then issued this command:

    > system.time(xx <- read.table("/tmp/big.dat", header=FALSE, quote="",
    +     row.names=NULL, colClasses=rep("numeric",800), nrows=12500,
    +     comment.char=""))

    So how long _did_ it take to read 100MB on this machine?

    71.4 user + 2.2 system = 73.5 cpu seconds; 353 real seconds.

The result: the R/C ratio was less than 4, whether you measure cpu time or real time. It certainly wasn't anywhere near 20-50 times slower.

Of course, *binary* I/O in C *would* be quite a bit faster:

(1') generate the same integers but write a row at a time using fwrite():
    5 seconds cpu, 25 seconds real; 40 MB.

(2') read the same integers a row at a time using fread():
    0.26 seconds cpu, 1 second real.

This would appear to more than justify "20-50 times slower", but reading binary data and reading data in a textual representation are different things; "less than 4 times slower" is the fairer measure. However, it does emphasise the usefulness of problem-specific bulk reading techniques.

I thought I'd give you another R measurement:

    > system.time(xx <- read.table("/tmp/big.dat", header=FALSE))

But I got sick of waiting for it, and killed it after 843 cpu seconds, 3075 real seconds. Without knowing how far it had got, one can say no more than that this is at least 10 times slower than the more informed call to read.table.

What this tells me is that if you know something about the data that you _could_ tell read.table about, you do yourself no favour by keeping read.table in the dark. All those options are there for a reason, and it *will* pay to use them.
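[For completeness: R also has bulk binary transfer functions, so a rough R-side counterpart of the fwrite()/fread() comparison might look like the sketch below. It uses doubles rather than the integers above, and the file name is made up; it is an illustration of the idea only, not part of the timings reported here.]

    # write 12500 x 800 doubles as raw binary
    m <- matrix(runif(12500 * 800), nrow = 12500)
    con <- file("/tmp/big.bin", "wb")
    writeBin(as.vector(m), con)
    close(con)

    # read them back in a single call, then reshape; no text parsing involved
    con <- file("/tmp/big.bin", "rb")
    v <- readBin(con, what = "double", n = 12500 * 800)
    close(con)
    m2 <- matrix(v, nrow = 12500)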
Richard,

Thank you for the analysis. I don't think there is an inconsistency between the factor of 4 you've found in your example and the 20 - 50 I found in my data. I guess the major cause of the difference lies with the structure of your data set. Specifically, your test data set differs from mine in two respects:

* you have fewer lines, but each line contains many more fields (12500 * 800 in your case, 3.8M * 10 in mine);
* all of your data fields are doubles, not strings. I have a mixture of doubles and strings.

I posted a more technical message to r-devel where I discussed possible reasons for the IO slowness. One of them is that R is slow at making strings. So if you try to read your data as strings, colClasses=rep("character", 800), I'd guess you will see a very different timing. Even simple reshaping of your matrix, say making it (12500*80) rows by 10 columns, will worsen it considerably.

Please let me know the results if you try any of the above. In my message to r-devel you may also find some timing that supports my estimates.

Thanks,
Vadim
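[Should anyone want to run that comparison, a minimal sketch against Richard's /tmp/big.dat from the experiment above; the plain scan() line is thrown in as the bare-bones baseline James suggested earlier.]

    # same file: numeric vs. character colClasses, plus a plain scan() baseline
    t.num <- system.time(xx <- read.table("/tmp/big.dat", header = FALSE,
                         quote = "", row.names = NULL, comment.char = "",
                         colClasses = rep("numeric", 800), nrows = 12500))
    t.chr <- system.time(yy <- read.table("/tmp/big.dat", header = FALSE,
                         quote = "", row.names = NULL, comment.char = "",
                         colClasses = rep("character", 800), nrows = 12500))
    t.scan <- system.time(zz <- scan("/tmp/big.dat"))   # reads everything as doubles
    rbind(t.num, t.chr, t.scan)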