I believe IO in R is slow because of the way it is implemented, not because it has to do some extra work for the user.

I compared scan() with the 'what' argument set (which is, AFAIK, the fastest way to read a CSV file) to equivalent C code. It turned out to be 20-50 times slower.

I can see at least two main reasons why R's IO is so slow (I didn't profile this, though):

A) It reads from a connection char-by-char rather than doing buffered reads. Reading each char requires a call to scanchar(), which then calls Rconn_fgetc() (with some non-trivial overhead). Rconn_fgetc(), for its part, is defined somewhere else (not in scan.c), so the call cannot be inlined, etc.

B) mkChar, which is used very extensively, is too slow. There are ways to minimize the number of calls to mkChar, but I won't expand on them in this message.

I brought this up because it seems that many people believe that the slowness is inherent and is a tradeoff for something else. I don't think this is the case.

Thanks,
Vadim

-----Original Message-----
From: r-help-bounces@stat.math.ethz.ch [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Douglas Bates
Sent: Tuesday, June 29, 2004 5:56 PM
To: Igor Rivin
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] naive question

Igor Rivin wrote:
> I was not particularly annoyed, just disappointed, since R seems like
> a much better thing than SAS in general, and doing everything with a
> combination of hand-rolled tools is too much work. However, I do need
> to work with very large data sets, and if it takes 20 minutes to read
> them in, I have to explore other options (one of which might be
> S-PLUS, which claims scalability as a major, er, PLUS over R).

If you are routinely working with very large data sets it would be worthwhile learning to use a relational database (PostgreSQL, MySQL, even Access) to store the data and then access it from R with RODBC or one of the specialized database packages.

R is slow reading ASCII files because it is assembling the meta-data on the fly and continually checking the types of the variables being read. If you know all this information and build it into your table definitions, reading the data will be much faster.

A disadvantage of this approach is the need to learn yet another language and system. I was going to do an example but found I could not because I left all my SQL books at home (I'm travelling at the moment) and I couldn't remember the particular commands for loading a table from an ASCII file.
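For context, a minimal sketch of the kind of plain C reader that the 20-50x comparison at the top of Vadim's message presumably involved; his actual code is not shown in the thread, and this version assumes purely numeric, comma-separated fields:

/* Hedged sketch of a buffered C CSV reader of the sort the comparison
 * above refers to (the actual code is not shown in the thread).
 * It reads whole lines with stdio's buffered fgets() and converts
 * numeric fields in place with strtod(), with no per-character calls. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s file.csv\n", argv[0]);
        return 1;
    }
    FILE *fp = fopen(argv[1], "r");
    if (fp == NULL) { perror("fopen"); return 1; }

    char line[65536];            /* assumes lines shorter than 64K */
    long nfields = 0;
    double sum = 0.0;

    while (fgets(line, sizeof line, fp) != NULL) {
        char *p = line;
        while (*p != '\0') {
            char *end;
            double v = strtod(p, &end);   /* convert one field */
            if (end != p) { sum += v; nfields++; }
            p = strchr(end, ',');         /* advance to the next field */
            if (p == NULL) break;
            p++;
        }
    }
    fclose(fp);
    printf("read %ld numeric fields, sum = %g\n", nfields, sum);
    return 0;
}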
"Vadim Ogranovich" <vograno@evafunds.com> writes:> I believe IO in R is slow because of the way it is implemented, not > because it has to do some extra work for the user. > > I compared scan() with 'what' argument set (which is, AFAIK, is the > fastest way to read a CSV file) to an equivalent C code. It turned out > to be 20 - 50 times slower. > I can see at least two main reasons why R's IO is so slow (I didn't > profile this though): > A) it reads from a connection char-by-char as opposed to the buffered > read. Reading each char requires a call to scanchar() which then calls > Rconn_fgetc() (with some non-trivial overhead). Rconn_fgetc() on its > part is defined somewhere else (not in scan.c) and therefore the call > can not be inlined, etc. > B) mkChar, which is used very extensively, is too slow. There are ways > to minimize the number of calls to mkChar, but I won't expand on it in > this message. > > I brought this up because it seems that many people believe that the > slowness is inherent and is a tradeoff for something else. I don't think > this is the case.Do you have some hard data on the relative importance of the above issues? I wouldn't think that R is really unbuffered, since there is buffering underlying the various fgetc() variants. Most C programs will do char-by-char processing by the same definition. The lack of inlining is sort of a consequence of a design where Rconn_fgetc() is switchable. However, conventional wisdom is that all of this tends to drown out compared to disk i/o. This might be a changing balance, but I think you're more on the mark with the mkChar issue. (Then again, it is quite a bit easier to come up with buffering designs for Rconn_fgetc than it is to redefine STRSXP...) -- O__ ---- Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk) FAX: (+45) 35327907
> -----Original Message-----
> From: Peter Dalgaard [mailto:p.dalgaard@biostat.ku.dk]
> Sent: Wednesday, June 30, 2004 3:10 AM
> To: Vadim Ogranovich
> Cc: r-devel@stat.math.ethz.ch
> Subject: Re: [Rd] Slow IO: was [R] naive question
>
> "Vadim Ogranovich" <vograno@evafunds.com> writes:
>
> > ...
> > I can see at least two main reasons why R's IO is so slow (I didn't
> > profile this, though):
> > A) It reads from a connection char-by-char rather than doing buffered
> > reads. Reading each char requires a call to scanchar(), which then
> > calls Rconn_fgetc() (with some non-trivial overhead). Rconn_fgetc(),
> > for its part, is defined somewhere else (not in scan.c), so the call
> > cannot be inlined, etc.
> > B) mkChar, which is used very extensively, is too slow.
> > ...
>
> Do you have some hard data on the relative importance of the
> above issues?

Well, here is a little analysis which sheds some light. I have a file, foo, 154M uncompressed, containing about 3.8M lines:

01/02% ls -l foo*
-rw-rw-r--    1 vograno  man     153797513 Jun 30 11:56 foo
-rw-rw-r--    1 vograno  man      21518547 Jun 30 11:56 foo.gz

# reading the files using standard UNIX utils takes no time
01/02% time cat foo > /dev/null
0.030u 0.110s 0:00.80 17.5%     0+0k 0+0io 124pf+0w
01/02% time zcat foo.gz > /dev/null
1.210u 0.030s 0:01.24 100.0%    0+0k 0+0io 90pf+0w

# compute exact line count
01/02% zcat foo.gz | wc
3794929 3794929 153797513

# now we fire up R-1.8.1
# we will experiment with the gzip-ed copy since we've seen that the
# overhead of decompression is trivial

> nlines <- 3794929

# this exercises scanchar(), but not mkChar(); see scan() in scan.c
> system.time(scan(gzfile("foo.gz", open="r"), what="character", skip = nlines - 1))
Read 1 items
[1] 67.83  0.01 68.04  0.00  0.00

# this exercises both scanchar() and mkChar()
> system.time(readLines(gzfile("foo.gz", open="r"), n = nlines))
[1] 110.61   0.83 112.44   0.00   0.00

It seems that scanchar() and mkChar() have comparable overheads in this case.

> ... This might be a changing balance, but I
> think you're more on the mark with the mkChar issue. (Then
> again, it is quite a bit easier to come up with buffering
> designs for Rconn_fgetc than it is to redefine STRSXP...)

First of all, I agree that redefining STRSXP is not easy, but it has the potential to considerably speed up R as a whole, since name propagation would work faster.

As to mkChar() in scan(), there are a few tricks that can help. Say we have a CSV file that contains categorical and numerical data. Here is what we can do to minimize the number of calls to mkChar:

* When reading the file in as a bunch of lines (before type conversion), do not call mkChar; rather, pre-allocate large temporary char * arrays (via R_alloc) and store the lines sequentially in the arrays. This allows us to read the file into memory with just a few, however expensive, calls to R_alloc. Here the arrays effectively serve as a heap which will be released by R at the end of the call.

* Field conversion:
  - When converting numeric fields there is no need to call mkChar at all (obvious).
  - When creating char fields that correspond to categorical data (going from the first element to the end), we can maintain a hash table mapping char* -> SEXP for the field values encountered so far. When we get a new field value we first look it up in the hash table, and if it is already there we use the corresponding SEXP to assign to the string element (sketched below).
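A rough C-level sketch of that caching scheme; cached_mkChar and the fixed-size table are illustrative only (this assumes compilation against R's headers, and a real version inside scan.c would also have to keep the cached CHARSXPs protected from the garbage collector and handle a full table):

/* Map field strings to the CHARSXP already created for them, so mkChar()
 * runs only once per distinct value ("factor level").  Open addressing
 * with linear probing; resizing and cleanup are glossed over. */
#include <string.h>
#include <Rinternals.h>

#define CACHE_SIZE 4096                 /* assumes few distinct levels */

typedef struct { const char *key; SEXP val; } CacheSlot;
static CacheSlot cache[CACHE_SIZE];

static unsigned str_hash(const char *s)
{
    unsigned h = 5381;
    while (*s) h = h * 33u + (unsigned char) *s++;
    return h;
}

static SEXP cached_mkChar(const char *field)
{
    unsigned i = str_hash(field) & (CACHE_SIZE - 1);
    while (cache[i].key != NULL) {
        if (strcmp(cache[i].key, field) == 0)
            return cache[i].val;        /* hit: no mkChar() call */
        i = (i + 1) & (CACHE_SIZE - 1);
    }
    cache[i].key = strdup(field);       /* sketch only: no resize or free */
    cache[i].val = mkChar(field);       /* miss: create the CHARSXP once */
    return cache[i].val;
}

The conversion loop would then use SET_STRING_ELT(ans, i, cached_mkChar(field)) where it would otherwise call mkChar(field) directly.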
This leads to a considerable speed-up in the common case where most field values are drawn from a small (< 1000) set of "factor levels".

And a final observation while we are on the scan() subject: I've found it more convenient to convert data column-by-column rather than row-by-row. When you do it column-by-column you:

* figure out the type of the column only once (ditto for the destination vector);
* maintain only one hash table for the current column, not one for all columns at once.

Thanks,
Vadim