I believe IO in R is slow because of the way it is implemented, not because it has to do some extra work for the user.

I compared scan() with the 'what' argument set (which is, AFAIK, the fastest way to read a CSV file) to equivalent C code. It turned out to be 20-50 times slower.

I can see at least two main reasons why R's IO is so slow (I didn't profile this, though):

A) It reads from a connection char-by-char rather than doing buffered reads. Reading each char requires a call to scanchar(), which then calls Rconn_fgetc() (with some non-trivial overhead). Rconn_fgetc(), for its part, is defined somewhere else (not in scan.c), so the call cannot be inlined, etc.

B) mkChar, which is used very extensively, is too slow. There are ways to minimize the number of calls to mkChar, but I won't expand on them in this message.

I brought this up because it seems that many people believe that the slowness is inherent and is a tradeoff for something else. I don't think this is the case.

Thanks,
Vadim

-----Original Message-----
From: r-help-bounces@stat.math.ethz.ch [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Douglas Bates
Sent: Tuesday, June 29, 2004 5:56 PM
To: Igor Rivin
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] naive question

Igor Rivin wrote:
> I was not particularly annoyed, just disappointed, since R seems like
> a much better thing than SAS in general, and doing everything with a
> combination of hand-rolled tools is too much work. However, I do need
> to work with very large data sets, and if it takes 20 minutes to read
> them in, I have to explore other options (one of which might be
> S-PLUS, which claims scalability as a major, er, PLUS over R).

If you are routinely working with very large data sets it would be worthwhile learning to use a relational database (PostgreSQL, MySQL, even Access) to store the data and then access it from R with RODBC or one of the specialized database packages.

R is slow reading ASCII files because it is assembling the meta-data on the fly and continually checking the types of the variables being read. If you know all this information and build it into your table definitions, reading the data will be much faster.

A disadvantage of this approach is the need to learn yet another language and system. I was going to do an example but found I could not because I left all my SQL books at home (I'm travelling at the moment) and I couldn't remember the particular commands for loading a table from an ASCII file.
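For context, a minimal sketch of the kind of plain C reader that the 20-50x comparison at the top of Vadim's message presumably involved; his actual code is not shown in the thread, and this version assumes purely numeric, comma-separated fields:

/* Hedged sketch of a buffered C CSV reader of the sort the comparison
 * above refers to (the actual code is not shown in the thread).
 * It reads whole lines with stdio's buffered fgets() and converts
 * numeric fields in place with strtod(), with no per-character calls. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s file.csv\n", argv[0]);
        return 1;
    }
    FILE *fp = fopen(argv[1], "r");
    if (fp == NULL) { perror("fopen"); return 1; }

    char line[65536];            /* assumes lines shorter than 64K */
    long nfields = 0;
    double sum = 0.0;

    while (fgets(line, sizeof line, fp) != NULL) {
        char *p = line;
        while (*p != '\0') {
            char *end;
            double v = strtod(p, &end);   /* convert one field */
            if (end != p) { sum += v; nfields++; }
            p = strchr(end, ',');         /* advance to the next field */
            if (p == NULL) break;
            p++;
        }
    }
    fclose(fp);
    printf("read %ld numeric fields, sum = %g\n", nfields, sum);
    return 0;
}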
"Vadim Ogranovich" <vograno@evafunds.com> writes:> I believe IO in R is slow because of the way it is implemented, not > because it has to do some extra work for the user. > > I compared scan() with 'what' argument set (which is, AFAIK, is the > fastest way to read a CSV file) to an equivalent C code. It turned out > to be 20 - 50 times slower. > I can see at least two main reasons why R's IO is so slow (I didn't > profile this though): > A) it reads from a connection char-by-char as opposed to the buffered > read. Reading each char requires a call to scanchar() which then calls > Rconn_fgetc() (with some non-trivial overhead). Rconn_fgetc() on its > part is defined somewhere else (not in scan.c) and therefore the call > can not be inlined, etc. > B) mkChar, which is used very extensively, is too slow. There are ways > to minimize the number of calls to mkChar, but I won't expand on it in > this message. > > I brought this up because it seems that many people believe that the > slowness is inherent and is a tradeoff for something else. I don't think > this is the case.Do you have some hard data on the relative importance of the above issues? I wouldn't think that R is really unbuffered, since there is buffering underlying the various fgetc() variants. Most C programs will do char-by-char processing by the same definition. The lack of inlining is sort of a consequence of a design where Rconn_fgetc() is switchable. However, conventional wisdom is that all of this tends to drown out compared to disk i/o. This might be a changing balance, but I think you're more on the mark with the mkChar issue. (Then again, it is quite a bit easier to come up with buffering designs for Rconn_fgetc than it is to redefine STRSXP...) -- O__ ---- Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk) FAX: (+45) 35327907
> -----Original Message-----
> From: Peter Dalgaard [mailto:p.dalgaard@biostat.ku.dk]
> Sent: Wednesday, June 30, 2004 3:10 AM
> To: Vadim Ogranovich
> Cc: r-devel@stat.math.ethz.ch
> Subject: Re: [Rd] Slow IO: was [R] naive question
>
> "Vadim Ogranovich" <vograno@evafunds.com> writes:
>
> > ...
> > I can see at least two main reasons why R's IO is so slow (I didn't
> > profile this, though):
> > A) It reads from a connection char-by-char rather than doing buffered
> > reads. Reading each char requires a call to scanchar(), which then
> > calls Rconn_fgetc() (with some non-trivial overhead). Rconn_fgetc(),
> > for its part, is defined somewhere else (not in scan.c), so the call
> > cannot be inlined, etc.
> > B) mkChar, which is used very extensively, is too slow.
> > ...
>
> Do you have some hard data on the relative importance of the
> above issues?

Well, here is a little analysis which sheds some light. I have a file, foo, 154M uncompressed, containing about 3.8M lines:

01/02% ls -l foo*
-rw-rw-r--    1 vograno  man     153797513 Jun 30 11:56 foo
-rw-rw-r--    1 vograno  man      21518547 Jun 30 11:56 foo.gz

# reading the files using standard UNIX utils takes no time
01/02% time cat foo > /dev/null
0.030u 0.110s 0:00.80 17.5%     0+0k 0+0io 124pf+0w
01/02% time zcat foo.gz > /dev/null
1.210u 0.030s 0:01.24 100.0%    0+0k 0+0io 90pf+0w

# compute exact line count
01/02% zcat foo.gz | wc
3794929 3794929 153797513

# now we fire up R-1.8.1
# we will experiment with the gzip-ed copy since we've seen that the
# overhead of decompression is trivial

> nlines <- 3794929

# this exercises scanchar(), but not mkChar(); see scan() in scan.c
> system.time(scan(gzfile("foo.gz", open="r"), what="character", skip = nlines - 1))
Read 1 items
[1] 67.83  0.01 68.04  0.00  0.00

# this exercises both scanchar() and mkChar()
> system.time(readLines(gzfile("foo.gz", open="r"), n = nlines))
[1] 110.61   0.83 112.44   0.00   0.00

It seems that scanchar() and mkChar() have comparable overheads in this case.

> ... This might be a changing balance, but I
> think you're more on the mark with the mkChar issue. (Then
> again, it is quite a bit easier to come up with buffering
> designs for Rconn_fgetc than it is to redefine STRSXP...)

First of all, I agree that redefining STRSXP is not easy, but it has the potential to considerably speed up R as a whole, since name propagation would work faster.

As to mkChar() in scan(), there are a few tricks that can help. Say we have a CSV file that contains categorical and numerical data. Here is what we can do to minimize the number of calls to mkChar:

* When reading the file in as a bunch of lines (before type conversion), do not call mkChar; rather, pre-allocate large temporary char * arrays (via R_alloc) and store the lines sequentially in the arrays. This allows us to read the file into memory with just a few, however expensive, calls to R_alloc. Here the arrays effectively serve as a heap which will be released by R at the end of the call.

* Field conversion:
  - When converting numeric fields there is no need to call mkChar at all (obvious).
  - When creating char fields that correspond to categorical data (going from the first element to the end), we can maintain a hash table mapping char* -> SEXP for the field values encountered so far. When we get a new field value we first look it up in the hash table, and if it is already there we use the corresponding SEXP to assign to the string element (sketched below).
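A rough C-level sketch of that caching scheme; cached_mkChar and the fixed-size table are illustrative only (this assumes compilation against R's headers, and a real version inside scan.c would also have to keep the cached CHARSXPs protected from the garbage collector and handle a full table):

/* Map field strings to the CHARSXP already created for them, so mkChar()
 * runs only once per distinct value ("factor level").  Open addressing
 * with linear probing; resizing and cleanup are glossed over. */
#include <string.h>
#include <Rinternals.h>

#define CACHE_SIZE 4096                 /* assumes few distinct levels */

typedef struct { const char *key; SEXP val; } CacheSlot;
static CacheSlot cache[CACHE_SIZE];

static unsigned str_hash(const char *s)
{
    unsigned h = 5381;
    while (*s) h = h * 33u + (unsigned char) *s++;
    return h;
}

static SEXP cached_mkChar(const char *field)
{
    unsigned i = str_hash(field) & (CACHE_SIZE - 1);
    while (cache[i].key != NULL) {
        if (strcmp(cache[i].key, field) == 0)
            return cache[i].val;        /* hit: no mkChar() call */
        i = (i + 1) & (CACHE_SIZE - 1);
    }
    cache[i].key = strdup(field);       /* sketch only: no resize or free */
    cache[i].val = mkChar(field);       /* miss: create the CHARSXP once */
    return cache[i].val;
}

The conversion loop would then use SET_STRING_ELT(ans, i, cached_mkChar(field)) where it would otherwise call mkChar(field) directly.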
This leads to a considerable speed-up in the common case where most field values are drawn from a small (< 1000) set of "factor levels".

And a final observation while we are on the scan() subject: I've found it more convenient to convert data column-by-column rather than row-by-row. When you do it column-by-column you:

* figure out the type of the column only once (ditto for the destination vector);
* maintain only one hash table for the current column, not one for all columns at once.

Thanks,
Vadim