Dear r-helpers,
I know that there have already been enough questions on I/O performance
these last days, but I came across the following situation today. I was
comparing the performance of R with that of SAS's Risk Dimensions at
generating random "scenarios". My dataset --all numeric entries-- would
nicely fit into RAM and R would outperform SAS until... I wanted to
export the results to a .csv file using the write.table() function. For
reference, the output file was about 30 MB. Moreover, the memory
used by R increased sharply during the writing process.
I had a look at the code for the write.table() function and I found out
that, basically, it builds one very long text string from the data
using paste() and then prints it using writeLines(). Rprof() showed
that writeLines() accounted for a mere 3% of the computing time, the
rest being taken almost entirely by paste().
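
A sketch of how such a profile can be obtained with base R is given below. The data frame is a simulated stand-in (not the actual scenario data), the object names are made up for the example, and the breakdown by function will likely differ between R versions.

```r
# Simulated stand-in data: 100,000 rows x 10 numeric columns (an assumption,
# not the dataset discussed in this post).
scenarios <- as.data.frame(matrix(rnorm(1e6), ncol = 10))

out_file  <- tempfile(fileext = ".csv")  # CSV output goes to a temporary file
prof_file <- tempfile()                  # raw profiler samples are stored here

Rprof(prof_file)                         # start the sampling profiler
write.table(scenarios, file = out_file, sep = ",", row.names = FALSE)
Rprof(NULL)                              # stop profiling

# Summarise time per function. The tryCatch() guards against an unusable
# profile, e.g. if the write finished before any sample was taken.
prof <- tryCatch(summaryRprof(prof_file), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.self))
```

The `by.self` table lists the time spent inside each function itself, which is where the share taken by paste() versus writeLines() would show up.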
There are two directions in which performance could potentially be improved:
1.- Writing speed.
2.- Memory usage.
Regarding memory usage, I thought that perhaps a little rewriting of the
write.table() function could be considered: instead of building a single
long text string in RAM, the data frame could, with a little overhead,
be split into smaller chunks, each of which would be paste()-d into a
short "buffer" string and written sequentially to the output file, so
that the buffer memory can be recycled. (Note: I am completely ignorant
of R's memory recycling rules, and this might not work as intended
because of them.)
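
The chunking idea can be sketched without touching write.table() itself, by calling it repeatedly on slices of the data frame and sending the output to one open connection. The helper name write_table_chunked and its chunk_rows argument are hypothetical, and whether this actually lowers peak memory depends on R's garbage collector, so it would need to be measured.

```r
# Hypothetical helper (not part of R): write a data frame in row chunks so
# that only one chunk's worth of text has to be built at a time.
write_table_chunked <- function(df, file, chunk_rows = 10000L, sep = ",") {
  con <- file(file, open = "wt")
  on.exit(close(con))                      # always close the connection

  # Write the header line once, unquoted.
  writeLines(paste(names(df), collapse = sep), con)

  n <- nrow(df)
  starts <- if (n > 0L) seq(1L, n, by = chunk_rows) else integer(0)
  for (s in starts) {
    rows <- s:min(s + chunk_rows - 1L, n)  # row indices of this chunk
    # Writing to an already open connection continues where the previous
    # chunk ended, so no append = TRUE is needed.
    write.table(df[rows, , drop = FALSE], file = con, sep = sep,
                row.names = FALSE, col.names = FALSE, quote = FALSE)
  }
  invisible(file)
}

# Usage on a tiny made-up data frame, with a deliberately small chunk size.
df <- data.frame(a = c(1.5, 2, 3.25, 4, 5), b = c(10, 20, 30, 40, 50))
chunked_file <- tempfile(fileext = ".csv")
write_table_chunked(df, chunked_file, chunk_rows = 2L)
```

The chunk size trades memory for the per-call overhead of write.table(); the default of 10000 rows is an arbitrary choice.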
Regarding speed, I see little hope as long as paste() is implicitly
called by write.table(). Most likely, its execution time scales linearly
with the number of lines in the data frame, so splitting it would yield
no benefit. Are there any hints on how a performance improvement could
be achieved (other than by linking external, ad hoc C code)? Do we
really need to go through paste()? Would it perhaps be beneficial to
include in R some specialized functions that achieve high output
performance for writing out, say, only numeric values (this happens to
be the case for me most of the time)?
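
As an illustration of what a numeric-only path might look like using only base R, the sketch below hands a numeric matrix straight to write(), with no explicit paste() call. The helper name write_numeric_csv is hypothetical, and no speed claim is made: whether this beats write.table() would have to be timed. Note also that write() relies on cat(), which by default prints 7 significant digits (see options("digits")), so the output can be less precise than that of write.table().

```r
# Hypothetical helper (not part of R): dump a purely numeric matrix as
# comma-separated rows, without column names.
write_numeric_csv <- function(m, file) {
  stopifnot(is.matrix(m), is.numeric(m))
  # write() consumes its input as a vector and starts a new line every
  # 'ncolumns' values; transposing makes that vector run row by row.
  write(t(m), file = file, ncolumns = ncol(m), sep = ",")
  invisible(file)
}

# Usage on a small made-up matrix.
m <- matrix(c(1.5, 2.25, 3, 4, 5.125, 6), nrow = 2)
numeric_file <- tempfile(fileext = ".csv")
write_numeric_csv(m, numeric_file)
```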
Sorry for the long posting.
Carlos J. Gil Bellosta