Hi,

I have made a tiny package for saving dataframes in ASCII format. The package contains the functions save.table() and save.delim(); the first mimics (not completely) write.table(), and the second differs only in its default values, which suit read.delim().

The reason I have written the functions is that I have had problems with saving large dataframes in ASCII form. write.table() essentially makes a huge string in memory from the dataframe. I am not sure about write.matrix() (in MASS), but in my experience it is also too memory-intensive. My approach was to write the whole thing in C so that the function takes the values from the dataframe one scalar value at a time and writes them immediately to the file. This, of course, puts certain limitations on the contents of the dataframe and on the output format.

Here is an example of the result:

> dim(e2000)
[1] 7505 1197
> library(savetable)
> system.time(save.table(e2000, "e2000"))
[1] 38.04 0.48 48.75 0.00 0.00
> library(MASS)
> system.time(write.matrix(e2000, "e2000", sep=",", 1))

-- killed after 10 minutes swapping.

And now a smaller example:

> dim(e2000s)
[1] 100 1197
> library(savetable)
> system.time(save.table(e2000s, "e2000s"))
[1] 0.45 0.00 0.56 0.00 0.00
> system.time(write.table(e2000s, "e2000s"))
[1] 31.21 0.11 38.99 0.00 0.00
> library(MASS)
> system.time(write.matrix(e2000s, "e2000s", sep=",", 1))
[1] 4.01 0.66 5.45 0.00 0.00

None of the functions started swapping this time, but as you can see, save.table() is still around 10 times as fast as write.matrix(). The examples were run on my 128MB PII-400 Linux system with R 1.4.0.

I am not sure if there is much interest in such a package, so I put it on my own website instead of CRAN (http://www.obs.ee/~siim/savetable_0.1.0.tar.gz). Any feedback is appreciated.

Many thanks to Brian Ripley and the others who helped me with accessing R objects in C.

Best wishes,

Ott Toomet

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
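
The streaming idea described in the message above (write each value as soon as it is read, instead of first building one large string) can be sketched at the R level. This is only an illustration under stated assumptions: stream_table() is a hypothetical helper, not part of the savetable package, whose actual implementation is in C and is not shown in this thread. The sketch writes one row at a time through an open connection, so it will be far slower than a C loop, but at most one formatted row is held in memory at a time.

```r
# Hypothetical helper (NOT the savetable API): stream a data frame to a
# text file one row at a time, so the formatted output is never held in
# memory as a single large string.
stream_table <- function(x, file, sep = " ") {
  con <- file(file, open = "w")
  on.exit(close(con))

  # Header line with the column names.
  writeLines(paste(names(x), collapse = sep), con)

  for (i in seq_len(nrow(x))) {
    # Convert the i-th value of every column to a string.
    row <- vapply(x, function(col) as.character(col[[i]]), character(1))
    writeLines(paste(row, collapse = sep), con)
  }
  invisible(nrow(x))
}

# Usage on a small illustrative data frame.
d <- data.frame(a = 1:3, b = c("x", "y", "z"))
f <- tempfile()
n <- stream_table(d, f)
back <- read.table(f, header = TRUE)
```

Values are written unquoted, which mirrors the limitation mentioned in the message: fields that contain the separator would need extra handling.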
ripley@stats.ox.ac.uk
2002-Aug-10 18:54 UTC
[R] package for saving large datasets in ASCII
?write.matrix will tell you what you have overlooked: a sensible blocksize. If `I am not sure about write.matrix()', surely reading the help page is a first step?

On Sat, 10 Aug 2002, Ott Toomet wrote:

> [...] I am not sure about
> write.matrix() (in MASS), but in my practice it is too
> memory-intensive also.
> [...]
> > system.time(write.matrix(e2000, "e2000", sep=",", 1))
>
> -- killed after 10 minutes swapping.
> [...]

-- 
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
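
As a hedged illustration of the blocksize point: in current MASS releases the signature is write.matrix(x, file = "", sep = " ", blocksize), so a bare trailing 1 in the call would be taken as a block of a single row. The sketch below assumes that signature and uses made-up data and an arbitrary block size of 500 rows; the best value depends on the available memory and the width of the data.

```r
# Sketch: write a matrix in blocks of rows with MASS::write.matrix().
# The data, file name and block size are illustrative assumptions.
library(MASS)

x <- matrix(round(rnorm(2000 * 50), 3), nrow = 2000, ncol = 50)
colnames(x) <- paste0("v", seq_len(ncol(x)))

out <- tempfile(fileext = ".csv")

# Each block of 500 rows is formatted and appended to the file in turn,
# so only one block is held as formatted text at any time.
write.matrix(x, file = out, sep = ",", blocksize = 500)

back <- read.csv(out)
```

Comparing system.time() for a few block sizes on the real data is probably the simplest way to pick one.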
Ott,

I think "save.table" is a great idea! I have a couple of minor suggestions:

1) Personally my defaults would be sep="\t", na="NA", and col.names=(!append), that is, do not repeat column names if you're appending to an existing file (which presumably already has them at the top).

2) An option "digits" would be great, allowing you to specify the maximum number of digits after the decimal point. "digits" might be a single number (applied to all numeric fields), a vector (length = number of columns in x), or a list (whose names correspond to the names of the columns of x you want to influence). The C code would simply round numbers to that many places; see src/nmath/fround.c for code that already does this.

3) I happen to like lists that are not officially dataframes; I'm glad to see that save.table works just fine on these (you really test for "list", not "dataframe").

I'd be willing to work on #2 with you if you'd like.
-- 
              -- David Brahm (brahm at alum.mit.edu)
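
Suggestions 1) and 2) can be sketched in plain R on top of write.table(). The wrapper name save_rounded() and its argument handling are assumptions made for illustration, not the save.table() interface. Only the single-number form of "digits" is covered; the vector and list forms are left out.

```r
# Hypothetical wrapper (NOT the save.table() interface) showing the
# suggested defaults: tab separator, "NA" for missing values, and column
# names written only when not appending.
save_rounded <- function(x, file, digits = NULL, append = FALSE,
                         sep = "\t", na = "NA", col.names = !append) {
  if (!is.null(digits)) {
    # Round every numeric column to the requested number of decimals.
    num <- vapply(x, is.numeric, logical(1))
    x[num] <- lapply(x[num], round, digits = digits)
  }
  write.table(x, file = file, append = append, sep = sep, na = na,
              col.names = col.names, row.names = FALSE, quote = FALSE)
}

# Usage: the second call appends rows without repeating the header.
d <- data.frame(a = c(1.23456, NA), b = c("x", "y"))
f <- tempfile()
save_rounded(d, f, digits = 2)
save_rounded(d, f, digits = 2, append = TRUE)
out_lines <- readLines(f)
```

Making col.names default to !append means a caller who appends does not have to remember to switch the header off.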