Hello,

I frequently have to export a large quantity of data from some source (for example, a database, or a hand-written perl script) and then read it into R. This occasionally takes a lot of time; I'm usually using read.table("filename", comment.char="", quote="") to read the data once it is written to disk.

However, I *know* that the program that generates the data is more or less just calling printf in a for loop to create the csv or tab-delimited file, writing it, and then having R parse it, which is pretty inefficient. Instead, I am interested in figuring out how to write the data in .RData format so that I can load() it instead of read.table() it.

Trawling the internet, however, has not turned up anything about the specification for an .RData file. Could somebody link me to a specification or some other information that would instruct me on how to construct a .RData file (either compressed or uncompressed)?

Also, I am open to other suggestions of how to get load()-like efficiency in some other way.

Many thanks,
Adam D. I. Kramer
The R 'save' format (as used for the saved workspace .RData) is described in the 'R Internals' manual (section 1.8). It is intended for R objects, and you would first have to create one[*] of those in your other application. That seems a lot of work.

The normal way to transfer numeric data between applications is to write a binary file: R can read such files with readBin(), and it also has wrappers/C code to read a number of common binary data formats (e.g. those from SPSS). With character data there are more issues (and more formats, see also readChar()), but load() is not particularly fast for those.

Ultimately the R functions pay a performance price for their flexibility, so hand-crafted C code to read the format can be worthwhile: but see the comments below about whether I/O speed is that important.

[*] the 'save' format is a serialization of a single R object, even if you save many objects, since the object(s) are combined into a pairlist.

On Sun, 8 Nov 2009, Adam D. I. Kramer wrote:

> Hello,
>
> I frequently have to export a large quantity of data from some
> source (for example, a database, or a hand-written perl script) and then
> read it into R. This occasionally takes a lot of time; I'm usually using
> read.table("filename",comment.char="",quote="") to read the data once it is
> written to disk.

Specifying colClasses and nrows will usually help. To read from a database, packages such as RODBC use binary data transfer: with suitable tuning this can be fast.

> However, I *know* that the program that generates the data is more
> or less just calling printf in a for loop to create the csv or tab-delimited
> file, writing, then having R parse it, which is pretty inefficient. Instead,
> I am interested in figuring out how to write the data in .RData
> format so that I can load() it instead of read.table() it.

Without more details it is hard to say if it is inefficient. read.table() can read data pretty fast (millions of items per second) if used following the hints in the 'R Data Import/Export' manual. See e.g.

https://stat.ethz.ch/pipermail/r-devel/2004-December/031733.html

Almost anything non-trivial one might do with such data is much slower. The trend is to write richer (and slower to read) data formats.

> Trolling the internet, however, has not suggested anything about the
> specification for an .RData file. Could somebody link me to a specification
> or some information that would instruct me on how to construct a .RData
> file (either compressed or uncompressed)?
>
> Also, I am open to other suggestions of how to get load()-like
> efficiency in some other way.
>
> Many thanks,
> Adam D. I. Kramer

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
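A rough, untested sketch of both of those suggestions follows; the file names, column classes and row counts are placeholders rather than anything taken from the thread:

## Hint 1: tell read.table() the column classes and (roughly) how many rows
## to expect, so it does not have to guess them while scanning the file.
dat <- read.table("filename",
                  colClasses = c("integer", "numeric", "character"),
                  nrows = 1e6,               # a generous estimate is fine
                  comment.char = "", quote = "")

## Hint 2: binary transfer with readBin(). Suppose the external program writes
## a column of n doubles in native byte order to "col1.bin"; writeBin() merely
## stands in for that external writer here.
n <- 1e6
writeBin(rnorm(n), "col1.bin", size = 8)      # pretend this is the exporter
col1 <- readBin("col1.bin", what = "double", n = n, size = 8)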
You can try read.csv.sql in the sqldf package. It reads a file into an sqlite database which it creates for you using RSQLite/sqlite, so effectively the reading is done outside of R. Then it extracts the portion you specify using an SQL statement and destroys the database. Omit the sql statement if you want the entire file.

I don't know if it's faster than read.table when used in that way, but it's only one line of code so you could easily try it. See example 13 on the home page: http://sqldf.googlecode.com

On Mon, Nov 9, 2009 at 12:27 AM, Adam D. I. Kramer <adik at ilovebacon.org> wrote:
> Hello,
>
> I frequently have to export a large quantity of data from some
> source (for example, a database, or a hand-written perl script) and then
> read it into R. This occasionally takes a lot of time; I'm usually using
> read.table("filename",comment.char="",quote="") to read the data once it is
> written to disk.
>
> However, I *know* that the program that generates the data is more
> or less just calling printf in a for loop to create the csv or tab-delimited
> file, writing, then having R parse it, which is pretty inefficient. Instead,
> I am interested in figuring out how to write the data in .RData
> format so that I can load() it instead of read.table() it.
>
> Trolling the internet, however, has not suggested anything about the
> specification for an .RData file. Could somebody link me to a specification
> or some information that would instruct me on how to construct a .RData
> file (either compressed or uncompressed)?
>
> Also, I am open to other suggestions of how to get load()-like
> efficiency in some other way.
>
> Many thanks,
> Adam D. I. Kramer
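For completeness, a sketch of that one-liner (untested; the file name and the filter in the sql argument are invented for illustration, and the csv is referred to as "file" inside the statement):

library(sqldf)

## Whole file: the default sql statement is "select * from file"
dat <- read.csv.sql("filename.csv")

## Or just a subset, filtered inside sqlite before it ever reaches R
sub <- read.csv.sql("filename.csv",
                    sql = "select * from file where a > 100")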
If you can manage to write out your data in separate binary files, one for each column, then another possibility is using package ff. You can link those binary columns into R by defining an ffdf data frame: columns are memory-mapped and you can access just those parts you need, without initially importing them. This is much faster than a csv import and also works for files that are too large to import at once.

If all your columns have the same storage.mode (vmode in ff), then another alternative is writing out all your data as one single binary matrix in major row order (because that can be written row by row from your program) and linking the file into R as a single ff matrix.

Since ffdf in ff is new, I give a mini-tutorial below. Let me know how that works for you.

Kind regards

Jens Oehlschlägel

library(ff)

# Create example csv
fnam <- "/tmp/example.csv"
write.csv(data.frame(a=1:9, b=1:9+0.1), file=fnam, row.names=FALSE)

# Create example binary files on disk.
# Reading a csv into an ffdf actually stores
# each column as a binary file on disk.
# Using a pattern outside fftempdir automatically sets finalizer="close"
# and thus makes those binary files permanent.
path <- "/tmp/example_"
x <- read.csv.ffdf(file=fnam, ff_args=list(pattern=path))
close(x)

# Note that a standard ffdf is made up column by column from simple ff objects.
# More complex mappings from ff objects into ffdf are possible,
# but let's keep it simple for now.
p <- physical(x)
p

# Now let's create an ffdf from existing binary files.
# Step one: create an ff object for each binary file (without reading them).
# Note that because we open ff files outside fftempdir,
# the default finalizer is "close", not "delete",
# so the files will not be deleted on finalization.
# Files are opened for memory mapping, but not read.
ffcols <- vector("list", length(p))
for (i in 1:length(p)){
  ffcols[[i]] <- ff(filename=filename(p[[i]]), vmode=vmode(p[[i]]))
}
ffcols

# Step two: bundle several ff objects into one ffdf data.frame
# (still without reading data)
ffdafr <- ffdf(a=ffcols[[1]], b=ffcols[[2]])

# Reading rows from this will now return a standard data.frame
# (and only read the required rows)
ffdafr[1:4,]
ffdafr[5:9,]

# As an alternative, create an example binary
# (double) matrix in major row order
y <- as.ff(t(ffdafr[,]), filename="/tmp/example_single_matrix.ff")

# Again we can link this existing binary file.
# If we know the size of the matrix we can do
z <- ff(filename=filename(y), vmode="double", dim=c(9,2), dimorder=c(2,1))
z
rm(z)

# If we only know the number of columns we can do
z <- ff(filename=filename(y), vmode="double")
# and set dim later
dim(z) <- c(length(z)/2, 2)
# Note that so far we have interpreted the file in major column order
z
# To interpret the file in major row order we set dimorder
# (a generalization for n-way arrays)
dimorder(z) <- c(2,1)
z

# Removing the ff objects will trigger the finalizers
# at the next garbage collection
rm(x, ffcols, ffdafr, y, z)
gc()

# Since we carefully selected the "close" finalizer,
# the files still exist
dir(path="/tmp", pattern="example_")

# Now remove them physically
unlink(file.path("/tmp", dir(path="/tmp", pattern="example_")))