At present the example data sets in R libraries are to be given as
expressions that can be read directly into R.  For example, the acid.R
file in the main library looks like

    acid <- data.frame(
        carb = c(0.1, 0.3, 0.5, 0.6, 0.7, 0.9),
        optden = c(0.086, 0.269, 0.446, 0.538, 0.626, 0.782),
        row.names = paste(1:6))

This is great when you have only a few observations.  I have one
example data set with over 9000 rows and 17 variables.  Even when I
set -v 40, I exhaust the available memory trying to read it in as a
data.frame.  I believe this is because of the recursive nature of the
parsing of data objects.

Are there alternatives that would cause less memory usage?  In
S/S-PLUS the data.dump/data.restore functions use a portable
representation that can be parsed without exponential memory growth.

--
Douglas Bates                        bates@stat.wisc.edu
Statistics Department                608/262-2598
University of Wisconsin - Madison    http://www.stat.wisc.edu/~bates/
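P.S.  A hypothetical reproduction of the problem, in case it helps
(all object and file names here are made up):

    ## Write a 9000 x 17 numeric data set out as one big
    ## data.frame() expression, then try to read it back in.
    m <- matrix(round(runif(9000 * 17), 3), ncol = 17)
    cat("big <- data.frame(\n",
        paste("x", 1:17, " = c(",
              apply(m, 2, paste, collapse = ", "), ")",
              sep = "", collapse = ",\n"),
        ")\n", file = "big.R")
    source("big.R")  # parsing this single expression exhausts the heap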
On 24 Feb 1998, Douglas Bates wrote:

> At present the example data sets in R libraries are to be given as
> expressions that can be read directly into R.  For example, the acid.R
> file in the main library looks like
>
>     acid <- data.frame(
>         carb = c(0.1, 0.3, 0.5, 0.6, 0.7, 0.9),
>         optden = c(0.086, 0.269, 0.446, 0.538, 0.626, 0.782),
>         row.names = paste(1:6))
>
> This is great when you have only a few observations.  I have one
> example data set with over 9000 rows and 17 variables.  Even when I
> set -v 40, I exhaust the available memory trying to read it in as a
> data.frame.

You need to specify -n some_large_number to read in large data sets;
specifying -v is not enough.  You can see this by using gcinfo(T) to
report heap and cons cell usage at each garbage collection.

> Are there alternatives that would cause less memory usage?  In
> S/S-PLUS the data.dump/data.restore functions use a portable
> representation that can be parsed without exponential memory growth.

The R save() format is portable, at least among Unices.  You could
have the data.R file contain the command

    eval(load("data.Rdata"), .GlobalEnv)

where "data.Rdata" is the saved file.  There is an ascii = T option,
which might make the file more portable to other operating systems.
I haven't checked, but I assume that this format can be read more
efficiently than sourcing R code.

    -thomas
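P.S.  In concrete terms the recipe might look as follows; the flag
values are only a guess for your 9000 x 17 set, and the save() call is
written with the argument names I would expect, so do check them:

    ## Start R with more cons cells (-n) as well as a bigger vector
    ## heap (-v, in Mb), e.g.
    ##     R -v 40 -n 1000000

    ## In that session, watch the collector while building the data
    ## frame, then save the object once:
    gcinfo(T)
    save(acid, file = "data.Rdata", ascii = T)

    ## data.R then needs to contain only the single command
    eval(load("data.Rdata"), .GlobalEnv)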
>>>>> "DougB" == Douglas Bates <bates@stat.wisc.edu> writes:

  DougB> At present the example data sets in R libraries are to be
  DougB> given as expressions that can be read directly into R.  For
  DougB> example, the acid.R file in the main library looks like

  DougB>     acid <- data.frame(
  DougB>         carb = c(0.1, 0.3, 0.5, 0.6, 0.7, 0.9),
  DougB>         optden = c(0.086, 0.269, 0.446, 0.538, 0.626, 0.782),
  DougB>         row.names = paste(1:6))

  DougB> This is great when you have only a few observations.  I have
  DougB> one example data set with over 9000 rows and 17 variables.
  DougB> Even when I set -v 40, I exhaust the available memory trying
  DougB> to read it in as a data.frame.  I believe this is because of
  DougB> the recursive nature of the parsing of data objects.

Yes.

  DougB> Are there alternatives that would cause less memory usage?

Yes, but only in the 0.62 development version.  The current 0.62
``standard'' is:

 - if a 'data' file ends in .R, source(.) is used to read it;
 - if it ends in .tab, read.table(..., header = TRUE) is used to
   read it.

(You find the new data(.) function in src/library/base/data in the
R snapshot.)

Note that this is still not really satisfactory for large data files,
since read.table(.) is not really efficient: it first reads everything
as a character matrix and then converts variable by variable, some to
numeric, some to factor.

On the other hand: does it really make sense to distribute huge
example data sets such as yours?  If yes, AND if you have only numeric
data, I'd propose the following (see the filled-in sketch below):

 1) Create a <pkg>/data/dougBex.R file which contains only something
    like

        dougBex <- as.data.frame(
            matrix(scan(system.file("<pkg>/data/dougBex.dat")),
                   ncol = ..., dimnames = ...))

 2) Create <pkg>/data/dougBex.dat to contain all your data,
    white-space delimited, numeric.

  DougB> In S/S-PLUS the data.dump/data.restore functions use a
  DougB> portable representation that can be parsed without
  DougB> exponential memory growth.

Hmm, yes; we have been longing for someone to write
data.dump/data.restore for R.  Any volunteers?

--
Martin Maechler <maechler@stat.math.ethz.ch>          <><
Seminar fuer Statistik, ETH-Zentrum SOL G1; Sonneggstr. 33
ETH (Federal Inst. Technology)  8092 Zurich  SWITZERLAND
phone: x-41-1-632-3408          fax: ...-1086
http://www.stat.math.ethz.ch/~maechler/
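P.S.  Filled in for your 9000 x 17 numeric set, step 1) might read as
follows; dougBex, the v1..v17 column names, and byrow = TRUE (which
assumes one observation per line in the .dat file) are all just
placeholders for whatever suits your data:

    ## <pkg>/data/dougBex.R
    dougBex <- as.data.frame(
        matrix(scan(system.file("<pkg>/data/dougBex.dat")),
               ncol = 17, byrow = TRUE,
               dimnames = list(NULL, paste("v", 1:17, sep = ""))))

    ## scan() fills one flat numeric vector, so none of the recursive
    ## parsing of a 9000-row data.frame() expression is involved.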