George Ostrouchov
2003-Feb-19 02:07 UTC
[Rd] R as analysis server for very large data sets
At ORNL, we are building a system, ASPECT (Adaptive Simulation Product Exploration and Control Toolkit), for analyzing output from massive simulations. It is essentially a client-server setup that reads netCDF and HDF files and uses MPI for some distributed tasks. The total output of a simulation can be terabytes, but an individual variable may be only a gigabyte, and some relevant subsets are even smaller. In theory, a single variable can be handled on a 64-bit machine with a few gigabytes of memory, say 10 GB. I understand that some folks have had some success running R on a 64-bit machine.

In addition to some home-grown distributed data analysis codes, we have included a facility for calling a limited subset of R functions from ASPECT. Simple use of R on a large data set did not work well. For example, computing a simple histogram consumed several times (I think it was 3 times) more memory than the data itself requires. Some editing of the hist.default function fixed the problem, but reduced the generality of the function. The default seemed to generate a dimnames attribute that became as large as the data. It may be that our initial data matrix had some attributes we were not aware of.

It seems that generality and metadata generation in R run counter to R's ability to handle large data sets. Can someone comment on this?

Are there functions in R that will strip a variable of all its attributes, except for the structure, such as vector, matrix, or array? Or are there options to prevent some functions from generating more attributes? ... Perhaps an attribute to prevent further attributes?

Does it make sense to propose building (assuming that someone has time to do it) a "large data" subset of R?

Thanks for your help,
George

----------------------------------------------------------
George Ostrouchov
Statistics and Data Sciences Group
Computer Science and Mathematics Division
Oak Ridge National Laboratory
----------------------------------------------------------
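
P.S. To make the attribute question above concrete, here is a minimal sketch in base R of what "stripping everything except the structure" might look like. The helper name strip_attributes and the small example matrix are hypothetical, made up for illustration; this is not the ASPECT code, and it says nothing about how many copies R makes internally for a gigabyte-sized object.

```r
# Hypothetical stand-in for a large variable: a small matrix carrying dimnames.
x <- matrix(as.numeric(1:6), nrow = 2,
            dimnames = list(c("a", "b"), c("u", "v", "w")))

# Hypothetical helper: drop every attribute, then restore only dim,
# so the object keeps its vector/matrix/array structure and nothing else.
strip_attributes <- function(obj) {
  d <- dim(obj)                     # remember the shape (NULL for a plain vector)
  attributes(obj) <- NULL           # removes all attributes, including dim and dimnames
  if (!is.null(d)) dim(obj) <- d    # put the matrix/array shape back
  obj
}

y <- strip_attributes(x)
names(attributes(y))   # only "dim" remains
```

Something like hist(strip_attributes(x), plot = FALSE) would then operate on data with no dimnames attached, though whether that alone avoids the extra memory we saw is an open question.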