Dear R experts:

I have a very large panel data set, about 2-8GB. Think:

  NU <- 30000; NT <- 3000
  ds <- data.frame(unit = rep(1:NU, each = NT), time = NA, x = NA)
  ds$time <- rep(1:NT, NU)
  ds$x    <- rnorm(nrow(ds))

I want to do a couple of operations within each unit first, and then some list operations at each time. Not difficult in principle. Think (pseudocode, where "merge back" stands in for reassembling the pieces):

  ds <- merge back in results of
    mclapply(split(1:nrow(ds), ds$unit), function(ids) { work on ds[ids,] })   ## same unit
  ds <- merge back in results of
    mclapply(split(1:nrow(ds), ds$time), function(ids) { work on ds[ids,] })   ## same time

The problem is that ds is big. I can store 1 copy, but not 4. What I really want is to declare ds "read-only shared memory" before the mclapply() and have the spawned processes access the same ds. Right now, each core wants its own private duplicate of ds, and the machine then runs out of memory. I don't think shared data is possible in R across mclapply. So:

* I could just run my code single-threaded. This loses the parallelism of the task, but the code remains parsimonious and the memory footprint is still ok.

* I could just throw 120GB of SSD at it as a swap file. For $100 or so, this is not a bad solution. It is slower than RAM but faster and safer than coding more complex R solutions, and probably still faster than single-threaded operation on a quad-core machine. If the swap algorithm is efficient, it shouldn't be so bad.

* I could pre-split the data before the mclapply and merge the results back afterwards. Within each chunk, I could then use mclapply. The code would be uglier and carry an extra layer of complexity (= bugs), but RAM consumption drops by orders of magnitude. I am thinking of something roughly like:

  ## first operation: save each unit's rows into its own temporary file
  sp <- split(1:nrow(ds), ds$unit)
  mclapply(names(sp), function(u) {
    chunk <- ds[sp[[u]], ]
    save(chunk, file = paste0("@", u, ".Rdata"))
  })
  save(ds, file = "ds.Rdata")
  rm(ds)                                   ## make space for the mclapply
  results <- mclapply(Sys.glob("@*.Rdata"), function(f) {
    load(f)                                ## restores 'chunk'
    ## ...do whatever on chunk...          ## run many happy small-mem processes
  })
  system("rm @*.Rdata")                    ## remove the temporary files
  load("ds.Rdata")                         ## since we deleted ds, reload the original data
  ds <- data.frame(ds, results)            ## combine the results with the full ds
  ## now run the second operation on the time units

* I could dump the data into a database, but then every access (like the split() or the mclapply()) would also have to query and reload the data again, just like my .Rdata files. Is it really faster/better than abusing the file system and R's native file formats? I doubt it, but I don't know for sure.

This is a reasonably common problem with large data sets. I saw some specific solutions on stackoverflow, a couple of which require even less parsimonious user code. Is everyone using bigmemory? Or SQL? Or ... ? (Rough sketches of what I have in mind for both are in the PS and PPS below.) I am leaning towards the SSD solution. Am I overlooking some simpler recommended solution?

/iaw

----
Ivo Welch (ivo.welch at gmail.com)
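
PS: To make the bigmemory question concrete, here is roughly the pattern I have in mind -- only a rough, untested sketch, with placeholder file names ("ds.bin"/"ds.desc" in the working directory), a hard-coded core count, and a toy within-unit computation standing in for the real work. Please do not read it as a recommendation:

  library(parallel)
  library(bigmemory)

  ## one file-backed copy on disk; a big.matrix holds a single numeric type,
  ## which unit, time, and x all satisfy
  bm <- big.matrix(nrow = nrow(ds), ncol = 3, type = "double",
                   backingfile = "ds.bin", descriptorfile = "ds.desc")
  bm[, 1] <- ds$unit   ## fill column by column to avoid a second full copy of ds
  bm[, 2] <- ds$time
  bm[, 3] <- ds$x
  desc <- describe(bm)

  unit.rows <- split(seq_len(nrow(ds)), ds$unit)   ## row indices for each unit
  rm(ds)                                           ## keep only the disk-backed copy in play

  results <- mclapply(unit.rows, function(ids) {
    m     <- attach.big.matrix(desc)   ## attach the shared backing file; no private duplicate of ds
    chunk <- m[ids, ]                  ## columns: 1 = unit, 2 = time, 3 = x
    mean(chunk[, 3])                   ## stand-in for the real within-unit work
  }, mc.cores = 4)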
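
PPS: And this is roughly what I imagine the database route would look like -- again only a rough sketch, with a placeholder file name ("ds.sqlite"), a hard-coded core count, and a toy per-unit query standing in for the real work:

  library(parallel)
  library(DBI)
  library(RSQLite)

  ## write the panel into a one-table SQLite file once
  con <- dbConnect(SQLite(), "ds.sqlite")
  dbWriteTable(con, "ds", ds, overwrite = TRUE)
  dbExecute(con, "CREATE INDEX idx_unit ON ds(unit)")  ## so per-unit queries do not scan the whole table
  dbDisconnect(con)
  rm(ds)   ## the big data.frame no longer has to sit in RAM

  results <- mclapply(1:NU, function(u) {
    ## each forked worker opens its own connection; connections should not be shared across forks
    con   <- dbConnect(SQLite(), "ds.sqlite")
    chunk <- dbGetQuery(con, sprintf("SELECT * FROM ds WHERE unit = %d", u))
    dbDisconnect(con)
    mean(chunk$x)   ## stand-in for the real within-unit work
  }, mc.cores = 4)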