Hello R-users,
First, my settings: R-1.8.1, compiled as a 64-bit application on Solaris
5.8 (64-bit). The machine has 8 GB of RAM and I am its sole user, so pretty
much all of the 8 GB is available to R.
I am pretty new to R and I am having a hard time working with large data
sets, which make up over 90% of the analyses done here. The data set I
imported into R from S+ has a little over 2,000,000 rows by somewhere
around 60 variables, most of them factors but a few continuous. The data
set is in fact a subset of a larger data set used for analysis in S+. I
know that some of you will think I should sample, but that is not an
option in the present setting.
After first reading the data set into R -- which had challenges of its
own -- saving the workspace when I quit R takes over 5 minutes, and loading
the data set in a new session takes around 15 minutes.
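For what it is worth, here is roughly how I save and reload the data object
itself rather than the whole workspace; the file path below is just a
placeholder, and I am assuming that save()/load() on the single object is
the sensible way to do this:

## save only the big data frame, compressed (path is a placeholder)
save(qc.b3.sans.occ, file = "/path/to/qc_b3_sans_occ.RData",
     compress = TRUE)

## in a fresh session, load just that object back
load("/path/to/qc_b3_sans_occ.RData")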
I am trying to build a model that I have already built in S+, so that I can
make sure I am doing the right thing and can compare resource usage, but so
far I have had no luck. After 45 minutes or so R has used up all of the
available memory and starts swapping, which brings CPU usage close to
nothing.
I am convinced there are settings I could use to optimize memory management
for such problems. I tried help(Memory), which tells me about the options
"--min-vsize=vl --max-vsize=vu --min-nsize=nl --max-nsize=nu", but it is
not clear whether and when they should be used. Further down the page it
says: "..., and since setting larger values of the minima will make R
slightly more efficient on large tasks." On the other hand, searching the
R site for memory-management clues, I found a reply from Brian Ripley,
dated 13 Nov. 2003:
"But had you actually read the documentation you would know it did not do
that. That needs --max-memory-size set." That was in response to someone
who had increased the value of "min-vsize="; furthermore, I cannot find any
"--max-memory-size" option.
I am wondering whether someone with experience working with large data sets
would share the configuration and options they are using. In case it
matters, here is the model I was trying to fit:
library(package = "statmod", pos = 2,
        lib.loc = "/home/jeg002/R-1.8.1/lib/R/R_LIBS")

qc.B3.tweedie <- glm(formula = pp20B3 ~ ageveh + anpol + categveh +
                         champion + cie + dossiera + faq13c + faq5a +
                         kmaff + kmprom + nbvt + rabprof + sexeprin +
                         newage,
                     family = tweedie(var.power = 1.577, link.power = 0),
                     etastart = log(rep(mean(qc.b3.sans.occ[, "pp20B3"]),
                                        nrow(qc.b3.sans.occ))),
                     weights = unsb3t1,
                     trace = TRUE,
                     data = qc.b3.sans.occ)
After one iteration (45+ minutes) R is thrashing through over 10 GB of
memory.
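What I plan to try next, assuming I have read ?glm correctly, is to cut the
data frame down to only the variables that appear in the call and to ask
glm() not to keep the model frame or response, since for 2,000,000 rows
those are large objects in their own right. The names vars.used and
qc.small below are just placeholders, and I have no idea yet whether the
saving will be enough:

vars.used <- c("pp20B3", "ageveh", "anpol", "categveh", "champion",
               "cie", "dossiera", "faq13c", "faq5a", "kmaff", "kmprom",
               "nbvt", "rabprof", "sexeprin", "newage", "unsb3t1")
qc.small <- qc.b3.sans.occ[, vars.used]

## same model as above (etastart and trace omitted here for brevity),
## but without storing the model frame, design matrix, or response
qc.B3.tweedie <- glm(pp20B3 ~ ageveh + anpol + categveh + champion +
                         cie + dossiera + faq13c + faq5a + kmaff +
                         kmprom + nbvt + rabprof + sexeprin + newage,
                     family = tweedie(var.power = 1.577, link.power = 0),
                     weights = unsb3t1,
                     data = qc.small,
                     model = FALSE, x = FALSE, y = FALSE)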
Thanks for any insights,
Gérald Jean
Consulting Analyst (Statistics), Actuarial
telephone: (418) 835-4900 ext. 7639
fax: (418) 835-6657
e-mail: gerald.jean at spgdag.ca
"In God we trust, all others must bring data" -- W. Edwards Deming