Recently there was a question on using lme with large data sets. As an
experiment I fit a linear mixed-effects model to a data set with about
350,000 observations on 6 predictors, a numerical response, and a
single grouping factor.
The timings shown below were on a 1.2 GHz Athlon with 1 GB of PC133
memory and 2 GB of swap. The operating system is Debian 3.0
GNU/Linux. The kernel is 2.4.14.
> dim(dallas)
[1] 348478      8
> system.time(print(summary(dallas)))
year grade ids campus sex
1994:47291 3:60207 Min. : 3 57905049: 10725 F :171388
1995:44397 4:58014 1st Qu.: 495362 57905051: 6296 M :161423
1996:43159 5:57467 Median :1849290 57905043: 5982 NA's: 15667
1997:49306 6:59493 Mean :1748909 57905042: 5728
1998:51978 7:58261 3rd Qu.:2856098 57905065: 5484
1999:53587 8:55036 Max. :3584203 57905044: 5338
2000:58760 (Other) :308925
ethnicity disadvg tli
B :164680 N : 92016 Min. : 0.00
H :124410 Y :216169 1st Qu.:61.00
M : 3394 NA's: 40293 Median :75.00
O : 2926 Mean :71.24
W : 39285 3rd Qu.:84.00
NA's: 13783 Max. :93.00
[1] 5.12 0.03 6.05 0.00 0.00
> gc()
used (Mb) gc trigger (Mb)
Ncells 540528 14.5 2013967 53.8
Vcells 2303229 17.6 7300819 55.8
> object.size(dallas)
[1] 18126736
This shows that the data frame is about 18 MB, and the total memory
image for R is about 30 MB but has lots of room to grow. Operations like
summarizing the columns are very fast.
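To see where that 18 MB goes, one can look at the approximate size of
each column (a minimal sketch; object.size() is simply applied column
by column):

## approximate memory used by each column of the data frame;
## the factor columns are stored as integer codes plus a levels
## table, so they are fairly compact
sapply(dallas, object.size)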
I was able to fit an lme model with a random effect for student
(ids).
> system.time(fm1 <- lme(tli ~ year + grade + sex + ethnicity, data = dallas,
+     random = ~ 1 | ids, na.action = na.omit,
+     control = lmeControl(niterEM = 50, msVerbose = TRUE)))
initial value 2916459.573268
final value 2916459.573268
converged
[1] 7220.25 19.17 8447.31 0.00 0.00
> gc()
used (Mb) gc trigger (Mb)
Ncells 1127215 30.1 2620392 70.0
Vcells 28941125 220.9 55752037 425.4
> object.size(fm1)
[1] 30863416
The memory image of R stayed at around 150 MB during most of the
function evaluation, then went up to about 350 MB. I imagine the
larger memory requirement occurred after convergence, although I did
not check this. A lot of the work in lme (and many other
model-fitting functions) is in creating the fitted model object after
the parameter estimates have been obtained.
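If memory becomes a problem at this stage, one option (just a sketch,
not something I did here) is to keep only the small summaries that are
needed later and release the full fitted object:

## extract the small pieces and discard the large fitted-model object
fm1.fixed  <- fixef(fm1)     # fixed-effects estimates
fm1.varcor <- VarCorr(fm1)   # random-effects standard deviations
fm1.ll     <- logLik(fm1)    # log-restricted-likelihood
rm(fm1)
gc()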
As you can see, this took a long time (over two hours), even on this
relatively fast computer. I did set the number of EM iterations to 50,
which is probably more than necessary for this model and would slow
down the fitting.
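The default in lmeControl() is, I believe, niterEM = 25, so a fit that
simply leaves the control argument at its default (a sketch I have not
timed on these data) would be

## same model with the default control settings (niterEM = 25),
## which should cut down the time spent in the EM iterations
## before the Newton-Raphson steps take over
system.time(fm1a <- lme(tli ~ year + grade + sex + ethnicity,
                        data = dallas, random = ~ 1 | ids,
                        na.action = na.omit))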
As expected when you have this many observations, all the fixed
effects are statistically significant.
> summary(fm1)
Linear mixed-effects model fit by REML
Data: dallas
AIC BIC logLik
2557954 2558158 -1278958
Random effects:
Formula: ~1 | ids
(Intercept) Residual
StdDev: 12.90782 7.742589
Fixed effects: tli ~ year + grade + sex + ethnicity
Value Std.Error DF t-value p-value
(Intercept) 68.47575 0.0683054 196971 1002.4933 <.0001
year.L 10.17203 0.0710794 196971 143.1080 <.0001
year.Q -1.10901 0.0426594 196971 -25.9968 <.0001
year.C -0.41155 0.0403056 196971 -10.2108 <.0001
year^4 0.78151 0.0388381 196971 20.1223 <.0001
year^5 -0.84449 0.0385725 196971 -21.8935 <.0001
year^6 0.19431 0.0380664 196971 5.1046 <.0001
grade.L 1.42974 0.0597163 196971 23.9423 <.0001
grade.Q -2.77355 0.0384930 196971 -72.0533 <.0001
grade.C -0.33285 0.0364386 196971 -9.1345 <.0001
grade^4 1.10184 0.0355329 196971 31.0089 <.0001
grade^5 1.48623 0.0349311 196971 42.5477 <.0001
sexM -1.83115 0.0770710 134707 -23.7592 <.0001
ethnicityH 1.85379 0.0836131 134707 22.1711 <.0001
ethnicityM 3.09373 0.4232054 134707 7.3102 <.0001
ethnicityO 10.47489 0.4083131 134707 25.6541 <.0001
ethnicityW 10.16495 0.1245893 134707 81.5876 <.0001
Correlation:
(Intr) year.L year.Q year.C year^4 year^5 year^6 grad.L grad.Q
year.L -0.001
year.Q -0.048 -0.030
year.C 0.012 -0.075 -0.053
year^4 -0.004 0.010 -0.019 -0.054
year^5 0.001 0.000 0.020 -0.021 -0.055
year^6 -0.001 -0.009 -0.009 0.024 -0.001 -0.066
grade.L 0.013 -0.633 0.007 0.042 0.003 0.009 -0.003
grade.Q -0.037 -0.002 0.085 -0.014 -0.003 0.008 -0.006 0.011
grade.C 0.004 0.021 -0.005 -0.048 -0.004 0.006 0.000 -0.029 -0.006
grade^4 -0.003 -0.001 0.004 0.005 0.020 -0.003 0.001 0.003 -0.006
grade^5 -0.001 -0.001 0.011 0.002 -0.002 -0.010 0.004 -0.003 0.010
sexM -0.547 -0.006 0.000 0.001 -0.002 0.000 0.000 0.006 -0.002
ethnicityH -0.563 -0.092 -0.015 -0.010 -0.011 -0.010 0.008 0.037 0.010
ethnicityM -0.110 0.002 0.005 0.000 -0.001 -0.001 0.001 0.000 0.005
ethnicityO -0.113 0.013 0.005 0.000 -0.001 -0.003 0.002 -0.020 0.000
ethnicityW -0.380 0.041 0.004 -0.001 0.000 -0.001 0.001 -0.028 0.005
grad.C grad^4 grad^5 sexM ethncH ethncM ethncO
year.L
year.Q
year.C
year^4
year^5
year^6
grade.L
grade.Q
grade.C
grade^4 -0.005
grade^5 0.001 -0.015
sexM 0.003 0.000 0.000
ethnicityH -0.001 0.002 -0.003 -0.010
ethnicityM 0.001 0.001 0.000 -0.005 0.092
ethnicityO 0.001 0.002 0.000 -0.007 0.094 0.019
ethnicityW -0.001 -0.001 0.000 -0.005 0.308 0.062 0.065
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-8.71167390 -0.43196123 0.06053214 0.47646735 7.23234387
Number of Observations: 331695
Number of Groups: 134713
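With this many observations the p-values are not very informative; it
is probably more useful to look at the estimated effects, and their
confidence intervals, on the scale of the response. A minimal sketch:

## confidence intervals on the fixed effects
intervals(fm1, which = "fixed")
## magnitudes of the fixed effects relative to the within-group
## standard deviation (about 7.7 points on the tli scale)
fixef(fm1) / fm1$sigma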
Brian Ripley has pointed out that it is rare for data sets of this
size to be homogeneous. That caution is warranted here: these data are
not homogeneous, although I won't go into the details of exactly why
the results are questionable.
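One way to get a rough picture of where the homogeneity assumption
breaks down is to look at the within-group residuals by subgroup, for
example (a sketch using the formula interface to plot.lme):

## standardized residuals against fitted values, by ethnicity
plot(fm1, resid(., type = "p") ~ fitted(.) | ethnicity, abline = 0)
## boxplots of residuals by grade
plot(fm1, grade ~ resid(.))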
I started another model fit
> system.time(fm2 <- update(fm1, random = ~ year | ids))
but that is still running. R is using about half a gigabyte of memory.
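If and when that fit converges, the random-intercept and random-slope
models can be compared directly (a sketch, assuming fm2 finishes):

## likelihood-ratio comparison of the two REML fits; this is
## legitimate here because the fixed-effects specification is
## the same in both models
anova(fm1, fm2)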