Hi all, I'm running into some computer issues when trying to run a binomial model for spatially correlated data using glmmPQL and was wondering if anyone could help me out. My whole dataset consists of about 300,000 points for which I have a suite of environmental variables (I'm trying to come up with a habitat model for a species of seal, using real (presence) and simulated dives (absence) as my response variable). Since my dataset is so large, I split it into thirds and ran the model without spatial correlation. However, when checking my results, I did get a spatial correlation in the residuals, which I'm trying to incorporate using a variogram (spherical). The problem is that when I run it for my 1/3 of my data (about 70k points), the calculatations go on forever. I was running the first model for over a week and was still not getting past the first iteration. This is the model I used M1.f.Spa <- glmmPQL(presence ~ sst + tmax100 + tbot + sss + ssu + tmax100d + sbot, random = ~1|fPTT, family = binomial, correlation corSpher(c(91323.53,0.4279603), form =~ x+y, nugget = TRUE), data sample.df1) This week, I tried a different approach and split the 70k subset into smaller datasets of 10,000 points, and now the model runs much faster (just a couple of hours tops), Yet the output of these models changes with regards to the significance of one variable, which makes me think that I need a larger dataset than 10,000 points. Does someone have a suggestion on how to improve/run this code with the spatial correlation for a larger dataset than 10,000 which wouldn't take weeks to run? I'm working with an Intel Core i7, 12 Gb RAM (plus a couple of 100 Gbs in virtual memory), in Windows 7 64-bit. Thanks for your help, ---------------------------------------- Luis A. Huckstadt, Ph.D. Department of Ecology and Evolutionary Biology University of California Santa Cruz Long Marine Lab 100 Shaffer Road Santa Cruz, CA 95060 [[alternative HTML version deleted]]
Luis Huckstadt <lahuckst <at> ucsc.edu> writes:> I'm running into some computer issues when trying to run a binomial model > for spatially correlated data using glmmPQL and was wondering if anyone > could help me out. > My whole dataset consists of about 300,000 points for which I have a suite > of environmental variables (I'm trying to come up with a habitat model for > a species of seal, using real (presence) and simulated dives (absence) as > my response variable). > Since my dataset is so large, I split it into thirds and ran the model > without spatial correlation. However, when checking my results, I did get a > spatial correlation in the residuals, which I'm trying to incorporate using > a variogram (spherical). The problem is that when I run it for my 1/3 of > my data (about 70k points), the calculatations go on forever. I was running > the first model for over a week and was still not getting past the first > iteration. > > This is the model I used > > M1.f.Spa <- glmmPQL(presence ~ sst + tmax100 + tbot + sss + ssu + tmax100d > + sbot, random = ~1|fPTT, family = binomial, correlation > corSpher(c(91323.53,0.4279603), form =~ x+y, nugget = TRUE), data > sample.df1) > > This week, I tried a different approach and split the 70k subset into > smaller datasets of 10,000 points, and now the model runs much faster (just > a couple of hours tops), Yet the output of these models changes with > regards to the significance of one variable, which makes me think that I > need a larger dataset than 10,000 points.That's a bit surprising. When you say the significance of one variable changes -- does that mean it changes from (e.g.) 0.045 to 0.055, which crosses the magic line at p=0.05 but is not otherwise meaningful?> > Does someone have a suggestion on how to improve/run this code with the > spatial correlation for a larger dataset than 10,000 which wouldn't take > weeks to run? > > I'm working with an Intel Core i7, 12 Gb RAM (plus a couple of 100 Gbs in > virtual memory), in Windows 7 64-bit.This is not going to be easy in any case. I would look into the INLA package and AD Model Builder ... (I would also be mildly nervous about using PQL for binary data -- it might be OK, but this is known to be more or less the worst case scenario for PQL ...) Ben Bolker