Dear List,
I posted this in R-mixed and did not receive any feedback. I might post it
in the wrong place. I re-post in R-help and hope to receive any suggestions
and\or thoughts regarding data analysis.
The objective of the study is to investigate effects of soil properties on
insect outbreaks. There are four study fields (or sites). Data were
collected from 1996 through 2009. Below is sampling number per site per
year.> table(alldfv3$year, alldfv3$site)
1 2 3 4
1996 256 265 222 197
1997 239 272 249 236
1998 216 239 216 222
1999 236 246 216 232
2000 237 251 247 246
2001 236 248 0 0
2002 233 249 210 208
2003 261 300 265 256
2004 268 281 278 263
2005 265 282 273 263
2006 268 283 277 263
2007 274 285 278 0
2008 276 288 280 0
2009 268 293 279 0
As you can see, data were not collected in some study sites in some years
and sampling numbers per site per year are also vary. Sampling was
conducted every 8 feet in each study site. Investigator recorded GPS (long
and lat) of each sampling point. Within each sampling point, soil insects
were collected and counted and soil data were also collected from each
sampling site only in 2007. Investigator assumes that soil properties (e.g.
pH, Mg, Ca, lime, etc) are relatively consistent year to year. Thus,
measurements of soil properties in 2007 were applied to other years. So,
GPS reading for insects and soil properties are different. To investigate
the effect of soils on insect, first I ran thin plate spline regression to
fit surface to soil data. And then, I obtained predicted soil properties
using insect GPS readings so that I have the same GPS readings between
insect and soils data.
Due to the corlinearity among predictors, I conducted principal component
analysis and obtained first 4 principal components accounting over 85 %
total variation. Using coefficient of 4 principal components, new 4
predictors were obtained as follows (i.e., PC1 ~ PC4);
> head(df.chf09)
chafer year longitude latitude paddock2 PC1 PC2
PC3 PC4
12152 0 2009 -87.85045 33.90220 11 1.1790748 -2.895178 -1.522787
-2.33674507
12153 0 2009 -87.85045 33.90214 11 0.7852655 -3.758568 -1.691573
-2.16018891
12154 0 2009 -87.85046 33.90206 11 0.5025919 -4.447035 -1.794957
-1.17294024
12155 0 2009 -87.85046 33.90199 11 0.5575338 -4.587492 -1.894354
-0.24759008
12156 0 2009 -87.85046 33.90192 11 0.7923091 -4.505882 -2.194825
0.04366179
12157 0 2009 -87.85046 33.90186 11 1.0542226 -4.448612 -2.605783
0.35938751
Below is summary of an insect species (chafer).
year chf.sum chf.mean chf.max chf.min chf.n 0 1 2 3 4 5 6 7
8 13
1996 1100 1.17 13 0 940 498 161 122 68 34 22 18 4 12 1
1997 250 0.25 8 0 996 855 80 35 14 6 4 1 0 1 0
1998 57 0.06 4 0 893 854 28 6 3 2 0 0 0 0 0
1999 4 0.00 1 0 930 926 4 0 0 0 0 0 0 0 0
2000 0 0.00 0 0 981 981 0 0 0 0 0 0 0 0 0
2001 16 0.03 5 0 484 474 7 2 0 0 1 0 0 0 0
2002 20 0.02 3 0 900 886 10 2 2 0 0 0 0 0 0
2003 48 0.04 5 0 1082 1050 22 7 1 1 1 0 0 0 0
2004 92 0.08 4 0 1090 1019 55 12 3 1 0 0 0 0 0
2005 236 0.22 6 0 1083 931 104 28 10 6 2 2 0 0 0
2006 30 0.03 3 0 1091 1067 20 2 2 0 0 0 0 0 0
2007 16 0.02 2 0 837 822 14 1 0 0 0 0 0 0 0
2008 28 0.03 4 0 844 822 18 3 0 1 0 0 0 0 0
2009 93 0.11 4 0 840 764 62 12 1 1 0 0 0 0 0
, where
first column: year of data collection
second column: sum of insect collected in each year
third column: mean (sum/obs)
fourth/fifth column: maximum/minimum number of insects
sixth column: total observation number
seventh through end of columns: frequencies of each respective insect
counts.
As you can see, I have data with extreme number of zeros in each year and
by different insects. Even in 2000, there is no insect count at all. So, I
tried fitting 'zero inflated Poisson model' to the data using cozigam
function as follows:
> chf.res1 <- cozigam(chafer ~ s(longitude, latitude) + s(PC1) + s(PC2) +
s(PC3) + s(PC4) + (year) + paddock2, family=poisson, data=df.chf09)
iteration = 2 norm = 1.023552
iteration = 3 norm = 0.4258558
iteration = 4 norm = 0.003208923
iteration = 5 norm = 0.006238185
iteration = 6 norm = 0.006029638
iteration = 7 norm = 0.005037579
iteration = 8 norm = 0.004390153
iteration = 9 norm = 0.003970208
iteration = 10 norm = 0.003670730
iteration = 11 norm = 0.00343349
iteration = 12 norm = 0.003229298
iteration = 13 norm = 0.003043741
iteration = 14 norm = 0.002869686
iteration = 15 norm = 0.002703575
iteration = 16 norm = 0.002543601
iteration = 17 norm = 0.002388817
iteration = 18 norm = 0.002238705
iteration = 19 norm = 0.002092954
iteration = 20 norm = 0.001951361
Error in solve.default(t(X.c) %*% diag(w.c^2) %*% X.c + Lambda) :
system is computationally singular: reciprocal condition number 1.60481e-25
> chf.res1 <- cozigam(chafer ~ s(longitude, latitude) + s(PC1) + s(PC2) +
s(PC3) + s(PC4), family=poisson, data=df.chf09)
iteration = 2 norm = 1.019157
iteration = 3 norm = 0.3510161
iteration = 4 norm = 0.05217535
iteration = 5 norm = 0.04782636
iteration = 6 norm = 0.03668330
iteration = 7 norm = 0.02957568
iteration = 8 norm = 0.02715226
iteration = 9 norm = 0.0258863
iteration = 10 norm = 0.02498481
iteration = 11 norm = 0.02419356
iteration = 12 norm = 0.02343089
iteration = 13 norm = 0.02267359
iteration = 14 norm = 0.02191685
iteration = 15 norm = 0.02116107
iteration = 16 norm = 0.02040789
iteration = 17 norm = 0.01965911
iteration = 18 norm = 0.01891649
iteration = 19 norm = 0.01818169
iteration = 20 norm = 0.01745623
=========================================estimated alpha = 0.05016903 ( NaN )
estimated delta = 0.1491459 ( NaN )
=========================================
Warning messages:
1: In sqrt(V.theta[1, 1]) : NaNs produced
2: In sqrt(V.theta[2, 2]) : NaNs produced> summary(chf.res1)
Family: poisson
Link function: log
Formula:
chafer ~ s(longitude, latitude) + s(PC1) + s(PC2) + s(PC3) +
s(PC4)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.70922 NA NA NA
alpha 0.05017 NA NA NA
delta1 0.14915 NA NA NA
Approximate significance of smooth terms:
edf Est.rank Chi.sq p-value
s(longitude,latitude) 2.000 2 -0.158 1.000
s(PC1) 1.386 3 0.011 1.000
s(PC2) 1.000 1 -0.387 1.000
s(PC3) 1.737 4 1.153 0.886
s(PC4) 1.000 1 -0.191 1.000
Scale est. = 1 n = 840
This is beyond my scope of understanding and fixing problems.
My questions are:
1. Am I using the right method to analyze the data in this particular case?
2. If so, how can I fix the error above?
3. If not, do you have any suggestion or thoughts to handle this kind of
data sets?
Besides these questions, if you share any other comments and/or suggestions,
I would appreciate that.
Please let me know if there is lack of information you may need before
comments.
Thank you very much in advance!!
Steve Hong
[[alternative HTML version deleted]]