Peter Davenport
2012-Feb-10 18:53 UTC
[R] a) t-tests on loess splines; b) linear models, type II SS for unbalanced ANOVA
Dear all, I have some questions regarding the validity an implementation of statistical tests based on linear models and loess. I've searched the R-help arhives and found several informative threads that related to my questions, but there are still a few issues I'm not clear about. I'd be grateful for guidance. Background and data set: I wish to compare the growth and metabolism of 2 strains of bacteria. I have grown the 2 strains in 3 replicates on 2 different occasions and on each occasion taken hourly measurements of the population density of bacteria (represented by "od" (optical density)) and the concentrations of several chemicals, such as glycerol, in their growth medium (environment). The data looks like this: > head(df.yx.t) uid cid date g t od glycerol alanine 6 6 Ma,1 25 Ma 1 0.1428571 3025750049 7843841013 7 7 Ma,1 25 Ma 2 0.4542857 3026668036 7902016686 8 8 Ma,1 25 Ma 3 1.4542857 3086406597 8017589237 9 9 Ma,1 25 Ma 4 2.5866667 2821935918 6595302338 10 10 Ma,1 25 Ma 5 3.0933333 2674142017 6144141491 11 11 Ma,1 25 Ma 6 3.6000000 3026824144 7009788499 > summary(df.yx.t) uid cid date g t od glycerol alanine 6 : 1 Ma,1 :10 25:60 Ma :60 1 :12 Min. :0.1171 Min. :7.305e+08 Min. :9.858e+07 7 : 1 Ma,2 :10 26:60 RILI2:60 2 :12 1st Qu.:1.3921 1st Qu.:1.757e+09 1st Qu.:3.199e+09 8 : 1 Ma,3 :10 3 :12 Median :3.8042 Median :2.541e+09 Median :5.664e+09 9 : 1 Ma,4 :10 4 :12 Mean :3.7822 Mean :2.354e+09 Mean :5.264e+09 10 : 1 Ma,5 :10 5 :12 3rd Qu.:5.9495 3rd Qu.:3.026e+09 3rd Qu.:7.699e+09 11 : 1 Ma,6 :10 6 :12 Max. :8.7375 Max. :3.296e+09 Max. :8.501e+09 (Other):114 (Other):60 (Other):48 uid = unique id for an observation cid = culture id (a within subjects factor identifying each bacterial culture) date = date of culture; data was collected in 2 batches, with a balanced split of observations (3 replicates of each genotype in each batch) g = genotype (the name of the bactrial strain) t = time (1-10 hours) od = optical density (a proxy for poulation density of the bacterial) glycerol = glycerol "concentration" alanine = alanine "concentration" I have tested whether Genotype affects the growth (od) of the bacteria using lm() and anova(). #define the model m1<-lm(log10(od)~date*g*t,data=df.yx.t) #perform anova > anova(m1) Analysis of Variance Table Response: log10(od) Df Sum Sq Mean Sq F value Pr(>F) date 1 0.006 0.0056 5.3763 0.02297 * g 1 0.107 0.1066 103.2951 4.646e-16 *** t 9 33.646 3.7385 3620.9069 < 2.2e-16 *** date:g 1 0.003 0.0030 2.8731 0.09396 . date:t 9 0.015 0.0017 1.6542 0.11417 g:t 9 0.021 0.0023 2.2329 0.02796 * date:g:t 9 0.005 0.0005 0.4950 0.87383 Residuals 80 0.083 0.0010 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 So far so good. (Although I do wonder whether I should be using a within subjects predictor variable (e.g. via car::Anova).) However, I also want to test whether genotype affects the concentrations of metabolites, e.g. glycerol, in the growth medium at a given population density. I envisage testing this by one of 2 methods, neither of which I am confident is correct, despite digging aound in various fora and statistics texts. Method 1. Loess splines I could fit a loess curve to glycerol~od, (the date effect disappears when plotting against od rather than t). m1<-loess(glycerol~od, data=df.yx.t[df.yx.t$g=="Ma",]) m2<-loess(glycerol~od, data=df.yx.t[df.yx.t$g=="RILI2",]) However, I'm not confident that I can perform valid statistic tests based on this. I could try to test the whether there is a significant difference in glycerol concentration at any given od as follows: #test for a difference at od=7 v.pred<-7 pred1<-predict(m1,newdata=data.frame(od=v.pred),se=T) pred2<-predict(m2,newdata=data.frame(od=v.pred),se=T) dfit<-data.frame(g=c("Ma","RILI2"),fit=c(pred1$fit,pred2$fit),se=c(pred1$fit,pred2$fit)) alpha<-0.05 dfit$lo<-with(dfit,fit-se*abs(qnorm(alpha/2))) dfit$hi<-with(dfit,fit+se*abs(qnorm(alpha/2))) Is this legitimate? I might also try to test whether the curves are significantly different using anova(m1,m2), though again I'm not sure of the validity of this approach since the 2 models are based on different (and different numbers of) observations. Method 2. Linear models m2<-lm(glycerol~g*od*date,data=df.yx.t) The use of a linear model is complicated by 2 considerations: 1) od is a continuous variable. If my limited understanding of linear models and ANOVA is correct, then m1 above tests for a linear relationship between od and the response variable (e.g. glycerol). The response of glycerol to OD is not linear over the full range of od. Does this preclude using linear models? I'm tempted to avoid this issue by redefining the model as: m3<-lm(glycerol/od~g*t*date,data=df.yx.t) Thereby enabling me to perform a factorial ANOVA. 2) There are missing values in the metabolite data, making the data unbalanced. I have read that one can use type II sum of squares in these circumstance, which I use without further adjusting for the differences in group sizes: car::Anova(m2,type=2) Thank you for getting this far. Any guidance would be much appreciated! Peter Davenport [[alternative HTML version deleted]]