Matthias Gralle
2009-Sep-14 11:43 UTC
[R] Analysis of a highly pseudoreplicate mixed-effects experiment
Hello everybody, I have been trying for some weeks to state the correct design of my experiment as a GLM formula, and have not been able to find something appropriate in Pinheiro & Bates, so I am posting it here and hope somebody can help me. In each experimental condition, described by 1) gene (10 levels, fixed, because of high interest to me) 2) species (2 levels, fixed, because of high interest) 3) day (2 levels, random) 4) replicate (2 levels per day, random), I have several thousand data points consisting of two variables: 5) FITC (level of transfection of a cell) 6) APC (antibody binding to the cell) Because of intrinsic and uncontrollable cell-to-cell variation, FITC varies quite uniformly over a wide range, and APC correlates rather well with FITC. In some cases, I pasted day and replicate together as day_repl. My question is the following: Is there any gene (in my set of 10 genes) where the species makes a difference in the relation between FITC and APC ? If yes, in what gene does species have an effect ? And what is the effect of the species difference ? My attempts are the following: 1. Fit the data points of each experimental condition to a linear equation APC=Intercept+Slope*FITC and analyse the slopes : lm(Slope~species*gene*day_repl) This analysis shows clear differences between the genes, but no effect of species and no interaction gene:species. The linear fit to the cells is reasonably good, but of course does not represent the data set completely, so I wanted to incorporate the complete data set. 2a. lmer(APC~FITC*species*gene+(1|day)+(1|repl)) This gives extremely significant values for any interaction and variable because there are >200 000 df. Of course, it cannot be true, because the cells are not really independent. I have done many variations of the above, e.g. 2b. lmer(APC~FITC*species*gene+(1|day)+(1+FITC|day_repl)), but they all suffer from the excess of df. 3. lmer(APC~species*gene+(1|day/repl/FITC) gives several warning messages like this one: In repl:day : numerical expression has 275591 elements: only the first used 4. lmer(APC~gene*species+(1|day_repl) + (1+gene:species|FITC)) ran several days, but failed to converge... Can somebody give me any hint, or do you think the only possible analysis is a simplification as in my model 1 ? By the way, I am using R version 2.8.0 (2008-10-20) on Ubuntu 8.04 on a linux 2.6.24-24-generic kernel on different Intel systems. I am using the lme4 that came with R 2.8.0. Thank you very much for your time! -- Matthias Gralle, PhD Dept. Evolutionary Genetics Max Planck Institute for Evolutionary Anthropology Deutscher Platz 6 04103 Leipzig, Germany Tel +49 341 3550 519 Fax +49 341 3550 555
Matthias Gralle
2009-Sep-16 09:10 UTC
[R] Example data for analysis of a highly pseudoreplicate mixed-effects experiment
Hello everybody, it may be better to have sample data. I have provided data with less levels of "gene" and "day" and only ca. 400 data points per condition. Sample code: small=as.data.frame(read.csv("small.csv")) small$species=factor(small$species) small$gene=factor(small$gene) small$day=factor(small$day) small$repl=factor(small$repl) library(lme4) model1=lmer(APC~(FITC+species+gene)^3+(1|day)+(1|repl),REML=F,data=small) model2=lmer(APC~(FITC+species+gene)^2+(1|day)+(1|repl),REML=F,data=small) anova(model1,model2) gives me a significant difference, but in fact there are far too many residual df, and this is much worse in the original data. I suppose I cannot trust this difference. model3=lmer(APC~species*gene+(1|day/repl/FITC),data=small) Error: length(f1) == length(f2) is not TRUE In addition: Warning messages: 1: In FITC:(repl:day) : numerical expression has 886 elements: only the first used 2: In FITC:(repl:day) : numerical expression has 886 elements: only the first used Can I do nesting without incurring this kind of error ? And finally model4=lmer(APC~gene*species+(1|day) + (1|repl) + (1+(gene:species)|FITC),data=small) model5=lmer(APC~gene+species+(1|day) + (1|repl) + (1|FITC),data=small) anova(model4,model5) works with this reduced data set, but fails to converge for the original, much larger data set. I am unsure if this is the right kind of analysis for the data and there is only a problem of convergence, or if it is the wrong formula. Can anybody give me some advice on this problem ? Or should I just stick to reducing the data from each condition to a single parameter, as explained in my first email below? Thank you again! Matthias Gralle wrote:> Hello everybody, > > I have been trying for some weeks to state the correct design of my > experiment as a GLM formula, and have not been able to find something > appropriate in Pinheiro & Bates, so I am posting it here and hope > somebody can help me. > > In each experimental condition, described by > 1) gene (10 levels, fixed, because of high interest to me) > 2) species (2 levels, fixed, because of high interest) > 3) day (2 levels, random) > 4) replicate (2 levels per day, random), > > I have several thousand data points consisting of two variables: > > 5) FITC (level of transfection of a cell) > 6) APC (antibody binding to the cell) > Because of intrinsic and uncontrollable cell-to-cell variation, FITC > varies quite uniformly over a wide range, and APC correlates rather > well with FITC. In some cases, I pasted day and replicate together as > day_repl. > > My question is the following: > > Is there any gene (in my set of 10 genes) where the species makes a > difference in the relation between FITC and APC ? If yes, in what gene > does species have an effect ? And what is the effect of the species > difference ? > > My attempts are the following: > 1. Fit the data points of each experimental condition to a linear > equation APC=Intercept+Slope*FITC and analyse the slopes : > lm(Slope~species*gene*day_repl) > This analysis shows clear differences between the genes, but no effect > of species and no interaction gene:species. > > The linear fit to the cells is reasonably good, but of course does not > represent the data set completely, so I wanted to incorporate the > complete data set. > > 2a. lmer(APC~FITC*species*gene+(1|day)+(1|repl)) > This gives extremely significant values for any interaction and > variable because there are >200 000 df. Of course, it cannot be true, > because the cells are not really independent. I have done many > variations of the above, e.g. > 2b. lmer(APC~FITC*species*gene+(1|day)+(1+FITC|day_repl)), > but they all suffer from the excess of df. > > 3. lmer(APC~species*gene+(1|day/repl/FITC) gives several warning > messages like this one: > In repl:day : > numerical expression has 275591 elements: only the first used > > 4. lmer(APC~gene*species+(1|day_repl) + (1+gene:species|FITC)) ran > several days, but failed to converge... > > Can somebody give me any hint, or do you think the only possible > analysis is a simplification as in my model 1 ? > > By the way, I am using R version 2.8.0 (2008-10-20) on Ubuntu 8.04 on > a linux 2.6.24-24-generic kernel on different Intel systems. I am > using the lme4 that came with R 2.8.0. > > Thank you very much for your time! > > -- Matthias Gralle, PhD > Dept. Evolutionary Genetics > Max Planck Institute for Evolutionary Anthropology > Deutscher Platz 6 > 04103 Leipzig, Germany > Tel +49 341 3550 519 > Fax +49 341 3550 555 > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > > >-- Matthias Gralle, PhD Dept. Evolutionary Genetics Max Planck Institute for Evolutionary Anthropology Deutscher Platz 6 04103 Leipzig, Germany Tel +49 341 3550 519 Fax +49 341 3550 555
Matthias Gralle
2009-Sep-16 15:41 UTC
[R] Example data for analysis of a highly pseudoreplicate mixed-effects experiment
There were some wrong NA values in the provided data set, this is now corrected. The data can be read in as small=read.csv("small.csv",colClasses=c("character",rep("integer",2),rep("factor",5))) The high number of residual df can be seen using the nlme package (can it be seen in the lme4 package, too ?): library(nlme) model1=lme(APC~(FITC+species+gene)^3,random=~1|day_repl,method="ML",data=small) anova(model1) numDF denDF F-value p-value (Intercept) 1 789 1365.7400 <.0001 FITC 1 789 3040.8168 <.0001 species 1 789 24.0521 <.0001 gene 1 789 10.4035 0.0013 FITC:species 1 789 5.0690 0.0246 FITC:gene 1 789 10.7925 0.0011 species:gene 1 789 72.5823 <.0001 FITC:species:gene 1 789 4.6742 0.0309 In the original data set, denDF is >200 000. Once again, how do I correctly account for pseudoreplication, avoiding the artificially high number of df ? Thank you, Matthias Matthias Gralle wrote:> Hello everybody, > > it may be better to have sample data. I have provided data with less > levels of "gene" and "day" and only ca. 400 data points per condition. > > Sample code: > small=as.data.frame(read.csv("small.csv")) > small$species=factor(small$species) > small$gene=factor(small$gene) > small$day=factor(small$day) > small$repl=factor(small$repl) > library(lme4) > model1=lmer(APC~(FITC+species+gene)^3+(1|day)+(1|repl),REML=F,data=small) > model2=lmer(APC~(FITC+species+gene)^2+(1|day)+(1|repl),REML=F,data=small) > anova(model1,model2) > > gives me a significant difference, but in fact there are far too many > residual df, and this is much worse in the original data. I suppose I > cannot trust this difference. > > model3=lmer(APC~species*gene+(1|day/repl/FITC),data=small) > Error: length(f1) == length(f2) is not TRUE > In addition: Warning messages: > 1: In FITC:(repl:day) : > numerical expression has 886 elements: only the first used > 2: In FITC:(repl:day) : > numerical expression has 886 elements: only the first used > > Can I do nesting without incurring this kind of error ? > > And finally > model4=lmer(APC~gene*species+(1|day) + (1|repl) + > (1+(gene:species)|FITC),data=small) > model5=lmer(APC~gene+species+(1|day) + (1|repl) + (1|FITC),data=small) > anova(model4,model5) > > works with this reduced data set, but fails to converge for the > original, much larger data set. I am unsure if this is the right kind > of analysis for the data and there is only a problem of convergence, > or if it is the wrong formula. > > Can anybody give me some advice on this problem ? Or should I just > stick to reducing the data from each condition to a single parameter, > as explained in my first email below? > > Thank you again! > > Matthias Gralle wrote: >> Hello everybody, >> >> I have been trying for some weeks to state the correct design of my >> experiment as a GLM formula, and have not been able to find something >> appropriate in Pinheiro & Bates, so I am posting it here and hope >> somebody can help me. >> >> In each experimental condition, described by >> 1) gene (10 levels, fixed, because of high interest to me) >> 2) species (2 levels, fixed, because of high interest) >> 3) day (2 levels, random) >> 4) replicate (2 levels per day, random), >> >> I have several thousand data points consisting of two variables: >> >> 5) FITC (level of transfection of a cell) >> 6) APC (antibody binding to the cell) >> Because of intrinsic and uncontrollable cell-to-cell variation, FITC >> varies quite uniformly over a wide range, and APC correlates rather >> well with FITC. In some cases, I pasted day and replicate together as >> day_repl. >> >> My question is the following: >> >> Is there any gene (in my set of 10 genes) where the species makes a >> difference in the relation between FITC and APC ? If yes, in what >> gene does species have an effect ? And what is the effect of the >> species difference ? >> >> My attempts are the following: >> 1. Fit the data points of each experimental condition to a linear >> equation APC=Intercept+Slope*FITC and analyse the slopes : >> lm(Slope~species*gene*day_repl) >> This analysis shows clear differences between the genes, but no >> effect of species and no interaction gene:species. >> >> The linear fit to the cells is reasonably good, but of course does >> not represent the data set completely, so I wanted to incorporate the >> complete data set. >> >> 2a. lmer(APC~FITC*species*gene+(1|day)+(1|repl)) >> This gives extremely significant values for any interaction and >> variable because there are >200 000 df. Of course, it cannot be true, >> because the cells are not really independent. I have done many >> variations of the above, e.g. >> 2b. lmer(APC~FITC*species*gene+(1|day)+(1+FITC|day_repl)), >> but they all suffer from the excess of df. >> >> 3. lmer(APC~species*gene+(1|day/repl/FITC) gives several warning >> messages like this one: >> In repl:day : >> numerical expression has 275591 elements: only the first used >> >> 4. lmer(APC~gene*species+(1|day_repl) + (1+gene:species|FITC)) ran >> several days, but failed to converge... >> >> Can somebody give me any hint, or do you think the only possible >> analysis is a simplification as in my model 1 ? >> >> By the way, I am using R version 2.8.0 (2008-10-20) on Ubuntu 8.04 on >> a linux 2.6.24-24-generic kernel on different Intel systems. I am >> using the lme4 that came with R 2.8.0. >> >> Thank you very much for your time! >> >> -- Matthias Gralle, PhD >> Dept. Evolutionary Genetics >> Max Planck Institute for Evolutionary Anthropology >> Deutscher Platz 6 >> 04103 Leipzig, Germany >> Tel +49 341 3550 519 >> Fax +49 341 3550 555 >> >> ______________________________________________ >> R-help at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> >> >> > > > ------------------------------------------------------------------------ > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Matthias Gralle, PhD Dept. Evolutionary Genetics Max Planck Institute for Evolutionary Anthropology Deutscher Platz 6 04103 Leipzig, Germany Tel +49 341 3550 519 Fax +49 341 3550 555