I want to do an unbalanced anova on 272,992 observations with 405 factors,
including 2-way interactions between 1 of these factors and the other 404.
After fitting only 11 factors and their interactions I get error messages
like:

Error: cannot allocate vector of size 1433066 Kb
R(365,0xa000ed68) malloc: *** vm_allocate(size=1467461632) failed (error code=3)
R(365,0xa000ed68) malloc: *** error: can't allocate region
R(365,0xa000ed68) malloc: *** set a breakpoint in szone_error to debug

I think the anova involves a matrix of 272,992 rows by 29,025 columns (using
dummy variables) = 7,900 million elements. I realise this is a lot! Could I
solve this with more RAM, or is it just too big?

Another possibility is to do 16 separate analyses on 17,062 observations
with 404 factors (although statistically I think the first approach is
preferable). I then get similar error messages:

Error: cannot allocate vector of size 175685 Kb
R(365,0xa000ed68) malloc: *** vm_allocate(size=179904512) failed (error code=3)

I think this analysis requires a 31-million-element matrix.

I am using R version 2.2.1 on a Mac G5 with 1 GB RAM running OS 10.4.4. Can
somebody tell me what the limitations of my machine (or R) are likely to be?
Is the smaller analysis feasible, and if so, how much more memory might I
require?

The data is in R in a data frame of 272,992 rows by 406 columns. I would
really appreciate any helpful input.

Lucy Crooks
Theoretical Biology
ETH Zurich
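As a back-of-the-envelope check of whether the full model matrix could ever
fit in memory, the element and byte counts can be computed directly. This is
only a sketch using the row and column counts quoted above, and it ignores
the extra working copies that aov()/lm() make during fitting:

n_rows <- 272992            # observations
n_cols <- 29025             # dummy-variable columns implied by the model
n_elem <- n_rows * n_cols   # about 7.9e9 elements
bytes  <- n_elem * 8        # numeric (double) storage, 8 bytes per element
bytes / 2^30                # roughly 59 GiB for the model matrix alone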
Lucy Crooks <Lucy.Crooks at env.ethz.ch> writes:

> I want to do an unbalanced anova on 272,992 observations with 405
> factors including 2-way interactions between 1 of these factors and
> the other 404. [...]

You do not want to use aov() on unbalanced data, and especially not on
large data sets if random effects are involved. Rather, you need to look
at lmer(), or just lm() if no random effects are present.

However, even so, if you really have 29,025 parameters to estimate, I
think you're out of luck. 8 billion (US) elements is 64 GB, and R is not
able to handle objects of that size - the limit is that the size must fit
in a 32-bit integer (about 2 billion elements).

A quick calculation suggests that your factors have around 8 levels each.
Is that really necessary, or can you perhaps collapse some levels?

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                FAX: (+45) 35327907
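To see where a column count like 29,025 comes from, and how quickly the
interactions inflate the model matrix, one can count columns on a tiny
synthetic example; the data frame, factor names, and level counts below are
made up purely for illustration:

## Tiny synthetic data set (names and level counts are illustrative only).
d <- data.frame(y  = rnorm(60),
                f1 = gl(4, 15),   # 4-level factor, like one "position"
                f2 = gl(3, 20),
                f3 = gl(5, 12))

## Main effects for all factors plus interactions of f1 with the others,
## i.e. the structure of the model described above, in miniature.
X <- model.matrix(~ f1 * (f2 + f3), data = d)
ncol(X)   # 1 + 3 + 2 + 4 + 3*2 + 3*4 = 28 columns with treatment contrasts

Scaled up to 405 factors with around 8 levels each, the same arithmetic gives
tens of thousands of columns, which is why collapsing levels matters so much.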
I don't know what the goal of the analysis is, but I have a suspicion that
the `gbm' package might be a more fruitful way...

Cheers,
Andy

From: Lucy Crooks
> Thanks for your reply.
>
> Thanks for the info on aov - I hadn't been able to tell which to use from
> the help pages. There are no random effects, so I will switch to lm().
>
> The data are amino acid sequences, with the factor being position and the
> level which amino acid is present. There are indeed an average of around
> 8 levels per position (from 2 to 20). I don't think I can collapse the
> levels, at least to start with, as I don't know in advance which affect
> fitness (the y variable).
>
> From what you say, R should be able to do the smaller analysis, so I have
> increased the RAM and will try this again.
>
> Lucy Crooks
>
> On Feb 1, 2006, at 3:45 PM, Peter Dalgaard wrote:
> > You do not want to use aov() on unbalanced data, and especially not on
> > large data sets if random effects are involved. Rather, you need to
> > look at lmer() or just lm() if no random effects are present. [...]
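For concreteness, the two suggestions might look roughly like the sketch
below. The data frame name `seqdat', the response `fitness', and the name of
the special position factor `pos1' are assumptions for illustration, not
names from Lucy's data, and the gbm tuning values are arbitrary:

## lm() with main effects for every position plus interactions between
## position 1 and all the other positions ("." expands to the remaining
## columns of the data frame; R drops the redundant pos1:pos1 term).
fit_lm <- lm(fitness ~ pos1 * ., data = seqdat)

## Andy's alternative: boosted regression trees from the gbm package,
## which never forms the full dummy-variable model matrix.
library(gbm)
fit_gbm <- gbm(fitness ~ ., data = seqdat, distribution = "gaussian",
               n.trees = 2000, interaction.depth = 2, shrinkage = 0.01)

The lm() route still has to build and decompose the huge model matrix, so the
memory arithmetic above applies; the tree-based route trades that for a model
that is harder to interpret in anova terms.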