Sarah,

I have 669 sites and each site has 7 years of data, so if I'm thinking
correctly there should be 4683 possible combinations of site x year.
For each year, though, I need 3 sampling periods, so that there is
something like the following:

site 1 year1 sample 1
site 1 year1 sample 2
site 1 year1 sample 3
site 2 year1 sample 1
site 2 year1 sample 2
site 2 year1 sample 3
.....
site 669 year7 sample 1
site 669 year7 sample 2
site 669 year7 sample 3

I have my max memory allocation set to the amount of RAM (8GB) on my
laptop, but it still 'times out' due to memory problems.

On Tue, Mar 10, 2015 at 2:50 PM, Sarah Goslee <sarah.goslee at gmail.com> wrote:
> You said your data only had 14000 rows, which really isn't many.
>
> How many possible combinations do you have, and how many do you need
> to add?
>
> On Tue, Mar 10, 2015 at 4:35 PM, Curtis Burkhalter
> <curtisburkhalter at gmail.com> wrote:
> > Sarah,
> >
> > This strategy works great for this small dataset, but when I attempt
> > your method with my data set I reach the maximum allowable memory
> > allocation and the operation just stalls and then stops completely
> > before it is finished. Do you know of a way around this?
> >
> > Thanks
> >
> > On Tue, Mar 10, 2015 at 2:04 PM, Sarah Goslee <sarah.goslee at gmail.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> I didn't work through your code, because it looked overly complicated.
> >> Here's a more general approach that does what you appear to want:
> >>
> >> # use dput() to provide reproducible data please!
> >> comAn <- structure(list(animals = c("bird", "bird", "bird", "bird",
> >> "bird", "bird", "dog", "dog", "dog", "dog", "dog", "dog", "cat", "cat",
> >> "cat", "cat"), animalYears = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L,
> >> 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L), animalMass = c(29L, 48L, 36L,
> >> 20L, 34L, 34L, 21L, 28L, 25L, 35L, 18L, 11L, 46L, 33L, 48L, 21L
> >> )), .Names = c("animals", "animalYears", "animalMass"),
> >> class = "data.frame", row.names = c("1",
> >> "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
> >> "14", "15", "16"))
> >>
> >> # add reps to comAn
> >> # assumes comAn is already sorted on animals, animalYears
> >> comAn$reps <- unlist(sapply(rle(do.call("paste",
> >>     comAn[,1:2]))$lengths, seq_len))
> >>
> >> # create full set of combinations
> >> outgrid <- expand.grid(animals=unique(comAn$animals),
> >>     animalYears=unique(comAn$animalYears), reps=unique(comAn$reps),
> >>     stringsAsFactors=FALSE)
> >>
> >> # combine with comAn
> >> comAn.full <- merge(outgrid, comAn, all.x=TRUE)
> >>
> >> > comAn.full
> >>    animals animalYears reps animalMass
> >> 1     bird           1    1         29
> >> 2     bird           1    2         48
> >> 3     bird           1    3         36
> >> 4     bird           2    1         20
> >> 5     bird           2    2         34
> >> 6     bird           2    3         34
> >> 7      cat           1    1         46
> >> 8      cat           1    2         33
> >> 9      cat           1    3         48
> >> 10     cat           2    1         21
> >> 11     cat           2    2         NA
> >> 12     cat           2    3         NA
> >> 13     dog           1    1         21
> >> 14     dog           1    2         28
> >> 15     dog           1    3         25
> >> 16     dog           2    1         35
> >> 17     dog           2    2         18
> >> 18     dog           2    3         11
> >>
> >> On Tue, Mar 10, 2015 at 3:43 PM, Curtis Burkhalter
> >> <curtisburkhalter at gmail.com> wrote:
> >> > Hey everyone,
> >> >
> >> > I've written a function that adds NAs to a dataframe where data is
> >> > missing, and it seems to work great if I only need to run it once,
> >> > but if I run it two times in a row I run into problems. I've created
> >> > a workable example to explain what I mean and why I would do this.
> >> >
> >> > In my dataframe there are areas where I need to add two rows of NAs
> >> > (b/c I need to have 3 animal x year combos and for cat in year 2 I
> >> > only have one), so I thought that I'd just run my code twice using
> >> > the function in the code below. Everything works great when I run it
> >> > the first time, but when I run it again it says that the value
> >> > returned to the list 'x' is of length 0. I don't understand why the
> >> > function works the first time around and adds an NA to the
> >> > 'animalMass' column, but won't do it again. I've used
> >> > print(str(dataframe)) to see if there is a change in class or type
> >> > when the function runs through the original dataframe, and there is
> >> > for 'animalYears', but I just convert it back before rerunning the
> >> > function for the second time.
> >> >
> >> > Any thoughts on this would be greatly appreciated b/c the actual
> >> > dataframe I have to input into WinBUGS is 14000x12, so it's not a
> >> > trivial thing to just add in an NA here or there.
> >> >
> >> > > comAn
> >> >    animals animalYears animalMass
> >> > 1     bird           1         29
> >> > 2     bird           1         48
> >> > 3     bird           1         36
> >> > 4     bird           2         20
> >> > 5     bird           2         34
> >> > 6     bird           2         34
> >> > 7      dog           1         21
> >> > 8      dog           1         28
> >> > 9      dog           1         25
> >> > 10     dog           2         35
> >> > 11     dog           2         18
> >> > 12     dog           2         11
> >> > 13     cat           1         46
> >> > 14     cat           1         33
> >> > 15     cat           1         48
> >> > 16     cat           2         21
> >> >
> >> > So every animal has 3 measurements per year, except for the cat in
> >> > year two, which has only 1. I run the code below and get:
> >> >
> >> > #combs defines the different combinations of
> >> > #animals and animalYears
> >> > combs<-paste(comAn$animals,comAn$animalYears,sep=':')
> >> > #counts defines how long the different combinations are
> >> > counts<-ave(1:nrow(comAn),combs,FUN=length)
> >> > #missing defines the combs that have length less than one and puts
> >> > #it in the data frame missing
> >> > missing<-data.frame(vals=combs[counts<2],count=counts[counts<2])
> >> >
> >> > genRows<-function(dat){
> >> >   vals<-strsplit(dat[1],':')[[1]]
> >> >   #not sure why dat[2] is being converted to a string
> >> >   newRows<-2-as.numeric(dat[2])
> >> >   newDf<-data.frame(animals=rep(vals[1],newRows),
> >> >                     animalYears=rep(vals[2],newRows),
> >> >                     animalMass=rep(NA,newRows))
> >> >   return(newDf)
> >> > }
> >> >
> >> > x<-apply(missing,1,genRows)
> >> > comAn=rbind(comAn,
> >> >             do.call(rbind,x))
> >> >
> >> > > comAn
> >> >    animals animalYears animalMass
> >> > 1     bird           1         29
> >> > 2     bird           1         48
> >> > 3     bird           1         36
> >> > 4     bird           2         20
> >> > 5     bird           2         34
> >> > 6     bird           2         34
> >> > 7      dog           1         21
> >> > 8      dog           1         28
> >> > 9      dog           1         25
> >> > 10     dog           2         35
> >> > 11     dog           2         18
> >> > 12     dog           2         11
> >> > 13     cat           1         46
> >> > 14     cat           1         33
> >> > 15     cat           1         48
> >> > 16     cat           2         21
> >> > 17     cat           2       <NA>
> >> >
> >> > So far so good, but then I adjust the code so that it reads
> >> > (**notice the change in the specification in 'missing' to
> >> > counts<3**):
> >> >
> >> > #combs defines the different combinations of
> >> > #animals and animalYears
> >> > combs<-paste(comAn$animals,comAn$animalYears,sep=':')
> >> > #counts defines how long the different combinations are
> >> > counts<-ave(1:nrow(comAn),combs,FUN=length)
> >> > #missing defines the combs that have length less than one and puts
> >> > #it in the data frame missing
> >> > missing<-data.frame(vals=combs[counts<3],count=counts[counts<3])
> >> >
> >> > genRows<-function(dat){
> >> >   vals<-strsplit(dat[1],':')[[1]]
> >> >   #not sure why dat[2] is being converted to a string
> >> >   newRows<-2-as.numeric(dat[2])
> >> >   newDf<-data.frame(animals=rep(vals[1],newRows),
> >> >                     animalYears=rep(vals[2],newRows),
> >> >                     animalMass=rep(NA,newRows))
> >> >   return(newDf)
> >> > }
> >> >
> >> > x<-apply(missing,1,genRows)
> >> > comAn=rbind(comAn,
> >> >             do.call(rbind,x))
> >> >
> >> > The result for 'x' then reads:
> >> >
> >> > > x
> >> > [[1]]
> >> > [1] animals     animalYears animalMass
> >> > <0 rows> (or 0-length row.names)
> >> >
> >> > Any thoughts on why it might be doing this instead of adding an
> >> > additional row to get the result:
> >> >
> >> > > comAn
> >> >    animals animalYears animalMass
> >> > 1     bird           1         29
> >> > 2     bird           1         48
> >> > 3     bird           1         36
> >> > 4     bird           2         20
> >> > 5     bird           2         34
> >> > 6     bird           2         34
> >> > 7      dog           1         21
> >> > 8      dog           1         28
> >> > 9      dog           1         25
> >> > 10     dog           2         35
> >> > 11     dog           2         18
> >> > 12     dog           2         11
> >> > 13     cat           1         46
> >> > 14     cat           1         33
> >> > 15     cat           1         48
> >> > 16     cat           2         21
> >> > 17     cat           2       <NA>
> >> > 18     cat           2       <NA>
> >> >
> >> > Thanks
> >> > --
> >> > Curtis Burkhalter

--
Curtis Burkhalter
https://sites.google.com/site/curtisburkhalter/
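On the original question quoted above: genRows() computes newRows as 2 minus the group count, so once cat/year 2 has two rows the second pass asks for 2 - 2 = 0 new rows, even though 'missing' was changed to counts<3. (The count arrives as a string because apply() coerces the data frame to a character matrix.) A minimal sketch of a single-pass alternative, assuming a hypothetical helper name pad_groups and a hypothetical target argument, neither of which appears in the thread:

```r
# Toy data matching the example in the thread: cat in year 2 has only 1 row
comAn <- data.frame(
  animals = rep(c("bird", "dog", "cat"), times = c(6, 6, 4)),
  animalYears = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L),
  animalMass = c(29L, 48L, 36L, 20L, 34L, 34L, 21L, 28L,
                 25L, 35L, 18L, 11L, 46L, 33L, 48L, 21L),
  stringsAsFactors = FALSE
)

# pad_groups() is a hypothetical helper: it pads every animal x year group
# with NA rows until the group has `target` rows, in a single pass.
pad_groups <- function(df, target = 3L) {
  key <- paste(df$animals, df$animalYears, sep = ":")
  counts <- table(key)
  short <- counts[counts < target]       # groups with too few rows
  if (length(short) == 0L) return(df)    # nothing to add
  need <- target - as.integer(short)     # rows still missing per group
  parts <- strsplit(names(short), ":", fixed = TRUE)
  extra <- data.frame(
    animals = rep(vapply(parts, `[`, character(1), 1L), times = need),
    animalYears = rep(as.integer(vapply(parts, `[`, character(1), 2L)),
                      times = need),
    animalMass = NA_integer_,
    stringsAsFactors = FALSE
  )
  rbind(df, extra)
}

padded <- pad_groups(comAn)
```

Because the number of rows to add is derived from target rather than a hard-coded 2, there is no need to run the function twice or to edit the threshold between runs.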
Yeah, that's tiny:

> fullout <- expand.grid(site=1:669, year=1:7, sample=1:3)
> dim(fullout)
[1] 14049     3

Almost certainly the problem is that your expand.grid result doesn't
have the same column names as your actual data file, so merge() is
trying to make an enormous result. Note how when I made outgrid in the
example I named the columns.

Make sure that the names are identical!
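To illustrate the point about column names: by default merge() joins on whatever column names the two data frames share, so a misspelled key silently drops out of the join and rows get matched far too liberally. A small sketch with made-up data (the data frame names dat, good and bad, the mass column, and the misspelling "sit" are all hypothetical):

```r
# Made-up observations: site 1 was sampled twice, site 2 once
dat <- data.frame(site = c(1L, 1L, 2L), year = c(1L, 1L, 1L),
                  sample = c(1L, 2L, 1L), mass = c(10, 12, 9))

good <- expand.grid(site = 1:2, year = 1L, sample = 1:3)
bad  <- expand.grid(sit  = 1:2, year = 1L, sample = 1:3)  # misspelled key

# Names match: one row per grid combination
ok <- merge(good, dat, all.x = TRUE)

# Names differ: merge() joins on year and sample only, so every grid row
# is paired with every data row sharing that year and sample
oops <- merge(bad, dat, all.x = TRUE)

nrow(ok)    # same as nrow(good)
nrow(oops)  # more rows than the grid

# Naming the keys explicitly turns the silent blow-up into an error
err <- tryCatch(
  merge(bad, dat, by = c("site", "year", "sample"), all.x = TRUE),
  error = function(e) e
)
```

On a real data set with thousands of rows per shared key, the same mismatch can multiply the row count enough to exhaust memory, so passing by= explicitly is a cheap safeguard.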
Thanks Sarah, one of my column names was missing a letter so it was
throwing things off. It works super fast now and is exactly what I
needed.

My actual data set has about 6 other ancillary response data columns.
Is there a way to combine the 'full' data set I just created with the
original, in case I need any of the other response variables? E.g.

FULL:              Original:                        Combined:
site year sample   site year sample color shape     site year sample color shape
1    1    10       1    1    10     blue  diamond   1    1    10     blue  diamond
1    1    12       1    1    12     green pyramid   1    1    12     green pyramid
1    1    NA                                        1    1    NA     NA    NA

Thanks

--
Curtis Burkhalter
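One possible way to do what is asked above, offered as a sketch rather than as the answer given in the thread: merge() keeps every non-key column of both inputs, so merging the full grid against the original data (instead of a three-column subset) carries the ancillary columns along, with NA wherever a combination was not observed. The column names color and shape come from the example above; treating sample as the sampling-period index 1:3 and the data values themselves are assumptions.

```r
# Hypothetical original data: two of three sampling periods observed
original <- data.frame(
  site = c(1L, 1L), year = c(1L, 1L), sample = c(1L, 2L),
  color = c("blue", "green"), shape = c("diamond", "pyramid"),
  stringsAsFactors = FALSE
)

# Full grid of site x year x sample combinations
full <- expand.grid(site = 1L, year = 1L, sample = 1:3)

# all.x = TRUE keeps every grid row; ancillary columns come from `original`
combined <- merge(full, original, by = c("site", "year", "sample"),
                  all.x = TRUE)
combined
```

The unobserved combination (sample 3 here) appears with NA in color and shape, which matches the layout of the "Combined" table in the question.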
You may find it beneficial to investigate the packages dplyr and
data.table, or a combination of the two, for handling large data sets
in memory. Or, perhaps, dplyr with a SQL back end for working on disk
(I have not tried that myself yet).

I do find your excuse for manufacturing data records uncompelling,
though. If the information necessary to draw valid conclusions is
absent, the results you obtain by doing so are going to be questionable
at best.

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
to counts<3**): >> >> > >> >> > #combs defines the different combinations of >> >> > #animals and animalYears >> >> > combs<-paste(comAn$animals,comAn$animalYears,sep=':') >> >> > #counts defines how long the different combinations are >> >> > counts<-ave(1:nrow(comAn),combs,FUN=length) >> >> > #missing defines the combs that have length less than one and >puts it >> in >> >> > #the data frame missing >> >> > missing<-data.frame(vals=combs[counts<3],count=counts[counts<3]) >> >> > >> >> > genRows<-function(dat){ >> >> > vals<-strsplit(dat[1],':')[[1]] >> >> > #not sure why dat[2] is being converted to a >string >> >> > newRows<-2-as.numeric(dat[2]) >> >> > newDf<-data.frame(animals=rep(vals[1],newRows), >> >> > animalYears=rep(vals[2],newRows), >> >> > animalMass=rep(NA,newRows)) >> >> > return(newDf) >> >> > } >> >> > >> >> > >> >> > x<-apply(missing,1,genRows) >> >> > comAn=rbind(comAn, >> >> > do.call(rbind,x)) >> >> > >> >> > The result for 'x' then reads: >> >> > >> >> >> x >> >> > [[1]] >> >> > [1] animals animalYears animalMass >> >> > <0 rows> (or 0-length row.names) >> >> > >> >> > Any thoughts on why it might be doing this instead of adding an >> >> > additional >> >> > row to get the result: >> >> > >> >> >> comAn >> >> > animals animalYears animalMass >> >> > 1 bird 1 29 >> >> > 2 bird 1 48 >> >> > 3 bird 1 36 >> >> > 4 bird 2 20 >> >> > 5 bird 2 34 >> >> > 6 bird 2 34 >> >> > 7 dog 1 21 >> >> > 8 dog 1 28 >> >> > 9 dog 1 25 >> >> > 10 dog 2 35 >> >> > 11 dog 2 18 >> >> > 12 dog 2 11 >> >> > 13 cat 1 46 >> >> > 14 cat 1 33 >> >> > 15 cat 1 48 >> >> > 16 cat 2 21 >> >> > 17 cat 2 <NA> >> >> > 18 cat 2 <NA> >> >> > >> >> > Thanks >> >> > -- >> >> > Curtis Burkhalter >> > >> > >>
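
For what it's worth, here is a minimal sketch of why the second pass in the quoted code comes back with zero rows, and one way to pad every group in a single pass. In genRows() the number of rows to add is 2 - count, so once cat:2 already has two rows the function asks for 2 - 2 = 0 rows, even though the filter was changed to counts<3. apply() also turns the data frame into a character matrix before calling genRows(), which is why dat[2] arrives as a string, and 'missing' holds one row per observation rather than one per combination. The helper name pad_to_target and its target argument are my own, not from the thread, and the sketch assumes the column names used in the example.

```r
# Example data from the thread (16 rows; cat in year 2 has a single record).
comAn <- data.frame(
  animals     = rep(c("bird", "dog", "cat"), times = c(6, 6, 4)),
  animalYears = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L),
  animalMass  = c(29L, 48L, 36L, 20L, 34L, 34L, 21L, 28L, 25L, 35L, 18L, 11L,
                  46L, 33L, 48L, 21L),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from the thread): append NA rows so that every
# animals x animalYears combination present in 'dat' has 'target' rows.
pad_to_target <- function(dat, target = 3L) {
  # One row per combination, not one row per observation.
  combos <- unique(dat[, c("animals", "animalYears")])

  # Number of rows currently present for each combination.
  n_obs <- mapply(
    function(a, y) sum(dat$animals == a & dat$animalYears == y),
    combos$animals, combos$animalYears
  )

  # Rows still needed: target minus what is there, never negative.
  deficit <- pmax(target - n_obs, 0L)

  # Repeat each combination 'deficit' times and give it a missing mass.
  filler <- combos[rep(seq_len(nrow(combos)), times = deficit), , drop = FALSE]
  filler$animalMass <- rep(NA_integer_, nrow(filler))

  out <- rbind(dat, filler)
  out <- out[order(out$animals, out$animalYears), ]
  rownames(out) <- NULL
  out
}

full <- pad_to_target(comAn, target = 3L)
```

Rerunning pad_to_target() on its own output should add nothing further, since the deficits are recomputed from the current counts each time. Note that it only pads combinations that already appear at least once; a site x year with no rows at all would still need the expand.grid() step from Sarah's reply.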