Ted Byers
2010-Jul-12 19:10 UTC
[R] exercise in frustration: applying a function to subsamples
From the documentation I have found, it seems that one of the functions from package plyr, or a combination of functions like split and lapply, would allow me to have a really short R script to analyze all my data (I have reduced it to a couple hundred thousand records with about half a dozen fields).

I get the same result from ddply and split/lapply:

> ddply(moreinfo,c("m_id","sale_year","sale_week"),
> +       function(df) data.frame(res = fitdist(df$elapsed_time,"exp"),est = res$estimate,sd = res$sd))
> Error in fitdist(df$elapsed_time, "exp") :
>   data must be a numeric vector of length greater than 1

and

> lapply(split(moreinfo,list(moreinfo$m_id,moreinfo$sale_year,moreinfo$sale_week)),
> +       function(df) fitdist(df$elapsed_time,"exp"))
> Error in fitdist(df$elapsed_time, "exp") :
>   data must be a numeric vector of length greater than 1

Now, in retrospect, unless I misunderstood the properties of a data.frame, I suppose a data.frame might not have been entirely appropriate, as the m_id samples start and end on very different dates, but I would have thought a list data structure should have been able to handle that. It would seem that split is making groups that have the same start and end dates (or that if, for example, I have sale data for precisely the last year, split would insist on both 2009 and 2010 having weeks from 0 through 52 instead of just the weeks in each year that actually have data: 26 through 52 for last year and 1 through 25 for this year). I don't see how else the data passed to fitdist could have a sample size of 0.

I'd appreciate understanding how to resolve this. However, it isn't a show stopper, as it now seems trivial to just break it out into a loop (followed by a lapply/split combo using only sale year and sale month).

While I am asking, is there a better way to split such temporally ordered data into weekly samples that respect the year in which the sample is taken as well as the week in which it is taken?

Thanks

Ted
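On the closing question about weekly samples that respect the year, one possible approach (a sketch only, not something proposed in the thread) is to build a single year-week key from the sale date and split on that, so that no week is ever paired with a year it does not belong to. The column name sale_date and all values below are assumptions for illustration; the thread only mentions m_id, sale_year, sale_week and elapsed_time.

```r
# Hypothetical data: sale_date is an assumed column, values are made up.
sales <- data.frame(
  m_id         = c(1, 1, 1, 2, 2, 2),
  sale_date    = as.Date(c("2009-12-28", "2009-12-30", "2010-01-02",
                           "2010-07-05", "2010-07-06", "2010-07-12")),
  elapsed_time = c(3.2, 1.1, 0.7, 2.5, 4.0, 0.9)
)

# One key per week: ISO 8601 week-based year (%G) plus ISO week number (%V).
# A week that straddles New Year stays in one group.
sales$year_week <- format(sales$sale_date, "%G-W%V")

# drop = TRUE keeps only the m_id/week combinations that actually occur.
groups <- split(sales, list(sales$m_id, sales$year_week), drop = TRUE)

names(groups)
sapply(groups, nrow)
```

Support for %G and %V on output can depend on the platform. format(sale_date, "%Y-%U") is a more widely supported alternative, but %U counts weeks from the first Sunday of the year and puts any earlier days in week 00, so a week spanning New Year is split across two keys.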
Erik Iverson
2010-Jul-12 19:20 UTC
[R] exercise in frustration: applying a function to subsamples
Your code is not reproducible. Can you come up with a small example showing the crux of your data structures/problem that we can all run in our R sessions? You're likely to get much higher quality responses this way.
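A small reproducible example of the kind requested might look like the following. It is a minimal sketch: the data frame is a synthetic stand-in for moreinfo with made-up values, and it only inspects group sizes rather than calling fitdist, so it runs in base R without extra packages.

```r
# Synthetic stand-in for `moreinfo`; all values are invented for illustration.
moreinfo <- data.frame(
  m_id         = c(1, 1, 1, 2, 2, 2),
  sale_year    = c(2009, 2009, 2009, 2010, 2010, 2010),
  sale_week    = c(51, 51, 52, 1, 1, 2),
  elapsed_time = c(3.2, 1.1, 0.7, 2.5, 4.0, 0.9)
)

# With a list of grouping variables, split() crosses every level of every
# variable, so combinations that never occur still get a zero-row group.
groups <- split(moreinfo,
                list(moreinfo$m_id, moreinfo$sale_year, moreinfo$sale_week))

group_sizes <- sapply(groups, nrow)

length(groups)          # 2 ids x 2 years x 4 weeks = 16 groups
sum(group_sizes == 0)   # groups with no rows at all
sum(group_sizes == 1)   # groups with a single row
```

Both the empty groups and the single-row groups would fail a check that requires a numeric vector of length greater than 1, which matches the error message quoted in the original post.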
jim holtman
2010-Jul-12 20:02 UTC
[R] exercise in frustration: applying a function to subsamples
Try 'drop=TRUE' on the split function call. This will prevent the empty groups (the NULL set) from being sent to the function.

--
Jim Holtman
Cincinnati, OH

What is the problem that you are trying to solve?
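The drop=TRUE suggestion can be sketched as follows, again on a synthetic stand-in for moreinfo. Two things here are assumptions rather than part of the thread: the extra filter on group size, added because drop=TRUE removes empty groups but leaves single-row groups in place, and the helper fit_exp_rate, a hypothetical base-R stand-in for fitdist(x, "exp") that uses the maximum-likelihood estimate of the exponential rate, 1 / mean(x).

```r
# Synthetic stand-in for `moreinfo`; all values are invented for illustration.
moreinfo <- data.frame(
  m_id         = c(1, 1, 1, 2, 2, 2),
  sale_year    = c(2009, 2009, 2009, 2010, 2010, 2010),
  sale_week    = c(51, 51, 52, 1, 1, 2),
  elapsed_time = c(3.2, 1.1, 0.7, 2.5, 4.0, 0.9)
)

# drop = TRUE discards factor-level combinations that do not occur in the data.
groups <- split(moreinfo,
                list(moreinfo$m_id, moreinfo$sale_year, moreinfo$sale_week),
                drop = TRUE)

# Single-row groups survive drop = TRUE, so filter them out explicitly.
big_enough <- Filter(function(df) nrow(df) > 1, groups)

# Hypothetical stand-in for fitdist(x, "exp"): MLE of the exponential rate.
fit_exp_rate <- function(x) 1 / mean(x)

rates <- sapply(big_enough, function(df) fit_exp_rate(df$elapsed_time))
rates
```

With fitdistrplus installed, the stand-in could be swapped for the fitdist call from the original post. Note that ddply already drops unused combinations by default (its .drop argument), so the same error from ddply more likely points at groups with a single observation.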