thr3ads.net - R help - [R] exercise in frustration: applying a function to subsamples [Jul 2010]

Ted Byers

2010-Jul-12 19:10 UTC

[R] exercise in frustration: applying a function to subsamples

From the documentation I have found, it seems that one of the functions from
package plyr, or a combination of functions like split and lapply, would
allow me to have a really short R script to analyze all my data (I have
reduced it to a couple hundred thousand records with about half a dozen
fields).

I get the same result from ddply and split/lapply:

> ddply(moreinfo,c("m_id","sale_year","sale_week"),
+       function(df) data.frame(res = fitdist(df$elapsed_time,"exp"),est = res$estimate,sd = res$sd))
Error in fitdist(df$elapsed_time, "exp") :
  data must be a numeric vector of length greater than 1

and

> lapply(split(moreinfo,list(moreinfo$m_id,moreinfo$sale_year,moreinfo$sale_week)),
+       function(df) fitdist(df$elapsed_time,"exp"))
Error in fitdist(df$elapsed_time, "exp") :
  data must be a numeric vector of length greater than 1

Now, in retrospect, unless I misunderstood the properties of a data.frame, I
suppose a data.frame might not have been entirely appropriate as the m_id
samples start and end on very different dates, but I would have thought a
list data structure should have been able to handle that.  It would seem
that split is making groups that have the same start and end dates (or that
if, for example, I have sale data for precisely the last year, split would
insist on both 2009 and 2010 having weeks from 0 through 52 instead of just
the weeks in each year that actually have data: 26 through 52 for last year
and 1 through 25 for this year).  I don't see how else the data passed to
fitdist could have a sample size of 0.
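
The behaviour suspected above can be checked on a small example. The sketch below uses made-up toy data (only the column names come from the post) and shows that split() on a list of grouping variables keeps every combination of their values by default, including combinations that never occur in the data:

```r
# Toy data (hypothetical): two years with non-overlapping weeks.
toy <- data.frame(
  sale_year    = c(2009, 2009, 2009, 2010, 2010, 2010),
  sale_week    = c(51, 51, 52, 1, 1, 2),
  elapsed_time = c(1.2, 0.7, 3.1, 0.4, 2.2, 1.9)
)

# split() on a list of grouping variables works on their interaction,
# which by default keeps every year x week combination.
groups <- split(toy, list(toy$sale_year, toy$sale_week))

# Row counts per group: combinations such as "2009.1" have zero rows.
group_sizes <- sapply(groups, nrow)
print(group_sizes)
```

Any zero-row group is still passed on by lapply, which would be enough to trigger the length check reported in the error message.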

I'd appreciate understanding how to resolve this.  However, it isn't a show
stopper, as it now seems trivial to just break it out into a loop (followed
by a lapply/split combo using only sale year and sale month).

While I am asking, is there a better way to split such temporally ordered
data into weekly samples that respect the year in which the sample is
taken as well as the week in which it is taken?
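
One possible approach, sketched here under assumptions, is to build a single combined year-week key from the sale date and split on that. The `sale_date` column and the toy values are hypothetical and do not appear in the post; `%U` is base R's week-of-year code with Sunday as the first day of the week.

```r
# Toy data; the `sale_date` column is an assumption, not taken from the post.
toy <- data.frame(
  sale_date    = as.Date(c("2009-12-28", "2009-12-29", "2010-01-04",
                           "2010-01-05", "2010-01-12")),
  elapsed_time = c(1.2, 0.7, 3.1, 0.4, 2.2)
)

# One combined year-week key per row.
# %Y is the calendar year, %U the week of the year (00-53, weeks start on Sunday).
toy$year_week <- format(toy$sale_date, "%Y-%U")

# Splitting on a single character key only yields groups that occur in the data.
weekly <- split(toy$elapsed_time, toy$year_week)
print(sapply(weekly, length))
```

Because the key is one vector rather than a list of separate year and week variables, no year x week combinations are generated beyond those actually observed.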

Thanks

Ted


Erik Iverson

2010-Jul-12 19:20 UTC


Your code is not reproducible.  Can you come up with a small example
showing the crux of your data structures/problem that we can all run in
our R sessions?  You're likely to get much higher quality responses this way.


jim holtman

2010-Jul-12 20:02 UTC


Try 'drop=TRUE' in the split function call.  This will prevent empty
subsets from being passed to the function.
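
A minimal sketch of this suggestion, on made-up data with the column names from the original post. To keep it runnable with base R only, `fit_exp` is a hypothetical stand-in for fitdist(x, "exp"): it applies the same length check and returns the closed-form maximum-likelihood rate, 1/mean(x). The extra filter on group size is an addition to the suggestion, on the assumption that single-row groups would fail the same check.

```r
# Toy data (hypothetical), using the column names from the original post.
moreinfo <- data.frame(
  m_id         = c(1, 1, 1, 2, 2, 2),
  sale_year    = c(2009, 2009, 2009, 2010, 2010, 2010),
  sale_week    = c(51, 51, 52, 1, 1, 1),
  elapsed_time = c(1.2, 0.7, 3.1, 0.4, 2.2, 1.9)
)

# Stand-in for fitdist(x, "exp"): same length check, closed-form MLE of the rate.
fit_exp <- function(x) {
  if (!is.numeric(x) || length(x) < 2)
    stop("data must be a numeric vector of length greater than 1")
  c(rate = 1 / mean(x))
}

# drop = TRUE discards combinations of the grouping variables that have no rows.
groups <- split(moreinfo,
                list(moreinfo$m_id, moreinfo$sale_year, moreinfo$sale_week),
                drop = TRUE)

# Groups with a single row would still fail the length check, so skip them too.
groups <- Filter(function(df) nrow(df) > 1, groups)

fits <- lapply(groups, function(df) fit_exp(df$elapsed_time))
print(fits)
```

With the real fitdist, the last lapply call would use fitdist(df$elapsed_time, "exp") in place of the stand-in.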

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
