thr3ads.net - R help - [R] Calculate average of many subsets based on columns in another dataframe [Feb 2016]

If this information is useful, please help other people find it:
Share via:

Bert Gunter

2016-Feb-10 21:26 UTC

[R] Calculate average of many subsets based on columns in another dataframe

A strategy:

 1. Convert your dates and intervals to numerics that give the days
since a time origin. See as.POSIXlt (or ** ct for details and an
example that does this). Should be fast...

2. Use the findInterval() function to get the interval into which each
date falls. This **is** "vectorized" and should be fairly fast.

3. Use the ave() function using the intervals as your factor that
splits your data column(s) for which you wish to compute statistics.
The basic statistics functions like mean, sum, etc. **are**
vectorized, so this should be fast.

As David said, the *apply functions will probably not be much, if at
all, faster than an explicit for() loop. Most of the time will be
spent spent comparing the dates to the intervals to find in which each
falls, and findInterval is a fast way to do this.

... I think.

If you try this, let me know (perhaps privately) how/if it works.

Cheers,
Bert
Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Wed, Feb 10, 2016 at 1:08 PM, David Winsemius <dwinsemius at
comcast.net> wrote:>
>> On Feb 10, 2016, at 12:18 PM, Peter Lomas <peter.br.lomas at
gmail.com> wrote:
>>
>> Hello, I have a dataframe with a date range, and another dataframe
>> with observations by date.  For each date range, I'd like to
average
>> the values within that range from the other dataframe.  I've
provided
>> code below doing what I would like, but using a for loop is too
>> inefficient for my actual case (takes about an hour).  So I'm
looking
>> for a way to vectorize.
>>
>>
>> set.seed(345)
>> date.range <-
seq(as.POSIXct("2015-01-01"),as.POSIXct("2015-06-01"),
>> by="DSTday")
>> observations <- data.frame(date=date.range, values=runif(152,1,100)
)
>> groups <- data.frame(start=sample(date.range[1:50], 20), end
>> sample(date.range[51:152], 20), average = NA)
>>
>> #Potential Solution (too inefficient)
>>
>> for(i in 1:NROW(groups)){
>> groups[i, "average"] <-
mean(observations[observations$date >>> groups[i, "start"]
& observations$date <=groups[i, "end"], "values"])
>> }
>>
> The 'average' column could be added to groups with this value:
>
> mapply( function(start,end){ mean(observations[['values']][
>                      observations$date >= start & observations$date
<=end])},
>         groups$start, groups$end)
>
>  [1] 50.96831 49.42286 47.27240 49.07534 47.66570 49.30977 48.47503
47.74036
>  [9] 46.02527 58.76492 48.86580 49.90655 45.79705 48.84071 39.53846
46.44601
> [17] 47.06631 47.74199 49.16980 46.85131
>
> I don't really think this is fully "vectorized" in the usual
R-meaning of the word. And I don't expect it to be any faster than the
for-loop. Perhaps some of the range functions in the data.table package could
accelerate your processing. If you don't get any volunteers in this list,
you could repost the question on StackOverflow after a suitable pause that
avoids accusations of cross-posting. SO has several skilled users of data.table
functions.
>
>> As an extension to this, there will end up being multiple value
>> columns, and each range will also identify which column to average.  I
>> think if I can figure out the first problem I can try to extend it
>> myself.
>
> Sorry, I didn't understand what was being described in that paragraph.
>
> --
>
> David Winsemius
> Alameda, CA, USA
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Peter Lomas

2016-Feb-10 22:18 UTC

head link

[R] Calculate average of many subsets based on columns in another dataframe

Thanks David, Bert,
>From what I'm reading on ?findInterval, It may not be workable becauseof overlapping date ranges.  findInterval seems to take a series of
bin breakpoints as its argument. I'm currently exploring data.table
documentation and will keep thinking about this.

Just on David's point, the extension of this with "groups" would
look
as below.  I just don't want to complicate it before I've solved the
simplest issue.

set.seed(345)
date.range <-
seq(as.POSIXct("2015-01-01"),as.POSIXct("2015-06-01"),
by="DSTday")
observations <- data.frame(date=date.range, a=runif(152,1,100),
b=runif(152,1,100), c=runif(152,1,100) )
groups <- data.frame(start=sample(date.range[1:50], 20), end
sample(date.range[51:152], 20), group=sample(letters[1:3], 20,
replace=TRUE), average = NA)

#Potential Solutions (too inefficient)
for(i in 1:NROW(groups)){
 groups[i, "average"] <- mean(observations[observations$date
>groups[i, "start"] & observations$date <=groups[i,
"end"],
as.character(groups[i, "group"])])
}

Thanks again,
Peter

On Wed, Feb 10, 2016 at 2:26 PM, Bert Gunter <bgunter.4567 at gmail.com>
wrote:> A strategy:
>
>  1. Convert your dates and intervals to numerics that give the days
> since a time origin. See as.POSIXlt (or ** ct for details and an
> example that does this). Should be fast...
>
> 2. Use the findInterval() function to get the interval into which each
> date falls. This **is** "vectorized" and should be fairly fast.
>
> 3. Use the ave() function using the intervals as your factor that
> splits your data column(s) for which you wish to compute statistics.
> The basic statistics functions like mean, sum, etc. **are**
> vectorized, so this should be fast.
>
> As David said, the *apply functions will probably not be much, if at
> all, faster than an explicit for() loop. Most of the time will be
> spent spent comparing the dates to the intervals to find in which each
> falls, and findInterval is a fast way to do this.
>
> ... I think.
>
> If you try this, let me know (perhaps privately) how/if it works.
>
> Cheers,
> Bert
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip
)
>
>
> On Wed, Feb 10, 2016 at 1:08 PM, David Winsemius <dwinsemius at
comcast.net> wrote:
>>
>>> On Feb 10, 2016, at 12:18 PM, Peter Lomas <peter.br.lomas at
gmail.com> wrote:
>>>
>>> Hello, I have a dataframe with a date range, and another dataframe
>>> with observations by date.  For each date range, I'd like to
average
>>> the values within that range from the other dataframe.  I've
provided
>>> code below doing what I would like, but using a for loop is too
>>> inefficient for my actual case (takes about an hour).  So I'm
looking
>>> for a way to vectorize.
>>>
>>>
>>> set.seed(345)
>>> date.range <-
seq(as.POSIXct("2015-01-01"),as.POSIXct("2015-06-01"),
>>> by="DSTday")
>>> observations <- data.frame(date=date.range,
values=runif(152,1,100) )
>>> groups <- data.frame(start=sample(date.range[1:50], 20), end
>>> sample(date.range[51:152], 20), average = NA)
>>>
>>> #Potential Solution (too inefficient)
>>>
>>> for(i in 1:NROW(groups)){
>>> groups[i, "average"] <-
mean(observations[observations$date >>>> groups[i,
"start"] & observations$date <=groups[i, "end"],
"values"])
>>> }
>>>
>> The 'average' column could be added to groups with this value:
>>
>> mapply( function(start,end){ mean(observations[['values']][
>>                      observations$date >= start &
observations$date <=end])},
>>         groups$start, groups$end)
>>
>>  [1] 50.96831 49.42286 47.27240 49.07534 47.66570 49.30977 48.47503
47.74036
>>  [9] 46.02527 58.76492 48.86580 49.90655 45.79705 48.84071 39.53846
46.44601
>> [17] 47.06631 47.74199 49.16980 46.85131
>>
>> I don't really think this is fully "vectorized" in the
usual R-meaning of the word. And I don't expect it to be any faster than the
for-loop. Perhaps some of the range functions in the data.table package could
accelerate your processing. If you don't get any volunteers in this list,
you could repost the question on StackOverflow after a suitable pause that
avoids accusations of cross-posting. SO has several skilled users of data.table
functions.
>>
>>> As an extension to this, there will end up being multiple value
>>> columns, and each range will also identify which column to average.
I
>>> think if I can figure out the first problem I can try to extend it
>>> myself.
>>
>> Sorry, I didn't understand what was being described in that
paragraph.
>>
>> --
>>
>> David Winsemius
>> Alameda, CA, USA
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

Bert Gunter

2016-Feb-10 23:32 UTC

head link

[R] Calculate average of many subsets based on columns in another dataframe

Oh, you didn't say the intervals could overlap!

If Bill D's suggestions don't suffice, try the following:

(again assuming all dates are in a form that allow comparison
operations, e.g. via as.POSIX**)

Assume you have g intervals with start dates "starts" and end dates
"ends" and that you have d "dates".

Then:

wh <- outer(starts, dates,,"<=") &
outer(ends,dates,">") ## arrange
"<" and ">"   as you wish

## (?outer for details.)

is a d x g matrix with each column's TRUE values giving the rows dates
contained in that column's interval.

Then if "somedat" is a data vector of length d

apply(wh, 2, function(x) mean(somedat[x]) )

will give you the means for each interval of all somedat values whose
dates  fell into that interval.
This last step can be repeated for as many somedats as you like.

Note that this is still a loop (via apply), however, so it may not
satisfy your efficiency needs.

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Wed, Feb 10, 2016 at 2:18 PM, Peter Lomas <peter.br.lomas at gmail.com>
wrote:> Thanks David, Bert,
>
> From what I'm reading on ?findInterval, It may not be workable because
> of overlapping date ranges.  findInterval seems to take a series of
> bin breakpoints as its argument. I'm currently exploring data.table
> documentation and will keep thinking about this.
>
> Just on David's point, the extension of this with "groups"
would look
> as below.  I just don't want to complicate it before I've solved
the
> simplest issue.
>
> set.seed(345)
> date.range <-
seq(as.POSIXct("2015-01-01"),as.POSIXct("2015-06-01"),
> by="DSTday")
> observations <- data.frame(date=date.range, a=runif(152,1,100),
> b=runif(152,1,100), c=runif(152,1,100) )
> groups <- data.frame(start=sample(date.range[1:50], 20), end >
sample(date.range[51:152], 20), group=sample(letters[1:3], 20,
> replace=TRUE), average = NA)
>
> #Potential Solutions (too inefficient)
> for(i in 1:NROW(groups)){
>  groups[i, "average"] <- mean(observations[observations$date
>> groups[i, "start"] & observations$date <=groups[i,
"end"],
> as.character(groups[i, "group"])])
> }
>
> Thanks again,
> Peter
>
> On Wed, Feb 10, 2016 at 2:26 PM, Bert Gunter <bgunter.4567 at
gmail.com> wrote:
>> A strategy:
>>
>>  1. Convert your dates and intervals to numerics that give the days
>> since a time origin. See as.POSIXlt (or ** ct for details and an
>> example that does this). Should be fast...
>>
>> 2. Use the findInterval() function to get the interval into which each
>> date falls. This **is** "vectorized" and should be fairly
fast.
>>
>> 3. Use the ave() function using the intervals as your factor that
>> splits your data column(s) for which you wish to compute statistics.
>> The basic statistics functions like mean, sum, etc. **are**
>> vectorized, so this should be fast.
>>
>> As David said, the *apply functions will probably not be much, if at
>> all, faster than an explicit for() loop. Most of the time will be
>> spent spent comparing the dates to the intervals to find in which each
>> falls, and findInterval is a fast way to do this.
>>
>> ... I think.
>>
>> If you try this, let me know (perhaps privately) how/if it works.
>>
>> Cheers,
>> Bert
>> Bert Gunter
>>
>> "The trouble with having an open mind is that people keep coming
along
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic
strip )
>>
>>
>> On Wed, Feb 10, 2016 at 1:08 PM, David Winsemius <dwinsemius at
comcast.net> wrote:
>>>
>>>> On Feb 10, 2016, at 12:18 PM, Peter Lomas <peter.br.lomas at
gmail.com> wrote:
>>>>
>>>> Hello, I have a dataframe with a date range, and another
dataframe
>>>> with observations by date.  For each date range, I'd like
to average
>>>> the values within that range from the other dataframe. 
I've provided
>>>> code below doing what I would like, but using a for loop is too
>>>> inefficient for my actual case (takes about an hour).  So
I'm looking
>>>> for a way to vectorize.
>>>>
>>>>
>>>> set.seed(345)
>>>> date.range <-
seq(as.POSIXct("2015-01-01"),as.POSIXct("2015-06-01"),
>>>> by="DSTday")
>>>> observations <- data.frame(date=date.range,
values=runif(152,1,100) )
>>>> groups <- data.frame(start=sample(date.range[1:50], 20), end
>>>> sample(date.range[51:152], 20), average = NA)
>>>>
>>>> #Potential Solution (too inefficient)
>>>>
>>>> for(i in 1:NROW(groups)){
>>>> groups[i, "average"] <-
mean(observations[observations$date >>>>> groups[i,
"start"] & observations$date <=groups[i, "end"],
"values"])
>>>> }
>>>>
>>> The 'average' column could be added to groups with this
value:
>>>
>>> mapply( function(start,end){ mean(observations[['values']][
>>>                      observations$date >= start &
observations$date <=end])},
>>>         groups$start, groups$end)
>>>
>>>  [1] 50.96831 49.42286 47.27240 49.07534 47.66570 49.30977 48.47503
47.74036
>>>  [9] 46.02527 58.76492 48.86580 49.90655 45.79705 48.84071 39.53846
46.44601
>>> [17] 47.06631 47.74199 49.16980 46.85131
>>>
>>> I don't really think this is fully "vectorized" in
the usual R-meaning of the word. And I don't expect it to be any faster than
the for-loop. Perhaps some of the range functions in the data.table package
could accelerate your processing. If you don't get any volunteers in this
list, you could repost the question on StackOverflow after a suitable pause that
avoids accusations of cross-posting. SO has several skilled users of data.table
functions.
>>>
>>>> As an extension to this, there will end up being multiple value
>>>> columns, and each range will also identify which column to
average.  I
>>>> think if I can figure out the first problem I can try to extend
it
>>>> myself.
>>>
>>> Sorry, I didn't understand what was being described in that
paragraph.
>>>
>>> --
>>>
>>> David Winsemius
>>> Alameda, CA, USA
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.

R help - Feb 2016 - Calculate average of many subsets based on columns in another dataframe

[R] Calculate average of many subsets based on columns in another dataframe

[R] Calculate average of many subsets based on columns in another dataframe

[R] Calculate average of many subsets based on columns in another dataframe