thr3ads.net - R help - [R] How to identify runs or clusters of events in time [Jul 2016]

Mark Shanks

2016-Jul-01 16:58 UTC

[R] How to identify runs or clusters of events in time

Hi,


Imagine the two problems:


1) You have an event that occurs repeatedly over time. You want to identify
periods when the event occurs more frequently than the base rate of occurrence.
Ideally, you don't want to have to specify the period (e.g., break into
months), so the analysis can be sensitive to scenarios such as many events
happening only between, e.g., June 10 and June 15 - even though the overall
number of events for the month may not be much greater than usual. Similarly,
there may be a cluster of events that occur from March 28 to April 3. Ideally,
you want to pull out the base rate of occurrence and highlight only the periods
when the frequency is lower or higher than the base rate.


2) Events again occur repeatedly over time in an inconsistent way. However, this
time, the event has positive or negative outcomes - such as a spot check of
conformity to regulations. You again want to know whether there is a group of
negative outcomes close together in time. This analysis should take into account
the positive outcomes as well, though. E.g., if from June 10 to June 15 you get 5
negative outcomes and no positive outcomes it should be flagged. On the other
hand, if from June 10 to June 15 you get 5 negative outcomes interspersed
between many positive outcomes it should be ignored.


I'm guessing that there is some statistical approach designed to look at
these types of issues. What is it called? What package in R implements it? I
basically just need to know where to start.


Thanks,


Mark


Clint Bowman

2016-Jul-01 17:49 UTC

Mark,

I did something similar a couple of years ago by coding non-events as 0, 
positive events as +1 and negative events as -1 then summing the value 
through time.  In my case the patterns showed up quite clearly and I used 
other criteria to define the actual periods.
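
The coding described above can be sketched in a few lines of base R. The data below are made up for illustration, and the "largest drop from the previous peak" rule is only one possible way to delimit a period; it is an assumption here, not the criteria actually used in the work described above.

```r
# Hypothetical daily record (not data from this thread):
#   0 = no event, +1 = positive outcome, -1 = negative outcome
days    <- seq(as.Date("2016-06-01"), by = "day", length.out = 20)
outcome <- c(0, 1, 0, 1, -1, 1, 0, 1, 0, -1, -1, -1, 0, -1, -1, 1, 0, 1, 1, 0)

# Running sum through time: a sustained downward slope marks a
# stretch dominated by negative outcomes.
running <- cumsum(outcome)

# Assumed delimiting rule: find the largest drop of the running sum
# below its previous peak, then take the stretch from just after the
# last day at that peak to the day the drop is deepest.
drawdown    <- cummax(running) - running
worst_end   <- which.max(drawdown)
before      <- running[seq_len(worst_end)]
worst_start <- max(which(before == max(before))) + 1

cat("Most negative stretch:", format(days[worst_start]),
    "to", format(days[worst_end]), "\n")
```

Plotting `running` against `days` (for example with `plot(days, running, type = "s")`) gives the kind of picture in which such stretches stand out by eye.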

Clint

Clint Bowman			INTERNET:	clint at ecy.wa.gov
Air Quality Modeler		INTERNET:	clint at math.utah.edu
Department of Ecology		VOICE:		(360) 407-6815
PO Box 47600			FAX:		(360) 407-7534
Olympia, WA 98504-7600

         USPS:           PO Box 47600, Olympia, WA 98504-7600
         Parcels:        300 Desmond Drive, Lacey, WA 98503-1274


Charles C. Berry

2016-Jul-02 02:31 UTC

See below

On Fri, 1 Jul 2016, Mark Shanks wrote:
> Hi,
>
>
> Imagine the two problems:
>
>
> 1) You have an event that occurs repeatedly over time. You want to 
> identify periods when the event occurs more frequently than the base 
> rate of occurrence. Ideally, you don't want to have to specify the 
> period (e.g., break into months), so the analysis can be sensitive to 
> scenarios such as many events happening only between, e.g., June 10 and 
> June 15 - even though the overall number of events for the month may not 
> be much greater than usual. Similarly, there may be a cluster of events 
> that occur from March 28 to April 3. Ideally, you want to pull out the 
> base rate of occurrence and highlight only the periods when the 
> frequency is lower or higher than the base rate.
>
A good place to start is:

Siegmund, D. O., N. R. Zhang, and B. Yakir. "False discovery rate
for scanning statistics." Biometrika 98.4 (2011): 979-985.

and

Aldous, David. Probability approximations via the Poisson clumping 
heuristic. Vol. 77. Springer Science & Business Media, 2013.

---

A nice illustration of how scan statistics can be used is:

Aberdein, Jody, and David Spiegelhalter. "Have London's roads
become more dangerous for cyclists?." Significance 10.6 (2013):
46-48.
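
To make the idea concrete, here is a minimal base-R sketch of a fixed-width temporal scan. It is not taken from any of the references above: the event days, the 7-day window and the Monte Carlo reference distribution (event days redistributed uniformly over the year) are all assumptions chosen for illustration.

```r
# Hypothetical event times as day-of-year (not data from this thread)
event_day <- c(12, 40, 75, 103, 131, 161, 162, 163, 164, 166,
               190, 222, 250, 281, 310, 340)
n_days <- 365
window <- 7   # window width in days: an assumption, vary it in practice

# Number of events in every window [s, s + window - 1]
starts <- seq_len(n_days - window + 1)
window_counts <- function(x) {
  vapply(starts, function(s) sum(x >= s & x < s + window), integer(1))
}
counts <- window_counts(event_day)

# Pointwise Poisson tail probability P(N >= count) under a constant
# base rate; this ignores that many overlapping windows were examined.
base_rate <- length(event_day) / n_days
p_tail <- ppois(counts - 1, lambda = base_rate * window, lower.tail = FALSE)

# Scan statistic: the largest window count, referred to its distribution
# when the same number of events falls uniformly at random over the year.
set.seed(42)
n_sim <- 999
null_max <- replicate(n_sim, {
  max(window_counts(sample.int(n_days, length(event_day), replace = TRUE)))
})
p_scan <- (1 + sum(null_max >= max(counts))) / (n_sim + 1)

best <- which.max(counts)
cat("Densest window starts on day", starts[best], "with", counts[best],
    "events; Monte Carlo p =", p_scan, "\n")
```

The pointwise `p_tail` values are useful for ranking windows, but only `p_scan` accounts for having searched over all window positions, which is the issue the scan-statistic literature addresses.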

>
> 2) Events again occur repeatedly over time in an inconsistent way. 
> However, this time, the event has positive or negative outcomes - such 
> as a spot check of conformity to regulations. You again want to know 
> whether there is a group of negative outcomes close together in time. 
> This analysis should take into account the positive outcomes as well 
> though. E.g., if from June 10 to June 15 you get 5 negative outcomes and 
> no positive outcomes it should be flagged. On the other hand, if from 
> June 10 to June 15 you get 5 negative outcomes interspersed between many 
> positive outcomes it should be ignored.
>
>
> I'm guessing that there is some statistical approach designed to look at
> these types of issues. What is it called?
`Scan statistic' is a good search term. `Poisson clumping', too.
> What package in R implements it? I basically just need to know where to 
> start.
>
>
There are some R packages.

CRAN has packages SNscan and graphscan, which sound like they 
might interest you.

My BioConductor package geneRxCluster:

http://bioconductor.org/packages/release/bioc/html/geneRxCluster.html

seeks clusters in a binary sequence as described in detail at

http://bioinformatics.oxfordjournals.org/content/30/11/1493
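
For the second problem, the general idea of scanning a binary sequence can be sketched in base R. This is not the geneRxCluster interface or its algorithm, only a rough illustration under stated assumptions: the outcomes are made up, the window is 5 consecutive checks, and the reference rate is the overall proportion of negative outcomes.

```r
# Hypothetical spot-check outcomes in time order (1 = negative, 0 = positive)
neg <- c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
         1, 1, 1, 1, 1,
         0, 0, 0, 1, 0, 0, 0, 0, 0, 0)

k  <- 5           # window of k consecutive checks: an assumption
p0 <- mean(neg)   # overall proportion of negative outcomes

# Number of negatives in each run of k consecutive checks;
# row j of embed(neg, k) holds checks j through j + k - 1.
neg_in_window <- rowSums(embed(neg, k))

# Pointwise binomial tail probability P(X >= observed) for each window,
# not adjusted for the number of windows examined
p_tail <- pbinom(neg_in_window - 1, size = k, prob = p0, lower.tail = FALSE)

worst <- which.max(neg_in_window)
cat("Checks", worst, "to", worst + k - 1, "contain",
    neg_in_window[worst], "negatives; pointwise p =", p_tail[worst], "\n")
```

Because the window is counted in checks rather than calendar days, five negatives interspersed among many positives over the same dates spread across several windows and are not flagged, which is the behaviour asked for in the question.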

HTH,

Chuck
