Dear all,

I have hundreds of thousands of univariate time series of the form:
character "seriesid", vector of Date, vector of integer
(some example data is at the end of this mail).

I am trying to find the ones whose shape over time looks like the
histogram of a (skewed) normal distribution:

> hist(rnorm(200,10,2))

The "mean" is not interesting, i.e. it does not matter whether the first
nonzero observation happens in the 2nd or the 40th month of observation.
So all that matters is: the series should start at some point, the hits
per month increase, at some point they decrease, and then they more or
less disappear.

Short example (hits in consecutive months, dates omitted):
1. series: 0 0 0 2 5 8 20 42 30 19 6 1 0 0 0 -> Good
2. series: 0 3 8 9 20 6 0 3 25 67 7 1 0 4 60 20 10 0 4 -> Bad

Series 1 would be an ideal case of what I am looking for.

Graphical inspection would be easy but is not an option due to the huge
number of series.

Questions:

1. Which (if any) of the many packages that handle time series is
   appropriate for my problem?

2. Which general approach seems to be the most straightforward and best
   supported by R?
   - Is there a way to test the time series directly (preferably)?
   - Or do I need to "type-cast" them as some kind of histogram
     data and then test against the pdf of e.g. a normal distribution
     (but how)?
   - Or something totally different?

Thank you for your time,

Andreas Neumann


Data Examples (id1 is good, id2 is bad):

> id1
        dates hits
1  2004-12-01    3
2  2005-01-01    4
3  2005-02-01   10
4  2005-03-01    6
5  2005-04-01   35
6  2005-05-01   14
7  2005-06-01   33
8  2005-07-01   13
9  2005-08-01    3
10 2005-09-01    9
11 2005-10-01    8
12 2005-11-01    4
13 2005-12-01    3

> id2
        dates hits
1  2001-01-01    6
2  2001-02-01    5
3  2001-03-01    5
4  2001-04-01    6
5  2001-05-01    2
6  2001-06-01    5
7  2001-07-01    1
8  2001-08-01    6
9  2001-09-01    4
10 2001-10-01   10
11 2001-11-01    0
12 2001-12-01    3
13 2002-01-01    6
14 2002-02-01    5
15 2002-03-01    1
16 2002-04-01    2
17 2002-05-01    4
18 2002-06-01    4
19 2002-07-01    0
20 2002-08-01    1
21 2002-09-01    0
22 2002-10-01    2
23 2002-11-01    2
24 2002-12-01    2
25 2003-01-01    2
26 2003-02-01    3
27 2003-03-01    7

If it's good enough just to examine the number of strictly positive
runs, then

sum(rle(sign(id1$hits))$values == 1)

will give 1 in the good case (one run) and > 1 in the bad case
(multiple runs).

On 3/21/06, Andreas Neumann <Andreas.Neumann at em.uni-karlsruhe.de> wrote:
> Dear all,
>
> I have hundreds of thousands of univariate time series of the form:
> character "seriesid", vector of Date, vector of integer
> [...]

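As a minimal sketch of the run-counting idea above, here it is applied to the two short example series from the original post. The names s1, s2, all_series and count_positive_runs are hypothetical, and the check assumes the hits are non-negative counts. Note that it only tests whether the nonzero months form one contiguous block, not whether they rise and then fall.

```r
# Hypothetical helper: number of maximal runs of strictly positive values.
count_positive_runs <- function(x) {
  runs <- rle(sign(x))     # collapse consecutive equal signs into runs
  sum(runs$values == 1)    # count only the runs where hits > 0
}

# The two short example series from the original post.
s1 <- c(0, 0, 0, 2, 5, 8, 20, 42, 30, 19, 6, 1, 0, 0, 0)
s2 <- c(0, 3, 8, 9, 20, 6, 0, 3, 25, 67, 7, 1, 0, 4, 60, 20, 10, 0, 4)

count_positive_runs(s1)  # 1
count_positive_runs(s2)  # 4

# For many series at once, assuming a named list of hit vectors
# (this storage layout is an assumption, not given in the thread).
all_series <- list(good = s1, bad = s2)
n_runs <- vapply(all_series, count_positive_runs, integer(1))
names(n_runs)[n_runs == 1]  # "good"
```

With hundreds of thousands of series, the vapply() call over a list keeps this to a single pass per series.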
Kjetil Brinchmann Halvorsen
2006-Mar-21 19:09 UTC
[R] Classifying time series by shape over time
Andreas Neumann wrote:
> [...]
> Graphical inspection would be easy but is not an option due to the huge
> amount of series.

Does function turnpoints() in package pastecs help?

Kjetil
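
The turning-point idea can be illustrated without any package. What follows is a base-R sketch and not the pastecs implementation: count_peaks is a hypothetical helper that counts interior local maxima from the sign changes of the first differences, with ties (zero differences) dropped. A series that rises once and then falls has exactly one peak.

```r
# Hypothetical helper: number of interior local maxima of a numeric vector.
count_peaks <- function(x) {
  d <- sign(diff(x))   # +1 where the series rises, -1 where it falls
  d <- d[d != 0]       # ignore flat stretches
  sum(diff(d) < 0)     # a rise followed by a fall is one peak
}

# The two short example series and the id1 hits from the original post.
s1 <- c(0, 0, 0, 2, 5, 8, 20, 42, 30, 19, 6, 1, 0, 0, 0)
s2 <- c(0, 3, 8, 9, 20, 6, 0, 3, 25, 67, 7, 1, 0, 4, 60, 20, 10, 0, 4)
id1_hits <- c(3, 4, 10, 6, 35, 14, 33, 13, 3, 9, 8, 4, 3)

count_peaks(s1)        # 1
count_peaks(s2)        # 3
count_peaks(id1_hits)  # 4
```

On the noisy id1 data the raw count is 4 even though that series is labelled good, so some smoothing or a tolerance for small wiggles would probably be needed before counting. How to choose it is not addressed in this thread.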