Mike Williamson
2010-Dec-17 01:35 UTC
[R] how to convert "sloppy data" into a time series?
Hi All,

First let me state that I did search for a while on r-help, Google, and using the "sos" package inside of R, without much luck. I want to know how to create a univariate time series from a set of data that has huge time gaps in it. For instance, here is a snapshot of a piece of data that I would like to analyze:

Row  queued_time              processTime
50   2010-06-15 21:50:42.443  6.399989e-02 secs
63   2010-06-15 21:51:57.347  6.300020e-02 secs
156  2010-06-29 14:53:26.073  3.011863e+06 secs
175  2010-07-22 10:14:57.503  4.334879e+06 secs
278  2010-08-05 11:29:56.713  6.155674e+06 secs
509  2010-08-05 11:29:57.443  3.120779e+06 secs
531  2010-08-05 11:29:57.543  3.120779e+06 secs
555  2010-08-05 11:29:57.647  3.120779e+06 secs
190  2010-08-05 11:29:57.943  3.120778e+06 secs
230  2010-08-05 11:29:58.047  3.120778e+06 secs
211  2010-08-05 11:29:58.917  3.120777e+06 secs
251  2010-08-05 11:29:59.077  3.120777e+06 secs
298  2010-08-05 11:29:59.297  3.120777e+06 secs
320  2010-08-05 11:29:59.397  3.120777e+06 secs
366  2010-08-05 11:29:59.707  3.120777e+06 secs
342  2010-08-05 11:30:00.987  3.120775e+06 secs
380  2010-08-05 11:30:01.200  3.120775e+06 secs
120  2010-08-19 09:31:47.207  2.358866e+06 secs
141  2010-08-19 09:31:47.500  2.358866e+06 secs
842  2010-09-03 13:58:21.463  3.641194e+06 secs

I would like to take the second column, "processTime", and put it into a time series using the first column as the key for when each value occurred. But everything I could find, such as ts(), assumes that I already have a regularly spaced univariate series and only need to set the frequency and start date (in the case of ts()).

I can adjust the "queued_time" precision arbitrarily as needed, so that if the data set would end up far too sparse and empty by keeping the current precision, I could cut "queued_time" down to just the year, month, day, and hour. But in that case, how would the time series handle the fact that there are several (varying numbers of) entries sharing the same timestamp?

The reason I want to do this is that I next want to use all the very nice modeling capabilities that a univariate time series allows, such as arima(), etc.

Thanks in advance!
Mike

"Telescopes and bathyscaphes and sonar probes of Scottish lakes,
Tacoma Narrows bridge collapse explained with abstract phase-space maps,
Some x-ray slides, a music score, Minard's Napoleonic war:
The most exciting frontier is charting what's already here."
    -- xkcd
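P.S. In the spirit of the posting guide, here is a rough, reproducible sketch of the first few rows above as an R data frame, in case anyone wants to experiment (the %OS part of the format string is what parses the fractional seconds; only a subset of the rows is shown):

## Reconstruct the first few rows of the data shown above.
## options(digits.secs = 3) only affects printing of the
## fractional seconds, not the parsing.
options(digits.secs = 3)
df <- data.frame(
  queued_time = as.POSIXct(
    c("2010-06-15 21:50:42.443",
      "2010-06-15 21:51:57.347",
      "2010-06-29 14:53:26.073",
      "2010-07-22 10:14:57.503",
      "2010-08-05 11:29:56.713"),
    format = "%Y-%m-%d %H:%M:%OS"),
  processTime = c(6.399989e-02, 6.300020e-02,
                  3.011863e+06, 4.334879e+06, 6.155674e+06)
)
str(df)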
David Winsemius
2010-Dec-17 03:57 UTC
[R] how to convert "sloppy data" into a time series?
On Dec 16, 2010, at 8:35 PM, Mike Williamson wrote:

> [snip -- original question quoted above]

Information on package 'its':

  Package:     its
  Version:     1.1.8
  Date:        2009-09-06
  Title:       Irregular Time Series
  Author:      Portfolio & Risk Advisory Group, Commerzbank Securities
  Maintainer:  Whit Armstrong <armstrong.whit at gmail.com>

--
David Winsemius, MD
West Hartford, CT
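Following up on that pointer, a minimal sketch with 'its' might look like the following. It assumes that its() accepts a numeric matrix together with a POSIXct vector of dates (check ?its after installing for the authoritative signature), and it reuses the df object sketched under the original question:

## A sketch only -- see ?its for the actual constructor arguments.
## install.packages("its")
library(its)

## its() is assumed here to want a numeric matrix; one column,
## named after the variable of interest.
x <- matrix(df$processTime, ncol = 1,
            dimnames = list(NULL, "processTime"))

## Build the irregular time series keyed by the queueing timestamps.
pt.its <- its(x, dates = df$queued_time)
summary(pt.its)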
Hi:

As you mentioned at the outset, you have a very irregular time series, for which David has given you one reasonable suggestion; another is the zoo package. Those are the standard R packages for dealing with irregular time series. There may be others of which I am unaware, though - there may be something in the Rmetrics suite that pertains, for example. Check the Time Series task view on CRAN for possible alternatives: cran.r-project.org/web/views

ARIMA modeling, OTOH, assumes that the data are equally spaced and stationary (perhaps after suitable differencing or detrending). Consequently, I think you may need to rethink your strategy for modeling these data. One possibility is to aggregate the data to an appropriate interval (a sketch of this idea with zoo follows below this message), but you're the one who has to decide what that interval should be and what difficulties might ensue (e.g., unequal sample sizes per time interval). This is not a simple problem, and the best strategy may be to start with description and gradually work your way toward a reasonable, scientifically plausible model. A sensible question to ask is: what is the largest time unit I can use without losing vital information? That might be a place to start.

Trying to model a time series with very large time gaps is a little like having several stills from a movie and trying to reconstruct the movie and its plot without having seen it beforehand. You'll need to use every bit of knowledge you have about the underlying process to aid in the analysis.

HTH,
Dennis

On Thu, Dec 16, 2010 at 5:35 PM, Mike Williamson <this.is.mvw@gmail.com> wrote:

> [snip -- original question quoted above]
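To make the aggregation idea concrete, here is a sketch with zoo, again reusing the df object from the original post. The daily mean is only an example; the right interval and summary function depend on the process being measured:

library(zoo)

## An irregular series: one observation per queued_time stamp.
z <- zoo(df$processTime, order.by = df$queued_time)

## Collapse to (at most) one value per day, e.g. the daily mean,
## so that repeated timestamps within a day are no longer a problem.
daily <- aggregate(z, as.Date(index(z)), mean)

## Once the series is regular (after filling any remaining gaps,
## e.g. with na.approx() or na.locf()), arima() could be applied:
## fit <- arima(coredata(daily), order = c(1, 0, 0))
daily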