Durant, James T. (ATSDR/DTEM/PRMSB)
2012-Feb-13 00:36 UTC
[R] finding and describing missing data runs in a time series
Hi -

I am trying to find and describe missing data in a time series. For instance, in the library openair, there is a data frame called "mydata":

library(openair)
head(mydata)

                 date   ws  wd nox no2 o3 pm10    so2      co pm25
1 1998-01-01 00:00:00 0.60 280 285  39  1   29 4.7225  3.3725   NA
2 1998-01-01 01:00:00 2.16 230  NA  NA NA   37     NA      NA   NA
3 1998-01-01 02:00:00 2.76 190  NA  NA  3   34 6.8300  9.6025   NA
4 1998-01-01 03:00:00 2.16 170 493  52  3   35 7.6625 10.2175   NA
5 1998-01-01 04:00:00 2.40 180 468  78  2   34 8.0700  8.9125   NA
6 1998-01-01 05:00:00 3.00 190 264  42  0   16 5.5050  3.0525   NA

So, for example, for pm25 I would like to be able to detect that there is a run of NAs starting at 1998-01-01 00:00:00 that lasts for 2887 hourly observations. Then I would be able to know that there is an NA at 2910, and so on. The key information I am looking for is when the runs of NAs start and how long they are.

The closest thing I know about is timePlot in the openair package with statistic="frequency", but it only gives monthly summary data and does not tell me whether the missing data are clumped together or dispersed.

VR

Jim

James T. Durant, MSPH CIH
Emergency Response Coordinator
US Agency for Toxic Substances and Disease Registry
Atlanta, GA 30341
770-378-1695
R. Michael Weylandt <michael.weylandt@gmail.com>
2012-Feb-13 02:40 UTC
[R] finding and describing missing data runs in a time series
Not at a computer to test this, but perhaps rle(is.na(x)) might help.

Michael
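One way the rle(is.na(x)) suggestion might be turned into start positions and run lengths is sketched below on a toy vector. The vector and the names r, ends, starts, na_starts and na_lengths are illustrative assumptions, not part of the original message.

```r
# Toy vector with three runs of missing values (illustrative data only)
x <- c(NA, NA, 5, 7, NA, 9, NA, NA, NA)

r <- rle(is.na(x))               # runs of TRUE (missing) and FALSE (present)
ends <- cumsum(r$lengths)        # index at which each run ends
starts <- ends - r$lengths + 1   # index at which each run starts

# Keep only the runs whose value is TRUE, i.e. the runs of NA
na_starts <- starts[r$values]
na_lengths <- r$lengths[r$values]

print(data.frame(start = na_starts, length = na_lengths))
```

The cumulative sum of the run lengths gives the last index of each run, so subtracting the length and adding one recovers the first index.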
(Ted Harding)
2012-Feb-13 08:51 UTC
[R] finding and describing missing data runs in a time series
You might consider an approach based on

  rle(is.na(mydata$pm25))

See ?rle

Example:

  X <- c(1,2,3,NA,NA,NA,4,5,NA,6,7,8,NA,NA,NA,NA,NA)
  X
  # [1]  1  2  3 NA NA NA  4  5 NA  6  7  8 NA NA NA NA NA
  rle(is.na(X))
  # Run Length Encoding
  #   lengths: int [1:6] 3 3 2 1 3 5
  #   values : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE

Ted.

-------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at wlandres.net>
Date: 13-Feb-2012  Time: 08:51:19
This message was sent by XFMail
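To get from the rle() output to "when the NA's start and their length", the runs can be matched back to the timestamps. The following is only a sketch: na_runs is a hypothetical helper, not a function from base R or openair, and the hourly times vector is made up to stand in for mydata$date.

```r
# Hypothetical helper: one row per run of NA in `values`,
# with the index and timestamp at which the run starts and its length.
na_runs <- function(times, values) {
  r <- rle(is.na(values))
  ends <- cumsum(r$lengths)          # last index of each run
  starts <- ends - r$lengths + 1L    # first index of each run
  keep <- r$values                   # TRUE for runs of missing values
  data.frame(
    start_index = starts[keep],
    start_time  = times[starts[keep]],
    length      = r$lengths[keep]
  )
}

# The example vector from the message above
X <- c(1,2,3,NA,NA,NA,4,5,NA,6,7,8,NA,NA,NA,NA,NA)

# Assumed hourly timestamps, standing in for mydata$date
times <- seq(as.POSIXct("1998-01-01 00:00:00", tz = "UTC"),
             by = "hour", length.out = length(X))

runs <- na_runs(times, X)
print(runs)
```

On the real data the call would presumably be na_runs(mydata$date, mydata$pm25), assuming openair is installed and the date column is complete and in order; that has not been run here.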