I am working on a simple pilot project comparing the capability of SQL, SAS and R to perform a rolling mean per the following instructions. I have completed the SQL and SAS analysis, so now it's R's turn. Calculate mean values of x (x=count) for each date in the dataset where mean = the average count of days [t-9] through day [t-3] for each date/illness combination. Dataset aggpilot date illness x 1 2006/01/01 DERM 319 2 2006/01/02 DERM 388 3 2006/01/03 DERM 336 4 2006/01/04 DERM 255 5 2006/01/05 DERM 177 6 2006/01/06 DERM 377 7 2006/01/07 DERM 113 8 2006/01/08 DERM 253 9 2006/01/09 DERM 316 10 2006/01/10 DERM 187 11 2006/01/11 DERM 292 12 2006/01/12 DERM 275 13 2006/01/13 DERM 355 ... 3102 total observations Result of command: str(aggpilot) 'data.frame': 3102 obs. of 3 variables: $ date : Factor w/ 1551 levels "2006/01/01","2006/01/02",..: 1 2 3 4 5 6 7 8 9 10 ... $ illness: Factor w/ 2 levels "DERM","FEVER": 1 1 1 1 1 1 1 1 1 1 ... $ x : int 319 388 336 255 177 377 113 253 316 187 ... The results should look something like this: DATE ILLNESS x MEAN NOTES [ROLLING MEAN DATE RANGE] 1/1/2006 DERM 319 1/2/2006 DERM 388 1/3/2006 DERM 336 1/4/2006 DERM 255 319 [1/1/2006 - 1/1/2006] 1/5/2006 DERM 177 353.5 [1/1/2006 - 1/2/2006] 1/6/2006 DERM 377 347.6666667 [1/1/2006 - 1/3/2006] 1/7/2006 DERM 113 324.5 [1/1/2006 - 1/4/2006] 1/8/2006 DERM 253 295 [1/1/2006 - 1/5/2006] 1/9/2006 DERM 316 308.6666667 [1/1/2006 - 1/6/2006] 1/10/2006 DERM 187 280.7142857 [1/1/2006 - 1/7/2006] 1/11/2006 DERM 292 271.2857143 [1/2/2006 - 1/8/2006] 1/12/2006 DERM 275 261 [1/3/2006 - 1/9/2006] 1/13/2006 DERM 355 239.7142857 [1/4/2006 - 1/10/2006] Ken Hall Computer Scientist Division of Healthcare Information (DHI) (proposed) Public Health Surveillance Program Office (proposed) Office of Surveillance, Epidemiology, & Laboratory Services (OSELS) (proposed) Centers for Disease Control & Prevention (CDC) kha6@cdc.gov Mobile: 404-993-3311 Office: 404-498-6839 [[alternative HTML version deleted]]
I don't know if this is what you are after, but have a look at: rollmean in the zoo package hth Stephen On Fri, Aug 20, 2010 at 2:47 PM, Hall, Ken (CDC/OSELS/NCPHI) <kha6 at cdc.gov> wrote:> I am working on a simple pilot project comparing the capability of SQL, > SAS and R to perform a rolling mean per the following instructions. I > have completed the SQL and SAS analysis, so now it's R's turn. > > Calculate mean values of x (x=count) for each date in the dataset where > mean = the average count of days [t-9] through day [t-3] for each > date/illness combination. > > Dataset aggpilot > > ? ? date ? ? ? ? ? illness x > 1 ? ?2006/01/01 ? ?DERM 319 > 2 ? ?2006/01/02 ? ?DERM 388 > 3 ? ?2006/01/03 ? ?DERM 336 > 4 ? ?2006/01/04 ? ?DERM 255 > 5 ? ?2006/01/05 ? ?DERM 177 > 6 ? ?2006/01/06 ? ?DERM 377 > 7 ? ?2006/01/07 ? ?DERM 113 > 8 ? ?2006/01/08 ? ?DERM 253 > 9 ? ?2006/01/09 ? ?DERM 316 > 10 ? 2006/01/10 ? ?DERM 187 > 11 ? 2006/01/11 ? ?DERM 292 > 12 ? 2006/01/12 ? ?DERM 275 > 13 ? 2006/01/13 ? ?DERM 355 > ... 3102 total observations > > Result of command: str(aggpilot) > > 'data.frame': ? 3102 obs. of ?3 variables: > ?$ date ? : Factor w/ 1551 levels "2006/01/01","2006/01/02",..: 1 2 3 4 > 5 6 7 8 9 10 ... > ?$ illness: Factor w/ 2 levels "DERM","FEVER": 1 1 1 1 1 1 1 1 1 1 ... > ?$ x ? ? ?: int ?319 388 336 255 177 377 113 253 316 187 ... > > The results should look something like this: > > DATE ? ?ILLNESS x ? ? ? MEAN ? ? ? ? ? ? ? ? ? ?NOTES [ROLLING MEAN DATE > RANGE] > 1/1/2006 ? ? ? ?DERM ? ?319 > 1/2/2006 ? ? ? ?DERM ? ?388 > 1/3/2006 ? ? ? ?DERM ? ?336 > 1/4/2006 ? ? ? ?DERM ? ?255 ? ? 319 ? ? ? ? ? ? ? ? ? ? [1/1/2006 - > 1/1/2006] > 1/5/2006 ? ? ? ?DERM ? ?177 ? ? 353.5 ? ? ? ? ? ? ? ? ? [1/1/2006 - > 1/2/2006] > 1/6/2006 ? ? ? ?DERM ? ?377 ? ? 347.6666667 ? ? ? ? ? ? [1/1/2006 - > 1/3/2006] > 1/7/2006 ? ? ? ?DERM ? ?113 ? ? 324.5 ? ? ? ? ? ? ? ? ? [1/1/2006 - > 1/4/2006] > 1/8/2006 ? ? ? ?DERM ? ?253 ? ? 295 ? ? ? ? ? ? ? ? ? ? [1/1/2006 - > 1/5/2006] > 1/9/2006 ? ? ? ?DERM ? ?316 ? ? 308.6666667 ? ? ? ? ? ? [1/1/2006 - > 1/6/2006] > 1/10/2006 ? ? ? DERM ? ?187 ? ? 280.7142857 ? ? ? ? ? ? [1/1/2006 - > 1/7/2006] > 1/11/2006 ? ? ? DERM ? ?292 ? ? 271.2857143 ? ? ? ? ? ? [1/2/2006 - > 1/8/2006] > 1/12/2006 ? ? ? DERM ? ?275 ? ? 261 ? ? ? ? ? ? ? ? ? ? [1/3/2006 - > 1/9/2006] > 1/13/2006 ? ? ? DERM ? ?355 ? ? 239.7142857 ? ? ? ? ? ? [1/4/2006 - > 1/10/2006] > > > Ken Hall > Computer Scientist > Division of Healthcare Information (DHI) (proposed) > Public Health Surveillance Program Office (proposed) > Office of Surveillance, Epidemiology, & Laboratory Services (OSELS) > (proposed) > Centers for Disease Control & Prevention (CDC) > kha6 at cdc.gov > Mobile: 404-993-3311 > Office: 404-498-6839 > > > > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Stephen Sefick ____________________________________ | Auburn University? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?? | | Department of Biological Sciences? ? ? ? ?? | | 331 Funchess Hall? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | | Auburn, Alabama? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?? | | 36849? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | |___________________________________| | sas0025 at auburn.edu? ? ? ? ? ? ? ? ? ? ? ? ? ?? | | http://www.auburn.edu/~sas0025? ? ? ? ? ?? | |___________________________________| Let's not spend our time and resources thinking about things that are so little or so large that all they really do for us is puff us up and make us feel like gods.? We are mammals, and have not exhausted the annoying little problems of being mammals. ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? -K. Mullis
On Fri, Aug 20, 2010 at 3:47 PM, Hall, Ken (CDC/OSELS/NCPHI) <kha6 at cdc.gov> wrote:> I am working on a simple pilot project comparing the capability of SQL, > SAS and R to perform a rolling mean per the following instructions. I > have completed the SQL and SAS analysis, so now it's R's turn. > > Calculate mean values of x (x=count) for each date in the dataset where > mean = the average count of days [t-9] through day [t-3] for each > date/illness combination. > > Dataset aggpilot > > ? ? date ? ? ? ? ? illness x > 1 ? ?2006/01/01 ? ?DERM 319 > 2 ? ?2006/01/02 ? ?DERM 388 > 3 ? ?2006/01/03 ? ?DERM 336 > 4 ? ?2006/01/04 ? ?DERM 255 > 5 ? ?2006/01/05 ? ?DERM 177 > 6 ? ?2006/01/06 ? ?DERM 377 > 7 ? ?2006/01/07 ? ?DERM 113 > 8 ? ?2006/01/08 ? ?DERM 253 > 9 ? ?2006/01/09 ? ?DERM 316 > 10 ? 2006/01/10 ? ?DERM 187 > 11 ? 2006/01/11 ? ?DERM 292 > 12 ? 2006/01/12 ? ?DERM 275 > 13 ? 2006/01/13 ? ?DERM 355 > ... 3102 total observations > > Result of command: str(aggpilot) > > 'data.frame': ? 3102 obs. of ?3 variables: > ?$ date ? : Factor w/ 1551 levels "2006/01/01","2006/01/02",..: 1 2 3 4 > 5 6 7 8 9 10 ... > ?$ illness: Factor w/ 2 levels "DERM","FEVER": 1 1 1 1 1 1 1 1 1 1 ... > ?$ x ? ? ?: int ?319 388 336 255 177 377 113 253 316 187 ... > > The results should look something like this: > > DATE ? ?ILLNESS x ? ? ? MEAN ? ? ? ? ? ? ? ? ? ?NOTES [ROLLING MEAN DATE > RANGE] > 1/1/2006 ? ? ? ?DERM ? ?319 > 1/2/2006 ? ? ? ?DERM ? ?388 > 1/3/2006 ? ? ? ?DERM ? ?336 > 1/4/2006 ? ? ? ?DERM ? ?255 ? ? 319 ? ? ? ? ? ? ? ? ? ? [1/1/2006 - > 1/1/2006] > 1/5/2006 ? ? ? ?DERM ? ?177 ? ? 353.5 ? ? ? ? ? ? ? ? ? [1/1/2006 - > 1/2/2006] > 1/6/2006 ? ? ? ?DERM ? ?377 ? ? 347.6666667 ? ? ? ? ? ? [1/1/2006 - > 1/3/2006] > 1/7/2006 ? ? ? ?DERM ? ?113 ? ? 324.5 ? ? ? ? ? ? ? ? ? [1/1/2006 - > 1/4/2006] > 1/8/2006 ? ? ? ?DERM ? ?253 ? ? 295 ? ? ? ? ? ? ? ? ? ? [1/1/2006 - > 1/5/2006] > 1/9/2006 ? ? ? ?DERM ? ?316 ? ? 308.6666667 ? ? ? ? ? ? [1/1/2006 - > 1/6/2006] > 1/10/2006 ? ? ? DERM ? ?187 ? ? 280.7142857 ? ? ? ? ? ? [1/1/2006 - > 1/7/2006] > 1/11/2006 ? ? ? DERM ? ?292 ? ? 271.2857143 ? ? ? ? ? ? [1/2/2006 - > 1/8/2006] > 1/12/2006 ? ? ? DERM ? ?275 ? ? 261 ? ? ? ? ? ? ? ? ? ? [1/3/2006 - > 1/9/2006] > 1/13/2006 ? ? ? DERM ? ?355 ? ? 239.7142857 ? ? ? ? ? ? [1/4/2006 - > 1/10/2006] > >Lines <- "date illness x 2006/01/01 DERM 319 2006/01/02 DERM 388 2006/01/03 DERM 336 2006/01/04 DERM 255 2006/01/05 DERM 177 2006/01/06 DERM 377 2006/01/07 DERM 113 2006/01/08 DERM 253 2006/01/09 DERM 316 2006/01/10 DERM 187 2006/01/11 DERM 292 2006/01/12 DERM 275 2006/01/13 DERM 355 2006/01/01 FEVER 3190 2006/01/02 FEVER 3880 2006/01/03 FEVER 3360 2006/01/04 FEVER 2550 2006/01/05 FEVER 1770 2006/01/06 FEVER 3770 2006/01/07 FEVER 1130 2006/01/08 FEVER 2530 2006/01/09 FEVER 3160 2006/01/10 FEVER 1870 2006/01/11 FEVER 2920 2006/01/12 FEVER 2750 2006/01/13 FEVER 3550" ################################################## # 1. Produces same result from long form data # Only uses core R functionality (no packages). # # For each row form ix, the set of row indices # over which to average. Then form the columns. ################################################## # read in data DF <- read.table(textConnection(Lines), header = TRUE) DF$date <- as.Date(DF$date) # calculate one row of the output one_row <- function(i, X) with(X, { ix <- seq(i-9, i-3) ix <- ix[ix > 0] ix <- ix[X$illness[ix] == X$illness[i]] date <- format(date, "%m/%d/%Y") tt <- date[ix] NOTES <- if (length(ix) > 0) paste(head(tt, 1), "-", tail(tt, 1)) else "" data.frame(DATE = date[i], ILLNESS = illness[i], x = x[i], MEAN = mean(x[ix]), NOTES) }) do.call(rbind, lapply(1:NROW(DF), one_row, DF)) ################################################## # 2. In common situations the partial means at ends # would not be calculated since they are not # comparable to the other values -- they have # higher variance. Also this data would typically # be represented as a multivariate time series, # not as long form data. # With these changes and making use of the # rollmean in the zoo package it simplifies. # # read.zoo with split= reads it into a multivariate # time series with one column per illness. # We convert dates to chron class since it uses # the desired output date format by default. # It is converted to zooreg as its a weakly # regular series (actually its completely regular). # Then we merge it with its rolling mean and # form the desired columns. ################################################## library(zoo) library(chron) x <- read.zoo(textConnection(Lines), header = TRUE, split = "illness") time(x) <- as.chron(time(x)) x <- as.zooreg(x) zm <- merge(x, mean = lag(rollmean(x, 7, align = "right"), -3), all = FALSE) fmt <- function(x) format(x, "%m/%d/%Y") data.frame(DATE = time(zm), coredata(zm), notes = paste(time(zm) - 9, "-", time(zm) - 3)) ################################################## # 3. An alternative approach is to use the sqldf R package # See http://sqldf.googlecode.com ################################################## library(sqldf) sqldf('select strftime("%m/%d/%Y", s1.date*24*60*60, \'unixepoch\') "DATE", s1.illness "ILLNESS", x, mean "MEAN", coalesce(strftime("%m/%d/%Y", fromdate*24*60*60, \'unixepoch\') || \' - \' || strftime("%m/%d/%Y", fromdate*24*60*60, \'unixepoch\'), \'\') "NOTES" from DF s1 left join (select t1.date, avg(t2.x) mean, min(t2.date) fromdate, max(t2.date) todate, max(t2.illness) illness from DF t1, DF t2 where julianday(t1.date) between julianday(t2.date) + 3 AND julianday(t2.date) + 9 and t1.illness = t2.illness group by t1.illness, t1.date order by t1.illness, t1.date) s2 on s1.date = s2.date and s1.illness = s2.illness') # You can also explore doing part of it with sqldf and other parts in R.
On Sat, Aug 21, 2010 at 3:01 AM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:> On Fri, Aug 20, 2010 at 3:47 PM, Hall, Ken (CDC/OSELS/NCPHI) > <kha6 at cdc.gov> wrote:I have also added some slightly simpler sqldf solutions here: http://code.google.com/p/sqldf/#Example_16._Moving_Average