Marc Schwartz
2008-Jan-12 02:35 UTC
[Rd] hist.Date() and cut.Date(): approximations used when using breaks = 'months' or 'years'
Hi all,

I came across some curious behavior today in using hist.Date() and subsequently noted the same behavior in cut.Date(), both of which are using similar code when 'breaks = "months"' or 'breaks = "years"'.

I was in the process of creating a histogram of subject enrollment in a clinical trial. The counts needed to be by month, so I essentially used:

    hist(Dates, breaks = "months")

When reviewing the counts generated, I noted a discrepancy between the histogram and another frequency table generated independently. In attempting to identify the etiology, I reviewed the code for hist.Date() and noted the following:

    start <- as.POSIXlt(min(x, na.rm = TRUE))
    ...
    if (valid == 3) {
        start$mday <- 1
        incr <- 31
    }
    if (valid == 4) {
        start$mon <- 0
        incr <- 366
    }
    start <- .Internal(POSIXlt2Date(start))
    maxx <- max(x, na.rm = TRUE)
    breaks <- seq.int(start, maxx + incr, breaks)
    breaks <- breaks[1:(1 + max(which(breaks < maxx)))]
    ...
    res <- hist.default(unclass(x), unclass(breaks), plot = FALSE, ...)

The first check is for breaks = "months" and the second for "years".

If I am reading it correctly, it seems that the discrepancy is due to the approximations of the number of days in a month and the number of days in a year, respectively, which get further and further off, especially for "boundary" dates near the interval breaks.

The use of approximations is not noted in ?hist.Date (or in ?cut.Date), so I was a bit surprised.

To give a specific example, I have uploaded a text file containing a date series that shows at least some aspects of the discrepancy.

    # Read in the file and convert to dates
    # Total of 1361 entries
    Dates <- as.Date(scan("http://home.comcast.net/~marc_schwartz/Dates.txt",
                          what = "character"))

    # Get the hist.Date() counts for months
    > hist(Dates, breaks = "months", plot = FALSE)$counts
     [1]  2  3  2  9 10 15 21 34 52 85 77 59 56 71 73 55 52 88 67 66 74 86
    [23] 58 96 64 71 15

    # Get the hist.Date() counts for years
    > hist(Dates, breaks = "years", plot = FALSE)$counts
    [1]   6 533 822

    # Now format the dates for the subsequent counts
    months <- format(Dates, format = "%m")
    years <- format(Dates, format = "%Y")

    # Tabulate the years - NOTE there are 4 years, not 3 as above
    > table(years)
    years
    2005 2006 2007 2008
       5  491  850   15

    # Now split months by years and tabulate - NOTE count diffs
    > sapply(split(months, years), table)
    $`2005`

    11 12
     1  4

    $`2006`

    01 02 03 04 05 06 07 08 09 10 11 12
     2  8 11 14 18 38 45 85 84 58 54 74

    $`2007`

    01 02 03 04 05 06 07 08 09 10 11 12
    71 57 52 78 74 69 70 90 57 87 74 71

    $`2008`

    01
    15

I think that it becomes clear just how far off the hist.Date() based counts are, though this is clearly affected by the specific date series in question.

I would like to suggest that a warning be added to both hist.Date() and to cut.Date() giving users a heads up that approximations are being used for these intervals, possibly resulting in count errors.

If it is desirable, I would be willing to spend some time incorporating code similar to the above, as appropriate for each interval specification, and make it available for both functions. I suspect additional tweaking would be required to handle other aspects of the two functions.

If there are any pitfalls that I should be aware of that perhaps have led to the use of the current approach, I'd love to hear about them, so that I can avoid re-inventing the wheel, if it is desired for me to proceed with code updates here.

Thanks,

Marc Schwartz
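
For illustration, the calendar-based tabulation used in the comparison above could be wrapped up roughly as follows. This is only a sketch under stated assumptions, not the code in hist.Date() or cut.Date(): the helper name count_by_month() is hypothetical, the sample dates are made up rather than taken from Dates.txt, and each month is treated as starting on its first day, so a date falling on the 1st is counted in its own month. The last few lines show how far a fixed 31-day step would drift from true month starts if it were used as a step size, which is the concern raised in the reading of the code above.

```r
# Hypothetical helper (not part of base R): count dates per calendar month,
# using exact month boundaries rather than a fixed number of days.
count_by_month <- function(x) {
  x <- x[!is.na(x)]
  # First day of the earliest and of the latest month present in the data
  first <- as.Date(format(min(x), "%Y-%m-01"))
  last  <- as.Date(format(max(x), "%Y-%m-01"))
  # seq() on Date objects with by = "month" steps by calendar month
  starts <- seq(first, last, by = "month")
  # Explicit factor levels so that months with no dates show up as zero
  table(factor(format(x, "%Y-%m"), levels = format(starts, "%Y-%m")))
}

# Made-up dates for illustration only (not from Dates.txt)
d <- as.Date(c("2007-01-31", "2007-02-01", "2007-02-28", "2007-04-01"))
counts <- count_by_month(d)
print(counts)

# Comparison: stepping by a fixed 31 days versus stepping by calendar month
fixed_steps <- as.Date("2007-01-01") + 31 * 0:3
exact_steps <- seq(as.Date("2007-01-01"), by = "month", length.out = 4)
print(fixed_steps)
print(exact_steps)
```

With these sample dates the counts are 1, 2, 0 and 1 for 2007-01 through 2007-04, and the fixed 31-day steps land on 2007-03-04 and 2007-04-04 where the calendar steps land on 2007-03-01 and 2007-04-01. A yearly version would follow the same pattern with "%Y" in place of "%Y-%m" and by = "year".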