Hi all,

I have been chipping away at a problem I encountered in calculating rates per year from a moderately large data file (46412 rows). When I ran the following command, I got obviously wrong output:

interval<-c(NA,as.numeric(diff(strptime(mkdf$MEAS_DATE,"%d/%m/%Y")))/365.25)

The values in MEAS_DATE looked like this:

mkdf$MEAS_DATE[1:10]
[1] 1/5/1962 1/5/1963 1/5/1964 1/3/1965 1/4/1966 1/4/1967 1/6/1968
[8] 25/3/1969 1/4/1971 1/2/1974
146 Levels: 10/10/1967 1/10/1947 1/10/1965 1/10/1967 1/10/1983 ... 9/1/1992

To abbreviate three evenings of work, I finally found that values 17170 and 17171 were the same. If I ran the entire set, or anything over 1:17170, I would get output like this:

interval[1:10]
[1] NA 86340.86 86577.41 71911.29 93673.92 86340.86 101006.98
[8] 70255.44 174337.58 245292.81

If I ran any set of values up to 17170, I would get the correct output:

interval[1:10]
[1] NA 0.9993155 1.0020534 0.8323066 1.0841889 0.9993155 1.1690623
[8] 0.8131417 2.0177960 2.8390372

If I changed value 17171 by one day (and added that level), the command worked correctly:

interval[1:10]
[1] NA 0.9993155 1.0020534 0.8323066 1.0841889 0.9993155 1.1690623
[8] 0.8131417 2.0177960 2.8390372

There have been a few messages about this problem, but apparently no solution. The problem can be seen with these examples (I haven't included the real data as it is not mine):

foodate<-c("1/7/1991","1/8/1991","1/8/1991","3/8/1991")
as.numeric(diff(strptime(foodate,"%d/%m/%Y"))/365.25)
[1] 7333.0595 0.0000 473.1006

foodate<-factor(c("1/7/1991","1/8/1991","1/8/1991","3/8/1991"))
as.numeric(diff(strptime(foodate,"%d/%m/%Y"))/365.25)
[1] 7333.0595 0.0000 473.1006

foodate<-factor(c("1/7/1991","1/8/1991","2/8/1991","3/8/1991"))
as.numeric(diff(strptime(foodate,"%d/%m/%Y"))/365.25)
[1] 0.084873374 0.002737851 0.002737851

Beats me.

Jim
You are throwing away the clue in your use of as.numeric.

First, strptime returns a POSIXlt value, which you will convert to POSIXct when you do arithmetic (using diff()). Why are you doing that? So

> foodate<-factor(c("1/7/1991","1/8/1991","1/8/1991","3/8/1991"))
> diff(strptime(foodate,"%d/%m/%Y"))
Time differences in secs
[1] 2678400 0 172800
attr(,"tzone")
[1] ""

is correct. I think you intended

diff(as.Date(foodate,"%d/%m/%Y"))/365.25

or even add as.numeric() inside diff().

On Thu, 20 Mar 2008, Jim Lemon wrote:
> I have been chipping away at a problem I encountered in calculating
> rates per year from a moderately large data file (46412 rows). When I
> ran the following command, I got obviously wrong output:
>
> interval<-c(NA,as.numeric(diff(strptime(mkdf$MEAS_DATE,"%d/%m/%Y")))/365.25)
>
> [...]

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
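As a rough illustration of the as.Date() suggestion, here is a minimal sketch using the toy dates from the thread. The names dates, gap_days and interval are only for illustration, and the explicit as.character() is a precaution for factor input rather than something the suggestion requires.

```r
# Toy data from the thread, stored as a factor like the real column
foodate <- factor(c("1/7/1991", "1/8/1991", "1/8/1991", "3/8/1991"))

# Convert to Date; differences of Date objects are reported in days
dates <- as.Date(as.character(foodate), format = "%d/%m/%Y")
gap_days <- diff(dates)

# Years between successive measurements, with a leading NA for the first row
interval <- c(NA, as.numeric(gap_days) / 365.25)
print(units(gap_days))
print(interval)
```

Because the differences stay in days even when two consecutive dates are equal, dividing by 365.25 should give years throughout.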
Prof Brian Ripley wrote:
> You are throwing away the clue in your use of as.numeric.
>
> First, strptime returns a POSIXlt value, which you will convert to
> POSIXct when you do arithmetic (using diff()). Why are you doing that? So
>
>> foodate<-factor(c("1/7/1991","1/8/1991","1/8/1991","3/8/1991"))
>> diff(strptime(foodate,"%d/%m/%Y"))
>
> Time differences in secs
> [1] 2678400 0 172800
> attr(,"tzone")
> [1] ""
>
> is correct. I think you intended
>
> diff(as.Date(foodate,"%d/%m/%Y"))/365.25
>
> or even add as.numeric() inside diff().

This is true, but I am puzzled as to why I get the correct output except when there are two consecutive input values that are the same. The idea was to get the number of years between each date in order to calculate a rate per year. If I put the as.numeric inside diff:

diff(as.numeric(strptime(foodate,"%d/%m/%Y"))/365.25)
Error in Ops.POSIXt(as.numeric(strptime(foodate, "%d/%m/%Y")), 365.25) :
/ not defined for "POSIXt" objects

Jim
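One possible reading of "as.numeric() inside diff()" is sketched below. It assumes the strptime result is first converted to POSIXct so that as.numeric() yields plain seconds; tz = "UTC" is an assumption made here to keep daylight saving shifts out of the arithmetic, and the names secs and years are only illustrative.

```r
foodate <- factor(c("1/7/1991", "1/8/1991", "1/8/1991", "3/8/1991"))

# POSIXlt -> POSIXct -> plain numeric seconds since the epoch
secs <- as.numeric(as.POSIXct(strptime(as.character(foodate),
                                       "%d/%m/%Y", tz = "UTC")))

# diff() now sees an ordinary numeric vector, so no units are chosen for us
years <- diff(secs) / (86400 * 365.25)
print(years)
```

Since diff() operates on plain numbers here, the result is always in seconds before the division, whatever the smallest gap is.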
I think I have worked out the problem, and because it may trouble others, I take the liberty of explaining it on the mailing list.

When diff is applied to a vector of POSIXt values returned by strptime, the units of the result depend upon the smallest interval in the input vector. If the smallest interval is at least one day, all of the differences are in days. If it is less than one day, _all_ of the differences are reported in a smaller unit; in my case the smallest interval was zero, so everything came back in seconds. This is quite sensible behavior, and I assume it is the "clue" that Prof. Ripley mentioned. However, if the "units" argument is included in the diff call, it has no effect on diff.POSIXt, which I think does the calculation (in contrast, difftime does return 0 days with units="days").

There may be quite a few R users like me who set up their script using a toy dataset and are puzzled when the real dataset produces what looks like garbage. Thus I humbly suggest the following addition to the help file.

Value
...
When used with times, diff may return different units depending upon the type of time object. An object created by as.Date always returns days, while a POSIXt object will return days if all differences are at least one day, and a smaller unit (seconds when any difference is zero) otherwise.

I would also suggest a fix for the underlying code including a "units" argument for diff, except that I could not find it, despite grepping for "diff" in the src directories.

Jim
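The behavior described above can be checked with a short sketch on the toy dates. It assumes tz = "UTC" to keep daylight saving out of the picture, and the names tt, d_auto, d_days and years are only illustrative. It calls difftime() directly with units = "days" rather than relying on any units argument to diff().

```r
foodate <- c("1/7/1991", "1/8/1991", "1/8/1991", "3/8/1991")
tt <- as.POSIXct(strptime(foodate, "%d/%m/%Y", tz = "UTC"))

# Automatic units: "secs" here, because the smallest gap is zero
d_auto <- diff(tt)
print(units(d_auto))

# Requesting days explicitly gives the same units for every dataset
d_days <- difftime(tt[-1], tt[-length(tt)], units = "days")
years <- as.numeric(d_days) / 365.25
print(years)
```

Checking units() on the result before calling as.numeric() is a cheap way to catch this kind of surprise when moving from a toy dataset to the real one.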