hong.ooi at anz.com
2009-Nov-27 05:55 UTC
[Rd] Long execution time for quantile() and difftime objects (PR#14091)
Full_Name: Hong Ooi
Version: 2.10.0
OS: Windows XP
Submission from: (NULL) (203.110.235.1)

While trying to get summary statistics on a duration variable (the difference between a start and end date), I ran into the following issue. Using summary or quantile (which summary calls) on a difftime object takes an extremely long time if the object is even moderately large.

A reproducible example:

> x <- as.Date(1:10000, origin="1900-01-01")
> x[1:10]
 [1] "1900-01-02" "1900-01-03" "1900-01-04" "1900-01-05" "1900-01-06"
 [6] "1900-01-07" "1900-01-08" "1900-01-09" "1900-01-10" "1900-01-11"
> d <- x - as.Date("1900-01-01")
> d[1:10]
Time differences in days
 [1]  1  2  3  4  5  6  7  8  9 10
> system.time(summary(d[1:10]))
   user  system elapsed
   0.01    0.00    0.01
> system.time(summary(d[1:100]))
   user  system elapsed
   0.21    0.00    0.20
> system.time(summary(d[1:1000]))
   user  system elapsed
   3.02    0.00    3.02
> system.time(summary(d[1:10000]))
   user  system elapsed
  43.56    0.04   43.66

If I unclass d, there is no problem:

> system.time(summary(unclass(d[1:10000])))
   user  system elapsed
      0       0       0

Testing with Rprof() indicates that the problem lies in [.difftime, although the code for that function seems innocuous enough.

> sessionInfo()
R version 2.10.0 (2009-10-26)
i386-pc-mingw32

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
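The Rprof() check mentioned in the report is not shown in the message. A minimal sketch of how such a check might be run follows; the temporary file name, the sampling interval and the repeat count are assumptions chosen for illustration, and on R versions where summary() on a difftime is already fast the profiler may record no samples at all.

```r
# Rebuild the difftime vector from the report.
x <- as.Date(1:10000, origin = "1900-01-01")
d <- x - as.Date("1900-01-01")

prof_file <- tempfile(fileext = ".out")  # hypothetical scratch file

Rprof(prof_file, interval = 0.01)        # start sampling the call stack
for (i in 1:20) invisible(summary(d[1:1000]))
Rprof(NULL)                              # stop profiling

# summaryRprof() signals an error when no samples were recorded, which can
# happen when the call is already fast, so fall back to NULL in that case.
by_total <- tryCatch(
  head(summaryRprof(prof_file)$by.total, 10),
  error = function(e) NULL
)
print(by_total)
```

If the slowdown is present, the functions that dominate the run time appear in the top rows of the by.total table.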
Prof Brian Ripley
2009-Nov-27 14:09 UTC
[Rd] Long execution time for quantile() and difftime objects (PR#14091)
Did you read the help page?

       x: numeric vector whose sample quantiles are wanted.  'NA' and
          'NaN' values are not allowed unless 'na.rm' is 'TRUE'.

so only 'numeric' vectors are really supported, although it does say

     The default method does not allow factors, but works with objects
     sufficiently like numeric vectors that 'sort', addition and
     multiplication work correctly.  In principle only sorts and
     weighted means are needed, so datetimes could have quantiles - but
     this is not implemented.

There is no claim that it works (let alone works well) for class "difftime".

If you follow the link to 'sort' it says

     The default 'sort' method makes use of 'order' for objects with
     classes, which in turn makes use of the generic function 'xtfrm'.

and from ?xtfrm

     The default method will make use of '==' and '>' methods for the
     class of 'x[i]' (for integers 'i'), and the 'is.na' method for the
     class of 'x', but might be rather slow when doing so.

So, if you want this to be fast, you need to write an xtfrm method. There is one in R-devel

> xtfrm.difftime
function (x) as.numeric(x)

and you can use that in your workspace (and your example is fast in R-devel, because of that function I think, there being other development work in progress in the version I tried).

On Fri, 27 Nov 2009, hong.ooi at anz.com wrote:

> Full_Name: Hong Ooi
> Version: 2.10.0
> OS: Windows XP
> Submission from: (NULL) (203.110.235.1)
>
>
> While trying to get summary statistics on a duration variable (the difference
> between a start and end date), I ran into the following issue. Using summary or
> quantile (which summary calls) on a difftime object takes an extremely long time
> if the object is even moderately large.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
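
The workaround suggested in the reply can be tried along the following lines. This is a sketch only: it copies the one-line xtfrm.difftime method quoted from R-devel into the workspace and re-runs the timing from the report. Recent R releases should already provide such a method, in which case the definition here is redundant, and actual timings will depend on the R version and machine.

```r
# Same data as in the report.
x <- as.Date(1:10000, origin = "1900-01-01")
d <- x - as.Date("1900-01-01")

# Workspace copy of the R-devel method quoted in the reply. With it, order()
# and sort() work on a plain numeric vector instead of falling back to the
# slow default xtfrm method, which subsets the difftime object repeatedly.
xtfrm.difftime <- function(x) as.numeric(x)

# Sanity check: the numeric key orders the data the same way as the raw values.
stopifnot(is.numeric(xtfrm(d)),
          identical(order(xtfrm(d)), order(unclass(d))))

# Re-run the slow case from the report.
system.time(s <- summary(d))
print(s)
```

Defining the method in the workspace affects only the current session. An alternative that needs no method at all is the one already shown in the report: call summary(unclass(d)) and keep track of the units separately.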