Jens Oehlschlägel
2006-Oct-27 09:14 UTC
[Rd] What to do with a inconsistency in rank() that's in S+ and R ever since?
Dear R-developers,

I just realized that rank() behaves inconsistently when one of na.last in {TRUE|FALSE} is combined with a ties.method in {"average"|"random"|"max"|"min"}. The documentation suggests that, e.g., with na.last=TRUE, NAs are treated like the last (=highest) value, which obviously is not the case:

> rank(c(1,2,2,NA,NA), na.last = TRUE, ties.method = c("average", "first", "random", "max", "min")[1])
[1] 1.0 2.5 2.5 4.0 5.0

I'd expect

[1] 1.0 2.5 2.5 4.5 4.5

instead, but in fact NAs always seem to be treated as if ties.method = "first". I cannot think of a situation in which one would want, e.g., ties.method = "average" for everything except the NAs!?

I am aware that the prototype behaves like this and that R has behaved like this ever since; however, to me this appears very unfortunate. In order not to 'break' existing code, what about adding ties.methods {"NAaverage"|"NArandom"|"NAmax"|"NAmin"} that behave consistently?

Best regards

Jens Oehlschlägel

P.S. Please cc me, I am not on the list.

> version
               _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          4.0
year           2006
month          10
day            03
svn rev        39566
language       R
version.string R version 2.4.0 (2006-10-03)
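The behaviour described in this post can be reproduced with a few calls. The following is a minimal sketch in R; na.last is passed explicitly because its default is not the same in every R version, and the variable names are illustrative only.

```r
x <- c(1, 2, 2, NA, NA)

# Ties among the non-missing values are averaged, but the two NAs
# receive distinct ranks in their order of appearance.
r_last <- rank(x, na.last = TRUE, ties.method = "average")
print(r_last)   # 1.0 2.5 2.5 4.0 5.0

# With na.last = FALSE the NAs come first, again with distinct ranks.
r_first <- rank(x, na.last = FALSE, ties.method = "average")
print(r_first)  # 3.0 4.5 4.5 1.0 2.0

# na.last = "keep" leaves the NAs unranked.
r_keep <- rank(x, na.last = "keep", ties.method = "average")
print(r_keep)   # 1.0 2.5 2.5 NA NA
```

In all three calls ties.method affects only the tied pair of 2s; the NAs are never treated as ties of each other.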
Andrew Piskorski
2006-Oct-27 15:38 UTC
[Rd] What to do with a inconsistency in rank() that's in S+ and R ever since?
On Fri, Oct 27, 2006 at 11:14:25AM +0200, Jens Oehlschlägel wrote:

> rather, but in fact NAs seem to be always treated ties.method =
> "first". I have no idea in which situation one could desire
> e.g. ties.method = "average" except for NAs!?

Interesting. I was aware of the S-Plus vs. R difference, but I didn't realize that it appears to arise because R's rank() ignores ties.method="average" for NA values.

> I am aware that the prototype behaves like this and R ever since
> behaves like this, however to me this appears very unfortunate. In
> order not to 'break' existing code, what about adding ties.methods

If you only care about ranking integers and floating point numbers, it's pretty straightforward to take the S-Plus implementation of rank(), call it my.rank(), and use it in both R and S-Plus. (Since the R rank() makes calls to .Internal(), you can't re-use its implementation in S-Plus.) Note though that the S-Plus-style my.rank() will still sort strings differently in R than in S-Plus. I never looked into why.

Some old notes I have on this issue:

R and S-Plus rank() treat NAs differently (which can magnify other floating point differences):

  # S-Plus 6.2.1:               # R 2.1.0:
  > rank(1:5)                   > rank(1:5)
  [1] 1 2 3 4 5                 [1] 1 2 3 4 5
  > rank(c(1,2,NA,4,NA))        > rank(c(1,2,NA,4,NA))
  [1] 1.0 2.0 4.5 3.0 4.5       [1] 1 2 4 3 5
  > rank(c(1,NA,3,4,NA))        > rank(c(1,NA,3,4,NA))
  [1] 1.0 4.5 2.0 3.0 4.5       [1] 1 4 2 3 5
  > rank(c(1,NA,3))             > rank(c(1,NA,3))
  [1] 1 3 2                     [1] 1 3 2
  > rank(c(NA,NA,3))            > rank(c(NA,NA,3))
  [1] 2.5 2.5 1.0               [1] 2 3 1

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/
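For readers who only need the S-Plus-style results in R, one possible approach is to rank the non-missing values with na.last = "keep" and then give all NAs a single shared rank. The sketch below is an assumption-laden illustration, not the S-Plus implementation and not the my.rank() mentioned in this post: the name rank_na_tied is hypothetical, it wraps R's own rank() (so it cannot be reused in S-Plus), and it handles only numeric input with ties.method "average", "min", or "max".

```r
# Hypothetical helper (not part of base R or S-Plus):
# rank numeric x, treating all NAs as tied with each other.
rank_na_tied <- function(x, na.last = TRUE,
                         ties.method = c("average", "min", "max")) {
  ties.method <- match.arg(ties.method)
  stopifnot(is.numeric(x),
            identical(na.last, TRUE) || identical(na.last, FALSE))

  na   <- is.na(x)
  n_na <- sum(na)

  # Rank only the non-missing values; NAs stay NA for now.
  r <- rank(x, na.last = "keep", ties.method = ties.method)
  if (n_na == 0L) return(r)

  # Positions the block of NAs occupies in the sorted order.
  pos <- if (na.last) (length(x) - n_na) + seq_len(n_na) else seq_len(n_na)

  # If NAs come first, shift the non-missing ranks past them.
  if (!na.last) r[!na] <- r[!na] + n_na

  # Give every NA the same rank, according to the chosen tie rule.
  r[na] <- switch(ties.method,
                  average = mean(pos),
                  min     = min(pos),
                  max     = max(pos))
  r
}

print(rank_na_tied(c(1, 2, 2, NA, NA)))  # 1.0 2.5 2.5 4.5 4.5
print(rank_na_tied(c(NA, NA, 3)))        # 2.5 2.5 1.0
```

Under these assumptions the results match the S-Plus 6.2.1 column in the notes above for the numeric examples shown; "first" and "random" are left out because they do not produce tied ranks in the first place.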