Hello All,

Using the standard "summary" function in R, I ran across some odd behavior that I cannot understand. Easy to reproduce:

Typing:

    summary(c(6,207936))

Yields:

       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
          6   51990  104000  104000  156000  207900

None of these values are correct except for the minimum. If I perform "quantile(c(6, 207936))", it gives the correct values. I originally presumed that summary was merely calling "quantile" if it saw a numeric, but this doesn't seem to be the case.

Anyone know what's going on here?

On a related note, what is the statistically correct answer for calculating the 1st quartile & 3rd quartile when only 2 values are present? I presume one takes the mid-point between the median (also calculated) and the min or max. So in this case, 51988.5 for 1st & 155953.5 for 3rd (which is what quantile calculates). But taking 25% & 75% of the sum of the 2 also seems "reasonable". Either way, "summary" is calculating the wrong number, and most disturbing is that it mis-calculates the max.

Regards,
Mike

"Telescopes and bathyscaphes and sonar probes of Scottish lakes,
Tacoma Narrows bridge collapse explained with abstract phase-space maps,
Some x-ray slides, a music score, Minard's Napoleonic war:
The most exciting frontier is charting what's already here."
  -- xkcd
On Aug 24, 2010, at 1:06 PM, Mike Williamson wrote:

> None of these values are correct except for the minimum. If I
> perform "quantile(c(6, 207936))", it gives the correct values. I
> originally presumed that summary was merely calling "quantile" if it
> saw a numeric, but this doesn't seem to be the case.

I would have assumed as you did, and I continue to think so, with an appropriate modification of "merely", after reading the code in summary.default:

    else if (is.numeric(object)) {
        nas <- is.na(object)
        object <- object[!nas]
        qq <- stats::quantile(object)
        qq <- signif(c(qq[1L:3L], mean(object), qq[4L:5L]), digits)
        names(qq) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
        if (any(nas))
            c(qq, `NA's` = sum(nas))
        else qq
    }

Notice the digits argument:

    > summary(c(6,207936))
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
          6   51990  104000  104000  156000  207900
    > quantile(c(6,207936))
          0%      25%      50%      75%     100%
         6.0  51988.5 103971.0 155953.5 207936.0
    > summary(c(6,207936), digits=6)
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
         6.0  51988.5 103971.0 103971.0 155954.0 207936.0

David Winsemius, MD
West Hartford, CT
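The rounding step in the excerpt above can be reproduced by hand. What follows is a minimal sketch, not the actual summary.default source: it computes the five quantiles and the mean, then applies signif() with 4 significant digits, which appears to be the value in effect in the original post. The variable names (x, q, stats6, rounded) are only illustrative. The sketch calls quantile() and signif() directly because the printed output of summary() itself may differ between R versions.

```r
x <- c(6, 207936)

# Unrounded quantiles (min, quartiles, max); R's default is type 7.
q <- quantile(x)

# Same ordering that summary() reports: min, Q1, median, mean, Q3, max.
stats6 <- c(q[1:3], mean(x), q[4:5])
names(stats6) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")

# Keep 4 significant digits, as in the output from the original post.
rounded <- signif(stats6, 4)
print(rounded)

# Keeping 6 significant digits leaves the 6-digit maximum untouched.
print(signif(stats6, 6))
```

With 4 significant digits the maximum 207936 becomes 207900, so the "wrong" max in the original output is a display rounding and not a miscalculation.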
summary.default uses the signif function to round for display purposes. In ?summary, we can see that the digits argument controls the value passed to signif.

    > lapply(1:6, function(x) summary(c(6, 207936), digits = x))
    [[1]]
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      6e+00   5e+04   1e+05   1e+05   2e+05   2e+05

    [[2]]
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
          6   52000  100000  100000  160000  210000

    [[3]]
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
          6   52000  104000  104000  156000  208000

    [[4]]
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
          6   51990  104000  104000  156000  207900

    [[5]]
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
          6   51988  103970  103970  155950  207940

    [[6]]
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
         6.0  51988.5 103971.0 103971.0 155954.0 207936.0

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
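The Max. column in the listing above is simply signif() applied to 207936 with 1 to 6 significant digits. A small sketch (the names max_value and by_digits are illustrative) that also contrasts signif(), which counts significant digits, with round(), which counts decimal places:

```r
max_value <- 207936

# One result per digits value from 1 to 6.
by_digits <- sapply(1:6, function(d) signif(max_value, d))
print(by_digits)

# round() counts decimal places instead, so an integer is left unchanged ...
unchanged <- round(max_value, 4)
print(unchanged)

# ... unless the digits value is negative, which rounds to tens, hundreds, etc.
to_hundreds <- round(max_value, -2)
print(to_hundreds)
```

Only with 6 significant digits does the full value 207936 survive, which matches the last element of the list above.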
On 2010-08-24 11:06, Mike Williamson wrote:

> On a related note, what is the statistically correct answer for
> calculating the 1st quartile & 3rd quartile when only 2 values are
> present? I presume one takes the mid-point between the median (also
> calculated) and the min or max. So in this case, 51988.5 for 1st &
> 155953.5 for 3rd (which is what quantile calculates). But taking
> 25% & 75% of the sum of the 2 also seems "reasonable". Either way,
> "summary" is calculating the wrong number, and most disturbing is
> that it mis-calculates the max.

This is one of those (many) situations where reading the help pages really helps nicely:

help(summary) points you to the 'digits' argument (as David has said) and that probably defaults to 'digits=4' for you. So, no, R is not miscalculating anything.

help(quantile) shows that there are quite a few ways to define quantiles and that R defaults to 'type=7'.

-Peter Ehlers
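To make the 'type=7' remark concrete, here is a hedged sketch of the type-7 definition as linear interpolation between order statistics. The helper type7_quantile is hypothetical (it is not part of R) and is only meant to show where 51988.5 and 155953.5 come from; the loop over types 1 to 9 relies on the type argument documented in help(quantile).

```r
# Hypothetical helper, not part of R: type-7 quantile by linear interpolation.
type7_quantile <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  h <- (n - 1) * p + 1      # fractional position in the sorted data
  lo <- floor(h)
  hi <- ceiling(h)
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

x <- c(6, 207936)
probs <- c(0, 0.25, 0.5, 0.75, 1)

# With two values, Q1 is 0.75 * min + 0.25 * max, i.e. the midpoint
# between the minimum and the median.
by_hand <- type7_quantile(x, probs)
built_in <- quantile(x, probs, type = 7, names = FALSE)
print(by_hand)
print(built_in)

# First quartile under each of the nine definitions listed in ?quantile.
q1_by_type <- sapply(1:9, function(t) quantile(x, 0.25, type = t, names = FALSE))
print(q1_by_type)
```

The different types generally disagree for a sample this small, which is why there is no single "statistically correct" quartile for two values; one simply has to state which definition is in use.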