Tom Willems
2007-Aug-30 09:00 UTC
[R] Q: Mean, median and confidence intervals with functions "summary" & "boxplot.stats"
An embedded and charset-unspecified text was scrubbed from the message ... Name: not available URL: https://stat.ethz.ch/pipermail/r-help/attachments/20070830/e557d2a7/attachment.ksh
Uwe Ligges
2007-Aug-30 09:57 UTC
[R] Q: Mean, median and confidence intervals with functions "summary" & "boxplot.stats"
Tom Willems wrote:
> Dear R users,
>
> My question is: "How can my mean be outside the confidence intervals?!"
>
> I think I have the answer, but I would like to hear some other ideas on it.
>
> First, my data are not continuous but categorical: each value is a titre calculated from a dilution series.
> The data are stored as a column of values and a column indicating the phase of the trial.
> Theoretically a value can range from 0 to 4, but in practice only certain values occur, and they repeat.
> So they are frequencies.
>
> This is why I believe it is better to work with a median than with a mean, because it represents the cluster of values which occur most.
> Below I give only one example, but the mean falling below the lower confidence limit occurs several times over different tests.
>
> Does my answer seem reasonable, or should I perhaps use another method? Any suggestions?
>
> summary_1d = summary(subset(eda_data, phase=='1' & test=='test 1', select=lg_value), na.rm = T)
> conf_1d = boxplot.stats(subset(eda_data, phase=='1' & test=='test 1', select=lg_value))
>
> Mean    Median   95% Confidence Int.   StDev.   Variance
> 1.198   1.681    1.441 to 1.922        0.931    0.866

I do not understand which "confidence" has been calculated. Based on which assumptions and which data? Is it pointwise or not? We need much more information, and if you think it is a problem with R or with the usage of R functions, then please give us a reproducible example.

Uwe Ligges

> Kind regards,
> Tom W.
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
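A reproducible example of the kind requested could look roughly like the sketch below. The contents of eda_data are invented for illustration (the poster's data were never shown); only the column names phase, test and lg_value come from the original post. The column is pulled out as a plain numeric vector, which is the input type that ?boxplot.stats documents, instead of the one-column data frame that subset(..., select=) returns.

```r
# Hypothetical stand-in for the poster's eda_data: log titres on a coarse grid,
# mostly between 1 and 2, with a sizeable cluster at zero.
eda_data <- data.frame(
  phase    = "1",
  test     = "test 1",
  lg_value = c(rep(0, 30), rep(1.2, 10), rep(1.7, 40), rep(2.0, 20))
)

# Extract the values as a numeric vector (assumption: this is what was intended).
x <- eda_data$lg_value[eda_data$phase == "1" & eda_data$test == "test 1"]

summary(x)                 # includes the mean and the median
stats <- boxplot.stats(x)
stats$conf                 # lower and upper notch limits around the median

# With this made-up sample the mean falls below the lower notch limit.
mean(x) < stats$conf[1]
```

Posting something this small, with made-up data that still shows the behaviour, lets readers run the code and see at once which "confidence interval" is meant.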
S Ellison
2007-Aug-30 12:17 UTC
[R] Q: Mean, median and confidence intervals with functions "summary" & "boxplot.stats"
If you look at ?boxplot.stats, you will find that the confidence interval it reports is centred on the median: "The notches (if requested) extend to '+/-1.58 IQR/sqrt(n)'."

If you have skewed data it is quite possible (as you have found) that the mean lies outside median +/- 1.58 IQR/sqrt(n). All that is happening is that the majority of the data are around 1 or 2 and you have a substantial number near zero. Result: a mean much lower than the median. And with a high n, the boxplot notch is very narrow and excludes the mean.

But it does sound very much as if you are doing something questionable at best. I would not trust IQR as a dispersion measure on discrete data with few possible values even if they were on an interval scale; there is too much risk of getting the same IQR for many different distributions. On an ordinal scale it is worse; the only points that are valid at all are the valid scale values, so a CI that uses intermediate values is formally meaningless (what is a shoe size of 7.2, for example? Answer: not a shoe size at all). It is of course entirely meaningless to talk about an IQR on a categorical scale.

It sounds like boxplot.stats is an inappropriate tool for summarising your data.
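The two points made in this reply can be checked with a short sketch. The vectors x and y are invented discrete scores, not the poster's data. The sketch assumes the behaviour of the current boxplot.stats source, which takes its quartiles from fivenum() (Tukey's hinges), so "IQR" here means the hinge spread.

```r
# Hypothetical discrete scores on a 0-4 scale (not the poster's data).
x <- c(rep(0, 30), rep(1, 10), rep(2, 50), rep(3, 10))

# Reproduce the notch limits by hand: median +/- 1.58 * (hinge spread) / sqrt(n).
five   <- fivenum(x)       # min, lower hinge, median, upper hinge, max
n      <- sum(!is.na(x))
manual <- five[3] + c(-1.58, 1.58) * (five[4] - five[2]) / sqrt(n)

all.equal(manual, boxplot.stats(x)$conf)   # same limits as boxplot.stats()
mean(x) < manual[1]                        # the mean lies below the lower limit

# A differently shaped sample with the same hinges gets identical notch limits.
y <- c(rep(0, 26), rep(2, 74))
all.equal(boxplot.stats(y)$conf, boxplot.stats(x)$conf)
```

Here x takes four distinct values and y only two, yet both produce the same notch. With so few possible values the hinge spread cannot tell such distributions apart, which is the risk described above.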