Dear R users,

This may be a simple question, but I would appreciate any thoughts.

Does anyone know how you would get one lower and one upper confidence interval for a set of data that consists of proportions? I.e., taking the usual confidence interval for normal data would result in the lower confidence limit being negative, which is not possible given the data (which are constrained between 0 and 1).

I can see how you calculate an upper and lower confidence interval for a single proportion, but not for a set of proportions.

Many thanks,

Darren Shaw

-----------------------------------------------------------------
Dr Darren J Shaw
Centre for Tropical Veterinary Medicine (CTVM)
The University of Edinburgh
Scotland, EH25 9RG, UK
Darren Shaw wrote:
> does anyone know how you would get one lower and one upper confidence
> interval for a set of data that consists of proportions [...]
> i can see how you calculate a upper and lower confidence interval for a
> single proportion, but not for a set of proportions

(1) Your question appears to be a bit ``off topic''. I.e. it is really about statistical methodology, rather than about how to implement methodology in R.

(2) You need to make the scenario clearer. What do your data actually consist of? What are you assuming?

The only reasonable scenario that springs to mind (perhaps this is merely indicative of poverty of imagination on my part) is that you have a number of ***independent*** samples, each yielding a sample proportion, and each coming from the same population (or at least from populations having the same population proportion ``p''). I.e. you have p.hat_1, ..., p.hat_n, and from these you wish to calculate a confidence interval for p.

You need to know the sample ***sizes*** for each sample. If you don't, you're screwed. Full stop. There is absolutely nothing sensible you can do.

If you ***do*** know the sample sizes (say k_1, ..., k_n) then the problem is trivial. You have p.hat_j = x_j/k_j for j = 1, ..., n. Let x = x_1 + ... + x_n and k = k_1 + ... + k_n. Form p.hat = x/k. (I.e. you ***really*** just have one big happy sample.) Then calculate the confidence interval for p in the usual way:

    p.hat +/- (z-value) * sqrt(p.hat * (1 - p.hat)/k)

If this is not the scenario with which you need to cope, then you'll have to explain what that scenario actually is.

cheers,

Rolf Turner
rolf at math.unb.ca
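[Editor's note: the pooling arithmetic described above is short enough to sketch numerically. The helper below (in Python, purely for illustration; the counts are hypothetical) pools successes and sample sizes across the independent samples and then applies the usual normal approximation.]

```python
import math

def pooled_prop_ci(xs, ks, z=1.959964):
    """Wald CI for a common proportion p from several independent
    samples: pool successes xs and sample sizes ks into one big
    sample, then use p.hat +/- z * sqrt(p.hat * (1 - p.hat) / k)."""
    x, k = sum(xs), sum(ks)
    p_hat = x / k
    half = z * math.sqrt(p_hat * (1 - p_hat) / k)
    return p_hat, p_hat - half, p_hat + half

# hypothetical data: three samples with known sizes
p_hat, lo, hi = pooled_prop_ci([3, 5, 2], [20, 25, 15])
```

Note that the individual sample proportions never enter directly; only the totals matter, which is the "one big happy sample" point.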
Please see:

    Brown, Cai and DasGupta (2001) Statistical Science, 16: 101-133
    and (2002) Annals of Statistics, 30: 160-201

They show that the actual coverage probabilities of the standard approximate confidence intervals for a binomial proportion are quite poor, while the standard asymptotic theory applied to logits produces rather better answers.

I would expect "confint.glm" in library(MASS) to give decent results, possibly the best available without a very careful study of this particular question. Consider the following:

library(MASS)  # needed for confint.glm
library(boot)  # needed for inv.logit
DF10 <- data.frame(y=.1, size=10)
DF100 <- data.frame(y=.1, size=100)
fit10 <- glm(y~1, family=binomial, data=DF10, weights=size)
fit100 <- glm(y~1, family=binomial, data=DF100, weights=size)
inv.logit(coef(fit10))

(CI10 <- confint(fit10))
(CI100 <- confint(fit100))

inv.logit(CI10)
inv.logit(CI100)

In R 1.9.1, Windows 2000, I got the following:

> inv.logit(coef(fit10))
(Intercept)
        0.1
> (CI10 <- confint(fit10))
Waiting for profiling to be done...
     2.5 %     97.5 %
-5.1122123 -0.5258854
> (CI100 <- confint(fit100))
Waiting for profiling to be done...
    2.5 %    97.5 %
-2.915193 -1.594401
> inv.logit(CI10)
      2.5 %      97.5 %
0.005986688 0.371477058
> inv.logit(CI100)
    2.5 %    97.5 %
0.0514076 0.1687655
> (naiveCI10 <- .1+c(-2, 2)*sqrt(.1*.9/10))
[1] -0.08973666  0.28973666
> (naiveCI100 <- .1+c(-2, 2)*sqrt(.1*.9/100))
[1] 0.04 0.16

hope this helps.
spencer graves

Darren Shaw wrote:
> does anyone know how you would get one lower and one upper confidence
> interval for a set of data that consists of proportions [...]

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
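[Editor's note: the "asymptotic theory applied to logits" mentioned above can be sketched without any GLM machinery. The helper below (in Python, purely for illustration; the function name is hypothetical) forms a Wald interval on the logit scale, with standard error sqrt(1/(n p q)) from the delta method, and back-transforms. This is the simple Wald-on-logits interval, not the profile-likelihood interval that confint() computes, so its endpoints differ slightly from the transcript above.]

```python
import math

def logit_wald_ci(x, n, z=1.959964):
    """Wald interval on the logit scale, back-transformed to the
    probability scale; se(logit(p.hat)) is approx sqrt(1/(n p q)).
    Assumes 0 < x < n."""
    p = x / n
    eta = math.log(p / (1 - p))              # logit(p.hat)
    se = math.sqrt(1 / (n * p * (1 - p)))
    inv_logit = lambda t: 1 / (1 + math.exp(-t))
    return inv_logit(eta - z * se), inv_logit(eta + z * se)

lo, hi = logit_wald_ci(10, 100)  # roughly (0.0547, 0.1760)
```

Unlike the naive interval on the probability scale, the back-transformed endpoints always stay inside (0, 1).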
Darren also might consider binconf() in library(Hmisc).

> library(Hmisc)
> binconf(1, 10, method="all")
           PointEst        Lower     Upper
Exact           0.1  0.002528579 0.4450161
Wilson          0.1  0.005129329 0.4041500
Asymptotic      0.1 -0.085938510 0.2859385
> binconf(10, 100, method="all")
           PointEst      Lower     Upper
Exact           0.1 0.04900469 0.1762226
Wilson          0.1 0.05522914 0.1743657
Asymptotic      0.1 0.04120108 0.1587989

Spencer Graves wrote:
> Please see:
> Brown, Cai and DasGupta (2001) Statistical Science, 16: 101-133
> [...]

--
Chuck Cleland, Ph.D.
NDRI, Inc.
71 West 23rd Street, 8th floor
New York, NY 10010
tel: (212) 845-4495 (Tu, Th)
tel: (732) 452-1424 (M, W, F)
fax: (917) 438-0894
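[Editor's note: the Wilson row of binconf() comes from a closed-form score interval, easy to reproduce by hand. A sketch (in Python, purely for illustration; the function name is hypothetical) matches the 10/100 Wilson limits above. For very small x, binconf may adjust the limits, so the 10/100 row is the one to check against.]

```python
import math

def wilson_ci(x, n, z=1.959964):
    """Wilson score interval for a binomial proportion: invert the
    score test z = (p.hat - p) / sqrt(p (1 - p) / n) for p."""
    p = x / n
    z2 = z * z
    center = p + z2 / (2 * n)
    half = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    denom = 1 + z2 / n
    return (center - half) / denom, (center + half) / denom

lo, hi = wilson_ci(10, 100)  # matches binconf's Wilson row above
```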
FWIW, if the exact intervals are what is desired here, as another poster has already suggested, binom.test() will get you there:

> binom.test(1, 10)$conf.int
[1] 0.002528579 0.445016117
attr(,"conf.level")
[1] 0.95
> binom.test(10, 100)$conf.int
[1] 0.04900469 0.17622260
attr(,"conf.level")
[1] 0.95

HTH,

Marc Schwartz

On Mon, 2004-07-12 at 13:19, Chuck Cleland wrote:
> Darren also might consider binconf() in library(Hmisc).
> [...]
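[Editor's note: the "exact" (Clopper-Pearson) endpoints that binom.test() prints are defined by binomial tail probabilities: the lower limit p_L solves P(X >= x | p_L) = alpha/2 and the upper limit p_U solves P(X <= x | p_U) = alpha/2. That definition is easy to check directly; a sketch in Python (for illustration), plugging in the endpoints printed above for x = 1, n = 10:]

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(x + 1))

# endpoints printed by binom.test(1, 10)$conf.int
p_lower, p_upper = 0.002528579, 0.445016117

# at the lower limit, P(X >= 1) should equal alpha/2 = 0.025
tail_low = 1 - binom_cdf(0, 10, p_lower)
# at the upper limit, P(X <= 1) should equal alpha/2 = 0.025
tail_high = binom_cdf(1, 10, p_upper)
```

Both tails come out at 0.025, which is also why the exact interval tends to be conservative: each tail is forced to be at most alpha/2 at every attainable x.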
There's also an article by Agresti and Coull,

    author  = {Alan Agresti and B. A. Coull},
    title   = {Approximate is better than "exact" for interval
               estimation of binomial proportions},
    journal = {American Statistician},
    year    = {1998},
    volume  = {52},
    pages   = {119-126},

which may be of interest.

Peter

Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
www.peterflom.com
(212) 845-4485 (voice)
(917) 438-0894 (fax)

>>> Spencer Graves <spencer.graves at pdf.com> 7/12/2004 3:07:00 PM >>>

According to Brown, Cai and DasGupta (cited below), the "exact" confidence intervals are hyperconservative, as they are designed to produce actual coverage probabilities of at least the nominal level. Thus for a 95% confidence interval, the actual coverage could be 98% or more, depending on the true but unknown proportion; please check their papers for exact numbers. They report that the Wilson procedure performs reasonably well, as does the asymptotic logit procedure. I can't say without checking, but I would naively expect that confint.glm would likely also be among the leaders.

By the way, confint.glm is independent of the parameterization, assuming 2*log(likelihood ratio) is approximately chi-square. It is therefore subject to intrinsic nonlinearity but is at least free of parameter effects (see, e.g., Bates and Watts (1988) Nonlinear Regression Analysis and Its Applications (Wiley)). To check this, consider the following:

fit10c <- glm(y~1, family=binomial(link=cloglog), data=DF10, weights=size)
fit100c <- glm(y~1, family=binomial(link=cloglog), data=DF100, weights=size)
(CI10c <- confint(fit10c))
(CI100c <- confint(fit100c))

> 1-exp(-exp(CI10c))
      2.5 %      97.5 %
0.005989334 0.371562793
> 1-exp(-exp(CI100c))
     2.5 %     97.5 %
0.05140762 0.16875918

These are essentially the numbers reported below with the default binomial link = logit.

hope this helps.
spencer graves

Chuck Cleland wrote:
> Darren also might consider binconf() in library(Hmisc).
> [...]
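[Editor's note: the parameterization invariance noted above holds because the likelihood ratio does not change under p -> logit(p) or p -> cloglog(p). For a single binomial proportion the profile-likelihood interval is simply the set of p with 2*(logLik(p.hat) - logLik(p)) <= qchisq(0.95, 1). A bisection sketch (in Python, purely for illustration; the function names are hypothetical) recovers endpoints for x = 1, n = 10 that agree with the confint() transcripts above to about three decimal places; confint() itself profiles with interpolation, hence the small per-link differences.]

```python
import math

CHISQ_95_1DF = 3.841459  # qchisq(0.95, df=1)

def loglik(p, x, n):
    """Binomial log-likelihood up to a constant; assumes 0 < x < n."""
    return x * math.log(p) + (n - x) * math.log(1 - p)

def profile_ci(x, n, tol=1e-10):
    """Likelihood-ratio CI: solve 2*(l(p.hat) - l(p)) = chi2 critical
    value on each side of p.hat by bisection."""
    p_hat = x / n
    l_max = loglik(p_hat, x, n)
    def dev(p):
        return 2 * (l_max - loglik(p, x, n)) - CHISQ_95_1DF
    def bisect(a, b):
        while b - a > tol:
            mid = (a + b) / 2
            if dev(a) * dev(mid) <= 0:   # sign change in [a, mid]
                b = mid
            else:
                a = mid
        return (a + b) / 2
    return bisect(1e-12, p_hat), bisect(p_hat, 1 - 1e-12)

lo, hi = profile_ci(1, 10)  # close to inv.logit(CI10) above
```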
There has been a plethora of responses over the past hour or so to a question posed by Darren Shaw about how to estimate (get a confidence interval for) a proportion based on a data set consisting of a number of proportions. These responses have all been off the point. I would suggest to the responders: RTFQ.

The question was not about how to calculate a confidence interval for a proportion. Responders have gone on and on with academic wanking about alternatives to the ``standard'' procedure, some of which give better coverage properties (and some of which don't; so-called ``exact'' methods are notoriously bad). The point of the question was how to combine the information from a number of (sample) proportions.

If the structure and context are as I conjectured in my posting then (a) this is simple, and (b) the combined sample size is almost surely large enough so that the simple and easy standard procedure will produce an eminently adequate result. (Thus making the alternative approaches even more of an academic wank than they usually are.)

I think at this point it is worthwhile repeating the quote posted a while back by Doug Bates. (He attributed the quote to George Box, but was unable to supply a citation; I wrote to Box asking him about the quote, and he said ``Nope. 'Twarn't me.'') But irrespective of the source of the quote, the point it makes is valid:

    ``You have a big approximation and a small approximation. The big
    approximation is your approximation to the problem you want to
    solve. The small approximation is involved in getting the solution
    to the approximate problem.''

That is to say, there are ***many*** effects which will have an impact on the proportion estimate required. (Were the samples really random? Were they really independent? Were they really all taken from the same population, or from populations with the same population proportion?) The impact of such considerations causes the issue of the roughness of the usual/standard approximate CI for a proportion to pale by comparison.

cheers,

Rolf Turner
rolf at math.unb.ca