Dear R users,

This may be a simple question, but I would appreciate any thoughts.

Does anyone know how you would get one lower and one upper confidence
interval for a set of data that consists of proportions?  I.e. taking a
usual confidence interval for normal data would result in the lower
confidence interval being negative, which is not possible given the
data (which is constrained between 0 and 1).

I can see how you calculate an upper and lower confidence interval for
a single proportion, but not for a set of proportions.

Many thanks

Darren Shaw

-----------------------------------------------------------------
Dr Darren J Shaw
Centre for Tropical Veterinary Medicine (CTVM)
The University of Edinburgh
Scotland, EH25 9RG, UK
Darren Shaw wrote:
> does anyone know how you would get one lower and one upper confidence
> interval for a set of data that consists of proportions ...
> i can see how you calculate an upper and lower confidence interval for
> a single proportion, but not for a set of proportions

(1) Your question appears to be a bit ``off topic''.  I.e. it is really
about statistical methodology, rather than about how to implement
methodology in R.

(2) You need to make the scenario clearer.  What do your data actually
consist of?  What are you assuming?

The only reasonable scenario that springs to mind (perhaps this is
merely indicative of poverty of imagination on my part) is that you
have a number of ***independent*** samples, each yielding a sample
proportion, and each coming from the same population (or at least from
populations having the same population proportion ``p'').  I.e. you
have p.hat_1, ..., p.hat_n, and from these you wish to calculate a
confidence interval for p.

You need to know the sample ***sizes*** for each sample.  If you don't,
you're screwed.  Full stop.  There is absolutely nothing sensible you
can do.

If you ***do*** know the sample sizes (say k_1, ..., k_n) then the
problem is trivial.  You have p.hat_j = x_j/k_j for j = 1, ..., n.  Let
x = x_1 + ... + x_n and k = k_1 + ... + k_n.  Form p.hat = x/k.  (I.e.
you ***really*** just have one big happy sample.)  Then calculate the
confidence interval for p in the usual way:

	p.hat +/- (z-value) * sqrt(p.hat * (1 - p.hat)/k)

If this is not the scenario with which you need to cope, then you'll
have to explain what that scenario actually is.

cheers,

Rolf Turner
rolf at math.unb.ca
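In R, the pooling recipe above can be sketched as follows (the counts
and sample sizes here are made up purely for illustration):

```r
## Hypothetical data: successes x_j and *known* sample sizes k_j
x <- c(3, 7, 5)
k <- c(40, 60, 50)

## Pool into "one big happy sample"
p.hat <- sum(x) / sum(k)

## Usual large-sample 95% confidence interval for p
se <- sqrt(p.hat * (1 - p.hat) / sum(k))
ci <- p.hat + c(-1, 1) * qnorm(0.975) * se
ci   # roughly (0.052, 0.148) for these made-up data
```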
Please see: 
      Brown, Cai and DasGupta (2001) Statistical Science, 16:  101-133 
and (2002) Annals of Statistics, 30:  160-201 
      They show that the actual coverage probability of the standard 
approximate confidence intervals for a binomial proportion is quite 
poor, while the standard asymptotic theory applied to logits produces 
rather better answers. 
      I would expect "confint.glm" in library(MASS) to give decent 
results, possibly the best available without a very careful study of 
this particular question.  Consider the following: 
  library(MASS)# needed for confint.glm
  library(boot)# needed for inv.logit
  DF10 <- data.frame(y=.1, size=10)
  DF100 <- data.frame(y=.1, size=100)
  fit10 <- glm(y~1, family=binomial, data=DF10, weights=size)
  fit100 <- glm(y~1, family=binomial, data=DF100, weights=size)
  inv.logit(coef(fit10))
 
  (CI10 <- confint(fit10))
  (CI100 <- confint(fit100))
 
  inv.logit(CI10)
  inv.logit(CI100)
      In R 1.9.1, Windows 2000, I got the following: 
>   inv.logit(coef(fit10))
(Intercept)
        0.1
>   (CI10 <- confint(fit10))
Waiting for profiling to be done...
     2.5 %     97.5 %
-5.1122123 -0.5258854
>   (CI100 <- confint(fit100))
Waiting for profiling to be done...
    2.5 %    97.5 %
-2.915193 -1.594401
>   inv.logit(CI10)
      2.5 %      97.5 %
0.005986688 0.371477058
>   inv.logit(CI100)
    2.5 %    97.5 %
0.0514076 0.1687655
>   (naiveCI10 <- .1+c(-2, 2)*sqrt(.1*.9/10))
[1] -0.08973666  0.28973666
>   (naiveCI100 <- .1+c(-2, 2)*sqrt(.1*.9/100))
[1] 0.04 0.16
      hope this helps.  spencer graves
Darren Shaw wrote:
> [original question snipped]
Darren also might consider binconf() in library(Hmisc).
 > library(Hmisc)
 > binconf(1, 10, method="all")
            PointEst        Lower     Upper
Exact           0.1  0.002528579 0.4450161
Wilson          0.1  0.005129329 0.4041500
Asymptotic      0.1 -0.085938510 0.2859385
 > binconf(10, 100, method="all")
            PointEst      Lower     Upper
Exact           0.1 0.04900469 0.1762226
Wilson          0.1 0.05522914 0.1743657
Asymptotic      0.1 0.04120108 0.1587989
Spencer Graves wrote:
> [message snipped]
-- 
Chuck Cleland, Ph.D.
NDRI, Inc.
71 West 23rd Street, 8th floor
New York, NY 10010
tel: (212) 845-4495 (Tu, Th)
tel: (732) 452-1424 (M, W, F)
fax: (917) 438-0894
FWIW, if the exact intervals are what is desired here, as another
poster has already suggested, binom.test() will get you there:

> binom.test(1, 10)$conf.int
[1] 0.002528579 0.445016117
attr(,"conf.level")
[1] 0.95

> binom.test(10, 100)$conf.int
[1] 0.04900469 0.17622260
attr(,"conf.level")
[1] 0.95

HTH,

Marc Schwartz

On Mon, 2004-07-12 at 13:19, Chuck Cleland wrote:
> [message snipped]
There's also an article by Agresti and Coull:

   Agresti, A. and Coull, B. A. (1998) "Approximate is better than
   'exact' for interval estimation of binomial proportions",
   American Statistician, 52:  119-126

which may be of interest.
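For reference, the interval that paper popularized amounts to adding
roughly two successes and two failures (the z^2/2 adjustment, for a 95%
interval) before applying the usual Wald formula.  A minimal R sketch
(the helper name agresti.coull is mine, not from any package):

```r
## Agresti-Coull "add two successes and two failures" interval;
## the adjustment z^2/2 is approximately 2 for a 95% interval
agresti.coull <- function(x, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  n.adj <- n + z^2                      # adjusted sample size
  p.adj <- (x + z^2 / 2) / n.adj        # adjusted proportion
  half  <- z * sqrt(p.adj * (1 - p.adj) / n.adj)
  c(lower = max(0, p.adj - half),       # clamp to [0, 1]
    upper = min(1, p.adj + half))
}

agresti.coull(10, 100)  # roughly (0.053, 0.176)
```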
Peter
Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
www.peterflom.com
New York, NY 10010
(212) 845-4485 (voice)
(917) 438-0894 (fax)
>>> Spencer Graves <spencer.graves at pdf.com> 7/12/2004 3:07:00 PM >>>
      According to Brown, Cai and DasGupta (cited below), the "exact" 
confidence intervals are hyperconservative, as they are designed to 
produce actual coverage probabilities of at least the nominal level. 
Thus for a 95% confidence interval, the actual coverage could be 98% or 
more, depending on the true but unknown proportion;  please check their 
papers for exact numbers.  They report that the Wilson procedure 
performs reasonably well, as does the asymptotic logit procedure.  I 
can't say without checking, but I would naively expect that confint.glm 
would likely also be among the leaders. 
      By the way, confint.glm is independent of the parameterization, 
assuming 2*log(likelihood ratio) is approximately chi-square.  It is 
therefore subject to intrinsic nonlinearity but is at least free of 
parameter effects (see, e.g., Bates and Watts (1988) Nonlinear 
Regression Analysis and Its Applications (Wiley)).  To check this, 
consider the following: 
  fit10c <- glm(y~1, family=binomial(link=cloglog), data=DF10,
                weights=size)
  fit100c <- glm(y~1, family=binomial(link=cloglog), data=DF100,
                 weights=size)
  (CI10c <- confint(fit10c))
  (CI100c <- confint(fit100c))

>   1-exp(-exp(CI10c))
      2.5 %      97.5 %
0.005989334 0.371562793
>   1-exp(-exp(CI100c))
     2.5 %     97.5 %
0.05140762 0.16875918
      These are precisely the numbers reported earlier with the default 
binomial link = logit. 
      hope this helps.  spencer graves
Chuck Cleland wrote:
> [message snipped]
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help 
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
There has been a plethora of responses over the past hour or so to a
question posed by Darren Shaw about how to estimate (get a confidence
interval for) a proportion based on a data set consisting of a number
of proportions.

These responses have been all off the point.  I would suggest to the
responders:  RTFQ

The question was not about how to calculate a confidence interval for a
proportion.  Responders have gone on and on with academic wanking about
alternatives to the ``standard'' procedure, some of which give better
coverage properties (and some of which don't; so-called ``exact''
methods are notoriously bad).

The point of the question was how to combine the information from a
number of (sample) proportions.  If the structure and context are as I
conjectured in my posting then (a) this is simple, and (b) the combined
sample size is almost surely large enough so that the simple and easy
standard procedure will produce an eminently adequate result.  (Thus
making the alternative approaches even more of an academic wank than
they usually are.)

I think at this point it is worthwhile repeating the quote posted a
while back by Doug Bates.  (He attributed the quote to George Box, but
was unable to supply a citation; I wrote to Box asking him about the
quote, and he said ``Nope.  'Twarn't me.'')  But irrespective of the
source of the quote, the point it makes is valid:

	``You have a big approximation and a small approximation.  The
	big approximation is your approximation to the problem you want
	to solve.  The small approximation is involved in getting the
	solution to the approximate problem.''

That is to say there are ***many*** effects which will have an impact
on the proportion estimate required.  (Were the samples really random?
Were they really independent?  Were they really all taken from the same
population, or populations with the same population proportion?)  The
impact of such considerations causes the issue of the roughness of the
usual/standard approximate CI for a proportion to pale by comparison.

cheers,

Rolf Turner
rolf at math.unb.ca