On 23/02/05 I suggested that, given R had included TOMS 708 to correct the poor performance of pbeta, TOMS 654 should be included to fix all the pgamma problems. I was slightly surprised to find Morten's code had been included instead 2 days later. I noticed, but did not worry, that the reference to me had been removed.

The derivation of the asymptotic expansion for the gamma distribution used by Morten can be found at http://members.aol.com/iandjmsmith/PoissonApprox.htm
It is fairly easy to understand and to find error bounds for, and hence to include sensibly in an algorithm to calculate pgamma.

The basis and accuracy of some of the algorithms I use are discussed in http://members.aol.com/iandjmsmith/Accuracy.htm
In this case, the absolute error in the log of the probability gives a good indication of the accuracy of your answer. In the least extreme example you consider, pgamma(0.9*1e25, 1e25, log=T), the absolute error would be about 5360515, and if you exponentiated the result the relative error would be about 10 to the power 2328042. So the answer you wish to calculate is K times 10 to the power -2.32804237034994E+22, where K is somewhere between 10 to the power plus or minus 2328042. In other words, when you make the changes to correct this problem, your calculation will still return values with no real meaning, but at least users might be aware of this, which would be no bad thing! For me this answer is possibly so meaningless that NaN would be preferable.
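To see roughly where those numbers come from, here is a back-of-envelope sketch in R. It is my own reconstruction using only the leading term of the expansion, so the figures are approximate:

  ## leading term of the log of pgamma(x, a) for x = lambda*a, lambda < 1:
  ##   log p ~ -a * (lambda - 1 - log(lambda))
  a <- 1e25
  lambda <- 0.9
  logp <- -a * (lambda - 1 - log(lambda))  # about -5.36e22
  logp / log(10)                           # about -2.328e22, the power of 10 quoted above
  ## a relative error of order 1e-16 in the inputs and arithmetic becomes
  ## an absolute error in the log of the probability of about
  abs(logp) * 1e-16                        # about 5.4e6, i.e. the 5360515 above
  ## and exponentiating turns that into a relative error of
  ## 10^(5.4e6/log(10)), i.e. about 10^2328042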
I did mention to Morten that I had updated my code, but I believed that for Gnumeric he was quite satisfied with what he had. If you look at the VBA code at http://members.aol.com/iandjmsmith/Examples.txt you can see the changes I made to stop the overflow problems you seem to be worried about. My code for the pdf of the gamma distribution still fails for shape parameters > 2e307, due to multiplication of the shape parameter by 2*pi. The code for dgamma will have the same problem unless it is hidden by use of an 80-or-more-bit floating point processor. The code for the asymptotic expansion for the gamma distribution seems to be fine for any number, excluding silly ones like NaN and Inf. Indeed, it takes the difference from the mean as a parameter, and if you supply an accurate value you get a sensible answer, as mentioned in http://members.aol.com/iandjmsmith/Accuracy.htm

I do not share your apparent sense of panic on this matter. I have no problems with error signals like NaNs, because it is obvious to the user that things have gone wrong. Inaccurate answers when the user has no reason to expect them are usually far more difficult to spot, and in many cases the results are just believed. That, for me, is a serious problem. I think you will find that the pgamma code of 2.0.0 did not work for small shape parameters (similar to the problems still exhibited by pbeta for small parameters, see PR#2894), was inaccurate for large shape parameters (> 1e5), where it resorted to the normal approximation, and was pretty slow in between. Indeed, the normal approximation was the cause of PR#7307.

I don't understand your comments about

  "pt_ = -x * (log(1 + (lambda-x)/x) - (lambda-x)/x) = -x * log((lambda-x)/x) - (lambda-x)
  and naively assumes that this is small enough to use a power series expansion
  in 1/x with coefficients as powers of pt_. To make matters worse, consider ..."

In the example you go on to discuss, |(lambda-x)/x| is 0.1, and I don't think it can be bigger than 0.2. Calculating log(1+x)-x is done several ways: if |x| < 0.01 it is evaluated by a power series; if x < -0.8 or x > 1 it uses log1p(x)-x; and for other values it uses a continued fraction which essentially evaluates more of the same series used when |x| < 0.01.

Your suggestion of replacing logspace_add and logspace_sub with simpler code looks at first sight to be a very sensible improvement. However, I would be a bit nervous that lnd-lnp could be very large, so that the exp of it could return infinity. I'm sure this can be accounted for in the code, with lnp + log1p(f*exp(lnd-lnp)) evaluated as lnp or log(f)+lnd accordingly. (I sketch such a guard at the end of this message.)

I am not responsible for the code for calculating the logs of probabilities, but I seem to remember that the extremely poor performance of the algorithms in R 2.0.0 with logged probabilities was one of the reasons Morten became interested in changing the pgamma code (see PR#7307). I have had a quick look, and once the corrections mentioned above are made it should be giving nonsense answers with no difficulty.

Unfortunately, there are still a few examples of sloppy coding and accuracy errors remaining in R.

The non-central distribution functions have horrible cancellation errors from forming 1-p (see PR#7099), and separate code is required for the two tails of the distributions to get round the problem.

The fix for PR#8251 is a kludge and just moves the inaccuracies to examples with higher non-centrality parameters.

pt(x,1) will overflow or return 0 for values < -2e154 on 64-bit implementations. pcauchy works, but I believe the pt function is also supposed to work for non-integral degrees of freedom, so making it work for one degree of freedom via pcauchy is hardly much use.

qnbinom(1e-10,1e3,1e-7,TRUE,FALSE) is slow, and by varying the parameters qnbinom can be made very slow indeed. I do not think there is anything wrong with the Cornish-Fisher expansion; it just seems that it is not always very good for the negative binomial distribution. In the example above, the initial approximation is out by 2e6.

A slightly different problem, which can cause qnbinom and qbinom to go into infinite loops, is when the q-value is too big. For example, qnbinom(1E-300,0.000002,10000000000) should return approximately 4.99813787561159E+15, but the code gets to values of y where one of the statements y := y + 1 or y := y - 1 is executed but does not alter the value of y, so the loop never terminates.

  df(0,2,2,FALSE) should be 1, not 0
  df(0,df,2,FALSE) should be infinity for df < 2, not 0
  dbeta(1e-162,1e-162,1e-162,FALSE) should be 0.5, not 0

Presumably R also has similar problems with the pbeta function. As I recall, the TOMS 708 code was pretty much included without edits, and therefore didn't calculate logs of probabilities except by calculating the probability and then logging it. I assumed this was why it was not used for small shape parameters, where the current code does not work, although it did not seem logical to me. Of course, my memory is not what it was, but if that is the case and there are problems with modifying the TOMS code, you could try an asymptotic expansion based on http://members.aol.com/iandjmsmith/BinomialApprox.htm
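On logging probabilities, here is a small illustration of the difference computing on the log scale makes. These calls assume a version of R whose pbeta handles log.p directly for such arguments; the numbers are mine, not taken from the R sources:

  ## computing the probability and then logging it loses everything
  ## once the probability underflows:
  log(pbeta(1e-300, 5, 5))           # pbeta underflows to 0, so this gives -Inf
  ## computing on the log scale directly keeps the information;
  ## here log p ~ 5*log(1e-300) + log(126), about -3449:
  pbeta(1e-300, 5, 5, log.p = TRUE)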
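And here is the kind of guard I have in mind for the logspace calculation mentioned above. This is a hypothetical sketch only: lnp, lnd and f stand for the log probability, the log difference and the factor from that discussion, not names from the R sources, and f is assumed positive:

  ## evaluates lnp + log1p(f * exp(lnd - lnp)) without overflowing
  ## when lnd - lnp is very large
  logspace_combine <- function(lnp, lnd, f) {
    d <- lnd - lnp
    if (d > 700)            # exp(d) would overflow; log1p(f*exp(d)) ~ log(f) + d
      log(f) + lnd
    else if (d < -700)      # exp(d) underflows to 0, so log1p(...) ~ 0
      lnp
    else
      lnp + log1p(f * exp(d))
  }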
This response has been very rushed. I do not write well when I have plenty of time, and I felt I had so many different things to say, so I apologise if it is all a bit of a jumble.

Ian Smith
Prof Brian Ripley
2006-Feb-07 08:52 UTC
[Rd] Problems in d-p-q-r functions (was PR#8528, but unrelated)
For the record, some of these claims are untrue:

  > df(0, 2, 2)
  [1] 1
  > df(0, 1.3, 2)
  [1] Inf
  > x <- 1e-170
  > pbeta(x, x, x)
  [1] 0.5

qnbinom(1e-10,1e3,1e-7,TRUE,FALSE) is an error, and so is qnbinom(1E-300,0.000002,10000000000). Here we can guess at what you meant, maybe correctly. There were comments in the source code about needing a better search, and I have recently implemented one (the idea is sketched at the end of this message). So

  > qnbinom(1e-10, 1e3, 1e-7)  # instant
  [1] 8117986721
  > qnbinom(0.5, 10000000000, 0.000000002)
  [1] 5e+18
  > qnbinom(1e-300, 10000000000, 0.000002)
  [1] 4.998138e+15

seem to be solved.

There were two problems with dbeta: one easily overcome (f underflows), the other fundamental to the way dbinom_raw is computed (n*p can underflow). What I cannot see is why a formula which worked correctly in this region was replaced by one that did not.

It is precisely in order not to generate such errors that I used TOMS 708 only in the area where the existing algorithm is problematic. (It may be better elsewhere, but I did not have the time to do the requisite analysis. It seems neither do some other people: I would prefer not to spend the time to clear up after such unneeded changes.)

On pt, thank you for the report. pt(x, df=1) is not interesting for |x| > 1e150, but it is for smaller values of df, and those were underflowing. It is easy to use an asymptotic formula to regain the accuracy.
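For concreteness, here is a rough version of such a formula. It is a back-of-envelope derivation from the tail of the t density, not necessarily the formula now in R: for x -> -Inf,

  P(T_df < x) ~ C * df^((df-1)/2) * |x|^(-df),  C = gamma((df+1)/2) / (sqrt(df*pi) * gamma(df/2))

and its log stays finite long after the probability itself would underflow:

  ## log of the lower tail of pt(x, df), valid only for x large and negative
  pt_tail_log <- function(x, df) {
    logC <- lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    logC + (df - 1) / 2 * log(df) - df * log(-x)
  }
  pt_tail_log(-1e160, 1)   # about -369.6; the old pt(-1e160, 1) returned 0 here,
                           # although the true value ~3.2e-161 is representable
  pt_tail_log(-1e6, 0.5)   # about -8.10; should agree with pt(-1e6, 0.5, log.p = TRUE)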
R was potentially generating reports of lack of convergence and loss of accuracy in quite a number of its algorithms, but for reasons unknown to me these were being buried (ML_ERROR did nothing, and has not for a very long time). It's a matter of debate whether in some of these it would be better to return NaN as well, but warnings should have been generated (and now are).

As for 'panic' (your word: why is it 'panic' to submit a correct bug report?), a major platform returning +Inf for a log probability is very bad news, as is another failing a regression test by getting NaN for a probability which is 0.5.
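To give the idea of the quantile search mentioned above, here is an illustrative R sketch (not the code that went into R): start from the initial approximation, double a step until the target quantile is bracketed, then bisect, so the cost is logarithmic in the distance from the starting guess.

  ## smallest y with pnbinom(y, size, prob) >= p, reached from a starting guess
  qnbinom_search <- function(p, size, prob, start) {
    y <- max(round(start), 0)
    if (pnbinom(y, size, prob) < p) {   # walk right, doubling the step
      step <- 1
      while (pnbinom(y + step, size, prob) < p) { y <- y + step; step <- 2 * step }
      lo <- y; hi <- y + step
    } else {                            # walk left, doubling the step
      step <- 1
      while (y - step >= 0 && pnbinom(y - step, size, prob) >= p) { y <- y - step; step <- 2 * step }
      lo <- max(y - step, 0); hi <- y
    }
    while (hi - lo > 1) {               # bisect the bracketing interval
      mid <- floor((lo + hi) / 2)
      if (pnbinom(mid, size, prob) < p) lo <- mid else hi <- mid
    }
    if (pnbinom(lo, size, prob) >= p) lo else hi
  }
  qnbinom_search(1e-10, 1e3, 1e-7, start = 1e10)  # 8117986721, near-instant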
On Sat, 28 Jan 2006, IandJMSmith at aol.com wrote:
> [...]

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595