On 23/02/05 I suggested that, given R had included TOMS 708 to correct the poor performance of pbeta, TOMS 654 should be included to fix all the pgamma problems. I was slightly surprised to find Morten's code had been included instead 2 days later. I noticed, but did not worry, that the reference to me had been removed.

The derivation of the asymptotic expansion for the gamma distribution used by Morten can be found at http://members.aol.com/iandjmsmith/PoissonApprox.htm
It is fairly easy to understand and to find error bounds for, and hence to include sensibly in an algorithm to calculate pgamma.

The basis and accuracy of some of the algorithms I use are discussed in http://members.aol.com/iandjmsmith/Accuracy.htm
In this case, the absolute error in the log of the probability gives a good indication of the accuracy of your answer. In the least extreme example you consider, pgamma(0.9*1e25, 1e25, log=T), the absolute error would be about 5360515, and if you exponentiated the result the relative error would be about 10 to the power 2328042. So the answer you wish to calculate is K times 10 to the power -2.32804237034994E+22, where K is somewhere between 10 to the power plus or minus 2328042. In other words, when you make the changes to correct this problem, your calculation will still return values with no real meaning, but at least users might be aware of this, which would be no bad thing! For me this answer is possibly so meaningless that NaN would be preferable.
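To see roughly where those numbers come from, here is a back-of-envelope sketch in R. It is my own reconstruction using only the leading term of the expansion, so the figures are approximate:

  ## leading term of the log of pgamma(x, a) for x = lambda*a, lambda < 1:
  ##   log p ~ -a * (lambda - 1 - log(lambda))
  a <- 1e25
  lambda <- 0.9
  logp <- -a * (lambda - 1 - log(lambda))  # about -5.36e22
  logp / log(10)                           # about -2.328e22, the power of 10 quoted above
  ## a relative error of order 1e-16 in the inputs and arithmetic becomes
  ## an absolute error in the log of the probability of about
  abs(logp) * 1e-16                        # about 5.4e6, i.e. the 5360515 above
  ## and exponentiating turns that into a relative error of
  ## 10^(5.4e6/log(10)), i.e. about 10^2328042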
I did mention to Morten that I had updated my code, but I believed that for Gnumeric he was quite satisfied with what he had. If you look at the VBA code at http://members.aol.com/iandjmsmith/Examples.txt you can see the changes I made to stop the overflow problems you seem to be worried about. My code for the pdf of the gamma distribution still fails for shape parameters > 2e307, due to multiplication of the shape parameter by 2*pi. The code for dgamma will have the same problem unless it is hidden by use of an 80-or-more-bit floating point processor. The code for the asymptotic expansion for the gamma distribution seems to be fine for any number, excluding silly ones like NaN and Inf. Indeed, it takes the difference from the mean as a parameter, and if you supply an accurate value you get a sensible answer, as mentioned in http://members.aol.com/iandjmsmith/Accuracy.htm

I do not share your apparent sense of panic on this matter. I have no problems with error signals like NaNs, because it is obvious to the user that things have gone wrong. Inaccurate answers when the user has no reason to expect them are usually far more difficult to spot, and in many cases the results are just believed. That, for me, is a serious problem. I think you will find that the pgamma code of 2.0.0 did not work for small shape parameters (similar to the problems still exhibited by pbeta for small parameters, see PR#2894), was inaccurate for large shape parameters (> 1e5), where it resorted to the normal approximation, and was pretty slow in between. Indeed, the normal approximation was the cause of PR#7307.

I don't understand your comments about

  "pt_ = -x * (log(1 + (lambda-x)/x) - (lambda-x)/x) = -x * log((lambda-x)/x) - (lambda-x)
  and naively assumes that this is small enough to use a power series expansion
  in 1/x with coefficients as powers of pt_. To make matters worse, consider ..."

In the example you go on to discuss, |(lambda-x)/x| is 0.1, and I don't think it can be bigger than 0.2. Calculating log(1+x)-x is done several ways: if |x| < 0.01 it is evaluated by a power series; if x < -0.8 or x > 1 it uses log1p(x)-x; and for other values it uses a continued fraction which essentially evaluates more of the same series used when |x| < 0.01.

Your suggestion of replacing logspace_add and logspace_sub with simpler code looks at first sight to be a very sensible improvement. However, I would be a bit nervous that lnd-lnp could be very large, so that the exp of it could return infinity. I'm sure this can be accounted for in the code, with lnp + log1p(f*exp(lnd-lnp)) evaluated as lnp or log(f)+lnd accordingly. (I sketch such a guard at the end of this message.)

I am not responsible for the code for calculating the logs of probabilities, but I seem to remember that the extremely poor performance of the algorithms in R 2.0.0 with logged probabilities was one of the reasons Morten became interested in changing the pgamma code (see PR#7307). I have had a quick look, and once the corrections mentioned above are made it should be giving nonsense answers with no difficulty.

Unfortunately, there are still a few examples of sloppy coding and accuracy errors remaining in R.

The non-central distribution functions have horrible cancellation errors from forming 1-p (see PR#7099), and separate code is required for the two tails of the distributions to get round the problem.

The fix for PR#8251 is a kludge and just moves the inaccuracies to examples with higher non-centrality parameters.

pt(x,1) will overflow or return 0 for values < -2e154 on 64-bit implementations. pcauchy works, but I believe the pt function is also supposed to work for non-integral degrees of freedom, so making it work for one degree of freedom via pcauchy is hardly much use.

qnbinom(1e-10,1e3,1e-7,TRUE,FALSE) is slow, and by varying the parameters qnbinom can be made very slow indeed. I do not think there is anything wrong with the Cornish-Fisher expansion; it just seems that it is not always very good for the negative binomial distribution. In the example above, the initial approximation is out by 2e6.

A slightly different problem, which can cause qnbinom and qbinom to go into infinite loops, is when the q-value is too big. For example, qnbinom(1E-300,0.000002,10000000000) should return approximately 4.99813787561159E+15, but the code gets to values of y where one of the statements y := y + 1 or y := y - 1 is executed but does not alter the value of y, so the loop never terminates.

  df(0,2,2,FALSE) should be 1, not 0
  df(0,df,2,FALSE) should be infinity for df < 2, not 0
  dbeta(1e-162,1e-162,1e-162,FALSE) should be 0.5, not 0

Presumably R also has similar problems with the pbeta function. As I recall, the TOMS 708 code was pretty much included without edits, and therefore didn't calculate logs of probabilities except by calculating the probability and then logging it. I assumed this was why it was not used for small shape parameters, where the current code does not work, although it did not seem logical to me. Of course, my memory is not what it was, but if that is the case and there are problems with modifying the TOMS code, you could try an asymptotic expansion based on http://members.aol.com/iandjmsmith/BinomialApprox.htm
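On logging probabilities, here is a small illustration of the difference computing on the log scale makes. These calls assume a version of R whose pbeta handles log.p directly for such arguments; the numbers are mine, not taken from the R sources:

  ## computing the probability and then logging it loses everything
  ## once the probability underflows:
  log(pbeta(1e-300, 5, 5))           # pbeta underflows to 0, so this gives -Inf
  ## computing on the log scale directly keeps the information;
  ## here log p ~ 5*log(1e-300) + log(126), about -3449:
  pbeta(1e-300, 5, 5, log.p = TRUE)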
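And here is the kind of guard I have in mind for the logspace calculation mentioned above. This is a hypothetical sketch only: lnp, lnd and f stand for the log probability, the log difference and the factor from that discussion, not names from the R sources, and f is assumed positive:

  ## evaluates lnp + log1p(f * exp(lnd - lnp)) without overflowing
  ## when lnd - lnp is very large
  logspace_combine <- function(lnp, lnd, f) {
    d <- lnd - lnp
    if (d > 700)            # exp(d) would overflow; log1p(f*exp(d)) ~ log(f) + d
      log(f) + lnd
    else if (d < -700)      # exp(d) underflows to 0, so log1p(...) ~ 0
      lnp
    else
      lnp + log1p(f * exp(d))
  }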
This response has been very rushed. I do not write well when I have plenty of time, and I felt I had so many different things to say, so I apologise if it is all a bit of a jumble.

Ian Smith
Prof Brian Ripley
2006-Feb-07 08:52 UTC
[Rd] Problems in d-p-q-r functions (was PR#8528, but unrelated)
For the record, some of these claims are untrue:

  > df(0, 2, 2)
  [1] 1
  > df(0, 1.3, 2)
  [1] Inf
  > x <- 1e-170
  > pbeta(x, x, x)
  [1] 0.5

qnbinom(1e-10,1e3,1e-7,TRUE,FALSE) is an error, and so is qnbinom(1E-300,0.000002,10000000000). Here we can guess at what you meant, maybe correctly. There were comments in the source code about needing a better search, and I have recently implemented one (the idea is sketched at the end of this message). So

  > qnbinom(1e-10, 1e3, 1e-7)  # instant
  [1] 8117986721
  > qnbinom(0.5, 10000000000, 0.000000002)
  [1] 5e+18
  > qnbinom(1e-300, 10000000000, 0.000002)
  [1] 4.998138e+15

seem to be solved.

There were two problems with dbeta: one easily overcome (f underflows), the other fundamental to the way dbinom_raw is computed (n*p can underflow). What I cannot see is why a formula which worked correctly in this region was replaced by one that did not.

It is precisely in order not to generate such errors that I used TOMS 708 only in the area where the existing algorithm is problematic. (It may be better elsewhere, but I did not have the time to do the requisite analysis. It seems neither do some other people: I would prefer not to spend the time to clear up after such unneeded changes.)

On pt, thank you for the report. pt(x, df=1) is not interesting for |x| > 1e150, but it is for smaller values of df, and those were underflowing. It is easy to use an asymptotic formula to regain the accuracy.
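For concreteness, here is a rough version of such a formula. It is a back-of-envelope derivation from the tail of the t density, not necessarily the formula now in R: for x -> -Inf,

  P(T_df < x) ~ C * df^((df-1)/2) * |x|^(-df),  C = gamma((df+1)/2) / (sqrt(df*pi) * gamma(df/2))

and its log stays finite long after the probability itself would underflow:

  ## log of the lower tail of pt(x, df), valid only for x large and negative
  pt_tail_log <- function(x, df) {
    logC <- lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    logC + (df - 1) / 2 * log(df) - df * log(-x)
  }
  pt_tail_log(-1e160, 1)   # about -369.6; the old pt(-1e160, 1) returned 0 here,
                           # although the true value ~3.2e-161 is representable
  pt_tail_log(-1e6, 0.5)   # about -8.10; should agree with pt(-1e6, 0.5, log.p = TRUE)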
R was potentially generating reports of lack of convergence and loss of accuracy in quite a number of its algorithms, but for reasons unknown to me these were being buried (ML_ERROR did nothing, and has not for a very long time). It's a matter of debate whether in some of these it would be better to return NaN as well, but warnings should have been generated (and now are).

As for 'panic' (your word: why is it 'panic' to submit a correct bug report?), a major platform returning +Inf for a log probability is very bad news, as is another failing a regression test by getting NaN for a probability which is 0.5.
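To give the idea of the quantile search mentioned above, here is an illustrative R sketch (not the code that went into R): start from the initial approximation, double a step until the target quantile is bracketed, then bisect, so the cost is logarithmic in the distance from the starting guess.

  ## smallest y with pnbinom(y, size, prob) >= p, reached from a starting guess
  qnbinom_search <- function(p, size, prob, start) {
    y <- max(round(start), 0)
    if (pnbinom(y, size, prob) < p) {   # walk right, doubling the step
      step <- 1
      while (pnbinom(y + step, size, prob) < p) { y <- y + step; step <- 2 * step }
      lo <- y; hi <- y + step
    } else {                            # walk left, doubling the step
      step <- 1
      while (y - step >= 0 && pnbinom(y - step, size, prob) >= p) { y <- y - step; step <- 2 * step }
      lo <- max(y - step, 0); hi <- y
    }
    while (hi - lo > 1) {               # bisect the bracketing interval
      mid <- floor((lo + hi) / 2)
      if (pnbinom(mid, size, prob) < p) lo <- mid else hi <- mid
    }
    if (pnbinom(lo, size, prob) >= p) lo else hi
  }
  qnbinom_search(1e-10, 1e3, 1e-7, start = 1e10)  # 8117986721, near-instant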
On Sat, 28 Jan 2006, IandJMSmith at aol.com wrote:
> [...]

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595