Steve (Numerics) Canon via llvm-dev
2017-Jan-12 19:31 UTC
[llvm-dev] The most efficient way to implement an integer based power function pow in LLVM
> On Jan 12, 2017, at 2:21 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
>> On Jan 12, 2017, at 11:04 AM, Steve (Numerics) Canon <scanon at apple.com> wrote:
>>
>>> On Jan 12, 2017, at 12:58 PM, Friedman, Eli via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>
>>> On 1/12/2017 9:33 AM, Mehdi Amini via llvm-dev wrote:
>>>>> On Jan 12, 2017, at 5:03 AM, Antoine Pitrou via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>> On Mon, 9 Jan 2017 11:43:17 -0600
>>>>> Wei Ding via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I want an efficient way to implement the pow function in LLVM instead of
>>>>>> invoking the pow() math built-in. I am clear on the algorithm itself, but
>>>>>> I am not quite sure which parts of LLVM I should change to replace the
>>>>>> built-in pow with a more efficient implementation. Any comments and
>>>>>> feedback are appreciated. Thanks!
>>>>>
>>>>> In Numba, we have decided to optimize some usages of the power function
>>>>> in our own front-end, so that LLVM IR gets an already optimized form,
>>>>> as we have found that otherwise LLVM may miss some optimization
>>>>> opportunities. YMMV.
>>>>
>>>> It seems to me that it would be more interesting to gather these missed
>>>> optimizations and fix LLVM to catch them.
>>>>
>>>>> (e.g. we detect that the exponent is a compile-time constant and
>>>>> transform `x**3` into `x*x*x`)
>>>>
>>>> This seems definitely in the scope of what LLVM could do, potentially with TTI.
>>>
>>> LLVM already does this... but only if the pow() call is marked "fast". IEEE 754 pow() is supposed to be correctly rounded, but (x*x)*x has an extra rounding step.
>>
>> pow( ) is not supposed to be correctly rounded.
>>
>> IEEE 754 recommends that a correctly rounded power function exist [9.2], but does not recommend or require that this be the default pow( ) function. [Note: it was not actually shown until quite late in the IEEE 754 revision process that it was even possible to have a reasonably efficient correctly-rounded pow( ) function, and such a function is at minimum an order of magnitude slower than a good sub-ulp-accurate-but-not-correctly-rounded implementation. Most 754 committee members would recommend that the default pow( ) function *not* be correctly rounded.]
>
> So we should use x*x*x even without fast-math...
>
>> In a good math library, however, pow( ) should absolutely be sub-ulp accurate, which means that it should *not* be implemented via log( ) and exp( ), and indeed, most math libraries don’t do that.
>>
>> Just to provide some example data, for single-precision x in [1,2) on current OS X:
>>
>> The worst-case error of x*x*x is 1.28736 ulp
>> The worst-case error of powf(x, 3) is 0.500013 ulp
>>
>> The RMS error of x*x*x is 0.585066 ulp
>> The RMS error of powf(x, 3) is 0.499984 ulp
>
> … or maybe not! :)
>
> I’m not sure what to think of these results. For instance, what’s the perf impact of calling into powf(x, 3) on OS X vs using x*x*x?

Large! x*x*x has a reciprocal throughput of 1 cycle, plus the opportunity for the compiler to autovectorize. powf(x, 3) is much harder to vectorize, and has a reciprocal throughput of ~40 cycles on OS X (which is on the faster side; it’s more like 100 cycles on some platforms).

> At the source level, what could the user do to decide that 1.28 ulp is acceptable for his use-case?

Write x*x*x. =)

I can see an argument for adding a flag that allows a few ulp of error in math functions (roughly what Intel’s LA versions in MKL or some of the YEPPP implementations provide), but there is a question of how many users would actually find it and use it.

– Steve
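For readers who want to check numbers like these themselves, here is a rough sketch of one way to measure them: exhaustively sweep every single-precision x in [1,2) and compare x*x*x and the float pow against a double-precision reference for x^3. The methodology is an assumption (Steve's actual measurement setup is not given in the thread), and the double reference is itself not perfectly rounded, so the output should be treated as approximate.

====
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
  // Exhaustive sweep of every float in [1, 2); the double product serves as a
  // near-exact (but not perfectly rounded) reference for x^3.
  double max_mul = 0.0, max_pow = 0.0, sq_mul = 0.0, sq_pow = 0.0;
  std::uint64_t n = 0;
  for (float x = 1.0f; x < 2.0f; x = std::nextafterf(x, 2.0f)) {
    const double ref = (double)x * (double)x * (double)x;
    const float fref = (float)ref;
    // Spacing between consecutive floats at the reference value.
    const double ulp = (double)std::nextafterf(fref, INFINITY) - (double)fref;
    const double err_mul = std::fabs((double)(x * x * x) - ref) / ulp;
    // std::pow(float, float) is the float overload, i.e. powf.
    const double err_pow = std::fabs((double)std::pow(x, 3.0f) - ref) / ulp;
    max_mul = std::max(max_mul, err_mul);
    max_pow = std::max(max_pow, err_pow);
    sq_mul += err_mul * err_mul;
    sq_pow += err_pow * err_pow;
    ++n;
  }
  std::printf("x*x*x     : worst %.6f ulp, RMS %.6f ulp\n", max_mul, std::sqrt(sq_mul / n));
  std::printf("powf(x, 3): worst %.6f ulp, RMS %.6f ulp\n", max_pow, std::sqrt(sq_pow / n));
  return 0;
}
====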
Mehdi Amini via llvm-dev
2017-Jan-12 19:37 UTC
[llvm-dev] The most efficient way to implement an integer based power function pow in LLVM
> On Jan 12, 2017, at 11:31 AM, Steve (Numerics) Canon <scanon at apple.com> wrote:
>
> [...]
>
>> I’m not sure what to think of these results. For instance, what’s the perf impact of calling into powf(x, 3) on OS X vs using x*x*x?
>
> Large! x*x*x has a reciprocal throughput of 1 cycle, plus the opportunity for the compiler to autovectorize. powf(x, 3) is much harder to vectorize, and has a reciprocal throughput of ~40 cycles on OS X (which is on the faster side; it’s more like 100 cycles on some platforms).
>
>> At the source level, what could the user do to decide that 1.28 ulp is acceptable for his use-case?
>
> Write x*x*x. =)

Ahah! Just to be sure (even though I’m sure you already know my thoughts on this), the case I’m thinking about is a function calling powf(x, y) where, after inlining and other optimizations, y is identified as 3.

> I can see an argument for adding a flag that allows a few ulp of error in math functions (roughly what Intel’s LA versions in MKL or some of the YEPPP implementations provide), but there is a question of how many users would actually find it and use it.

Yeah, I didn’t imagine any good way to make it discoverable and easy to use, unfortunately :(

—
Mehdi
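To make the scenario concrete, here is a hypothetical illustration of the kind of code Mehdi describes; the function names are made up for the example, and whether LLVM actually expands the call into x*x*x after inlining depends on fast-math settings (without them the libcall is kept, for the accuracy reasons discussed above).

====
#include <cmath>

// Hypothetical helper: the exponent is a runtime parameter at this level.
static inline float scaled_power(float x, float exponent) {
  return std::pow(x, exponent);
}

float cube(float x) {
  // After inlining, the optimizer sees pow(x, 3.0f) with a constant exponent:
  // the case where expansion into x*x*x becomes possible, subject to the
  // accuracy trade-off discussed in this thread.
  return scaled_power(x, 3.0f);
}
====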
François Fayard via llvm-dev
2017-Jan-12 22:03 UTC
[llvm-dev] The most efficient way to implement an integer based power function pow in LLVM
As I work as a consultant for high performance computing, I have come across a lot of these problems. I'll give my thoughts on it:

- The IEEE 754-2008 standard defines the rounding modes for +, -, *, / and sqrt, which should give the result of those operations rounded to -infinity, +infinity, towards 0, or to the nearest floating point, the default mode being rounding to nearest.
- For all other functions, I would say that "it is open bar" for the platform. The reason is that even though it is possible to write libraries such that, given a double x, std::exp(x) will be the exact value rounded to the nearest floating point, these algorithms are extremely costly. Most people don't need such accuracy, and an "engineering tradeoff" has to be made. Think, for instance, about this program:

====
#include <iostream>
#include <cmath>

int main() {
  double x = 3.14159265358979323846;
  std::cout << std::sin(x) << std::endl;
  return 0;
}
====

and run it on different platforms. I am sure you'll get a lot of fun :-) You'll quickly discover that you can't even trust the first digit given by your program after applying just one transcendental function. One might think that computing x^n accurately for integers n is simple, but it is not. Check http://perso.ens-lyon.fr/jean-michel.muller/p1-Kornerup.pdf if you are interested.

As a consequence:

- Forget about getting the exact same floating point results from one platform to another unless you only use +, -, *, /, sqrt.
- Even if you do that, you should disable vectorization to get reproducible results. Vectorizing a reduction (for instance computing the sum of the elements of a vector) won't give you the same result as the same loop left unvectorized, because + is not associative. Worse, a vectorized loop might not give you the same answer from run to run, because malloc does not always align to 32 bytes, which affects the loop peeling and therefore the result.
- When it comes to multithreading, especially with libraries which dispatch the work at runtime such as TBB or OpenMP without a static schedule, you also don't get the same result from one run to another on the same machine.
- Worse, even without vectorization and multithreading, compilers might reorder the sum to benefit from instruction-level parallelism.

So high performance computing and correct rounding are just enemies of one another. For the very few who need the guarantee of correct rounding (mainly people doing academic research on floating point arithmetic), you understand that using pow is not even an option, so they don't care about how it is handled. For the rest of us, I think we should use the most efficient way to compute x^n: divide and conquer.

I do believe that std::pow is completely flawed anyway (it's my opinion of most of the C++ standard library, but I'll try not to rant about it). There was a "float std::pow(float x, int n)" overload until C++11, but it disappeared!!! (check cppreference). Now both x and n have to be converted to double before anything happens. Even the compilers which are smart enough to convert std::pow(x, 2.0) to x * x end up working with doubles instead of floats, lose time in the conversion, and kill the performance by a factor of 2 because they can load half as many numbers into the vector registers. So I strongly recommend not using it. Instead, I recommend defining

template <typename T, int n> T il::ipow(T x)

and specializing it for small values of n (up to 10 for instance); a sketch of one possible implementation follows.
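Here is a minimal sketch of what such an ipow could look like; it is not the actual il::ipow. The template parameters are reordered (n first) so that T can be deduced from the argument, and it uses C++17 `if constexpr` recursion instead of hand-written specializations, but the result is the same divide-and-conquer expansion.

====
namespace il {

// Compile-time exponentiation by squaring; this sketch handles n >= 0 only.
template <int n, typename T>
T ipow(T x) {
  static_assert(n >= 0, "negative exponents not handled in this sketch");
  if constexpr (n == 0) {
    return T(1);
  } else if constexpr (n == 1) {
    return x;
  } else {
    T half = ipow<n / 2>(x);   // divide and conquer: reuse x^(n/2)
    if constexpr (n % 2 == 0) {
      return half * half;      // x^n = (x^(n/2))^2 for even n
    } else {
      return half * half * x;  // one extra multiply for odd n
    }
  }
}

}  // namespace il
====

With this, il::ipow<5>(x) expands to u = x * x followed by u * u * x, which is exactly the sequence described in the next paragraph.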
If you like that game, you can even do template meta-programming to define il::ipow (I personally specialized it for values up to 10 and wrote a while loop for n larger). That way, you'll be sure that when you call il::ipow<5>(x), it will compute u = x * x and return u * u * x.

I think the right thing to do is to teach people about these problems, and to tell them how wrong they are when they compare two floating-point numbers with x == y or, even worse, write unit tests that expect an exact floating-point result without any accuracy tolerance. I have seen so many of these tests in different industries that it is scary. And when their program gives two completely different answers from one platform to another, most of the time (in my experience, always) it is because the algorithm is unstable, so both results are wrong.

So my two cents: if you care about performance, write your own ipow. If you think you need accuracy up to 0.5 ulp, I'll say that I am very reluctant to believe you, and I'll advise you to use specialized libraries. This presentation gives a nice review of them: https://indico.cern.ch/event/313684/contributions/1687762/attachments/600507/826484/FDD-2014-CERN1.pdf

Sent from my iPad
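For completeness, here is a sketch of the kind of "while loop for larger n" mentioned in the message above, i.e. run-time exponentiation by squaring; again this is an illustration under the same assumptions, not the author's actual code.

====
// Run-time exponentiation by squaring; this sketch handles n >= 0 only.
template <typename T>
T ipow_runtime(T x, int n) {
  T result = T(1);
  while (n > 0) {
    if (n % 2 == 1) result *= x;  // fold in the current bit of the exponent
    x *= x;                       // square the base
    n /= 2;
  }
  return result;
}
====

For example, ipow_runtime(x, 12) needs far fewer multiplications than the naive eleven of x*x*...*x.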