Steve (Numerics) Canon via llvm-dev
2017-Jan-12 19:31 UTC
[llvm-dev] The most efficient way to implement an integer based power function pow in LLVM
> On Jan 12, 2017, at 2:21 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
>> On Jan 12, 2017, at 11:04 AM, Steve (Numerics) Canon <scanon at apple.com> wrote:
>>
>>> On Jan 12, 2017, at 12:58 PM, Friedman, Eli via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>
>>> On 1/12/2017 9:33 AM, Mehdi Amini via llvm-dev wrote:
>>>>> On Jan 12, 2017, at 5:03 AM, Antoine Pitrou via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>> On Mon, 9 Jan 2017 11:43:17 -0600
>>>>> Wei Ding via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I want an efficient way to implement the pow function in LLVM instead of
>>>>>> invoking the pow() math built-in. I am clear on the algorithm itself, but
>>>>>> I am not quite sure which parts of LLVM I should change to replace the
>>>>>> built-in pow with a more efficient implementation. Any comments and
>>>>>> feedback are appreciated. Thanks!
>>>>>
>>>>> In Numba, we have decided to optimize some usages of the power function
>>>>> in our own front-end, so that LLVM IR gets an already optimized form,
>>>>> as we have found that otherwise LLVM may miss some optimization
>>>>> opportunities. YMMV.
>>>>
>>>> It seems to me that it would be more interesting to gather these missed
>>>> optimizations and fix LLVM to catch them.
>>>>
>>>>> (e.g. we detect that the exponent is a compile-time constant and
>>>>> transform `x**3` into `x*x*x`)
>>>>
>>>> This seems definitely in the scope of what LLVM could do, potentially with TTI.
>>>
>>> LLVM already does this... but only if the pow() call is marked "fast". IEEE 754 pow() is supposed to be correctly rounded, but (x*x)*x has an extra rounding step.
>>
>> pow( ) is not supposed to be correctly rounded.
>>
>> IEEE 754 recommends that a correctly rounded power function exist [9.2], but does not recommend or require that this be the default pow( ) function. [Note: it was not actually shown until quite late in the IEEE 754 revision process that it was even possible to have a reasonably efficient correctly-rounded pow( ) function, and such a function is at minimum an order of magnitude slower than a good sub-ulp-accurate-but-not-correctly-rounded implementation. Most 754 committee members would recommend that the default pow( ) function *not* be correctly rounded.]
>
> So we should use x*x*x even without fast-math...
>
>> In a good math library, however, pow( ) should absolutely be sub-ulp accurate, which means that it should *not* be implemented via log( ) and exp( ), and indeed, most math libraries don’t do that.
>>
>> Just to provide some example data, for single-precision x in [1,2) on current OS X:
>>
>> The worst-case error of x*x*x is 1.28736 ulp
>> The worst-case error of powf(x, 3) is 0.500013 ulp
>>
>> The RMS error of x*x*x is 0.585066 ulp
>> The RMS error of powf(x, 3) is 0.499984 ulp
>
> … or maybe not! :)
>
> I’m not sure what to think of these results. For instance, what’s the perf impact of calling into powf(x, 3) on OS X vs using x*x*x?

Large! x*x*x has a reciprocal throughput of 1 cycle, plus the opportunity for the compiler to autovectorize. powf(x, 3) is much harder to vectorize, and has a reciprocal throughput of ~40 cycles on OS X (which is on the faster side; it’s more like 100 cycles on some platforms).

> At the source level, what could the user do to decide that 1.28 ulp is acceptable for his use-case?

Write x*x*x. =)

I can see an argument for adding a flag that allows a few ulp of error in math functions (roughly what Intel’s LA versions in MKL or some of the YEPPP implementations provide), but there is a question of how many users would actually find it and use it.

– Steve
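For readers who want to check numbers like these themselves, here is a rough sketch of one way to measure them: exhaustively sweep every single-precision x in [1,2) and compare x*x*x and the float pow against a double-precision reference for x^3. The methodology is an assumption (Steve's actual measurement setup is not given in the thread), and the double reference is itself not perfectly rounded, so the output should be treated as approximate.

====
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
  // Exhaustive sweep of every float in [1, 2); the double product serves as a
  // near-exact (but not perfectly rounded) reference for x^3.
  double max_mul = 0.0, max_pow = 0.0, sq_mul = 0.0, sq_pow = 0.0;
  std::uint64_t n = 0;
  for (float x = 1.0f; x < 2.0f; x = std::nextafterf(x, 2.0f)) {
    const double ref = (double)x * (double)x * (double)x;
    const float fref = (float)ref;
    // Spacing between consecutive floats at the reference value.
    const double ulp = (double)std::nextafterf(fref, INFINITY) - (double)fref;
    const double err_mul = std::fabs((double)(x * x * x) - ref) / ulp;
    // std::pow(float, float) is the float overload, i.e. powf.
    const double err_pow = std::fabs((double)std::pow(x, 3.0f) - ref) / ulp;
    max_mul = std::max(max_mul, err_mul);
    max_pow = std::max(max_pow, err_pow);
    sq_mul += err_mul * err_mul;
    sq_pow += err_pow * err_pow;
    ++n;
  }
  std::printf("x*x*x     : worst %.6f ulp, RMS %.6f ulp\n", max_mul, std::sqrt(sq_mul / n));
  std::printf("powf(x, 3): worst %.6f ulp, RMS %.6f ulp\n", max_pow, std::sqrt(sq_pow / n));
  return 0;
}
====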
Mehdi Amini via llvm-dev
2017-Jan-12 19:37 UTC
[llvm-dev] The most efficient way to implement an integer based power function pow in LLVM
> On Jan 12, 2017, at 11:31 AM, Steve (Numerics) Canon <scanon at apple.com> wrote:
>
> [...]
>
>> I’m not sure what to think of these results. For instance, what’s the perf impact of calling into powf(x, 3) on OS X vs using x*x*x?
>
> Large! x*x*x has a reciprocal throughput of 1 cycle, plus the opportunity for the compiler to autovectorize. powf(x, 3) is much harder to vectorize, and has a reciprocal throughput of ~40 cycles on OS X (which is on the faster side; it’s more like 100 cycles on some platforms).
>
>> At the source level, what could the user do to decide that 1.28 ulp is acceptable for his use-case?
>
> Write x*x*x. =)

Ahah! Just to be sure (even though I’m sure you already know my thoughts on this), the case I’m thinking about is a function calling powf(x, y) where, after inlining and other optimizations, y is identified as 3.

> I can see an argument for adding a flag that allows a few ulp of error in math functions (roughly what Intel’s LA versions in MKL or some of the YEPPP implementations provide), but there is a question of how many users would actually find it and use it.

Yeah, I didn’t imagine any good way to make it discoverable and easy to use, unfortunately :(

—
Mehdi
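To make the scenario concrete, here is a hypothetical illustration of the kind of code Mehdi describes; the function names are made up for the example, and whether LLVM actually expands the call into x*x*x after inlining depends on fast-math settings (without them the libcall is kept, for the accuracy reasons discussed above).

====
#include <cmath>

// Hypothetical helper: the exponent is a runtime parameter at this level.
static inline float scaled_power(float x, float exponent) {
  return std::pow(x, exponent);
}

float cube(float x) {
  // After inlining, the optimizer sees pow(x, 3.0f) with a constant exponent:
  // the case where expansion into x*x*x becomes possible, subject to the
  // accuracy trade-off discussed in this thread.
  return scaled_power(x, 3.0f);
}
====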
François Fayard via llvm-dev
2017-Jan-12 22:03 UTC
[llvm-dev] The most efficient way to implement an integer based power function pow in LLVM
As I work as a consultant for high performance computing, I have come across a lot of these problems. I'll give my thoughts on it:

- The IEEE 754-2008 standard defines the rounding modes for +, -, *, / and sqrt, which should give the result of those operations rounded to -infinity, +infinity, towards 0, or to the nearest floating point, the default mode being rounding to nearest.
- For all other functions, I would say that "it is open bar" for the platform. The reason is that even though it is possible to write libraries such that, given a double x, std::exp(x) will be the exact value rounded to the nearest floating point, these algorithms are extremely costly. Most people don't need such accuracy, and an "engineering tradeoff" has to be made. Think, for instance, about this program:

====
#include <iostream>
#include <cmath>

int main() {
  double x = 3.14159265358979323846;
  std::cout << std::sin(x) << std::endl;
  return 0;
}
====

and run it on different platforms. I am sure you'll get a lot of fun :-) You'll quickly discover that you can't even trust the first digit given by your program after applying just one transcendental function. One might think that computing x^n accurately for integers n is simple, but it is not. Check http://perso.ens-lyon.fr/jean-michel.muller/p1-Kornerup.pdf if you are interested.

As a consequence:

- Forget about getting the exact same floating point results from one platform to another unless you only use +, -, *, /, sqrt.
- Even if you do that, you should disable vectorization to get reproducible results. Vectorizing a reduction (for instance computing the sum of the elements of a vector) won't give you the same result as the same loop left unvectorized, because + is not associative. Worse, a vectorized loop might not give you the same answer from run to run, because malloc does not always align to 32 bytes, which affects the loop peeling and therefore the result.
- When it comes to multithreading, especially with libraries which dispatch the work at runtime such as TBB or OpenMP without a static schedule, you also don't get the same result from one run to another on the same machine.
- Worse, even without vectorization and multithreading, compilers might reorder the sum to benefit from instruction-level parallelism.

So high performance computing and correct rounding are just enemies of one another. For the very few who need the guarantee of correct rounding (mainly people doing academic research on floating point arithmetic), you understand that using pow is not even an option, so they don't care about how it is handled. For the rest of us, I think we should use the most efficient way to compute x^n: divide and conquer.

I do believe that std::pow is completely flawed anyway (it's my opinion of most of the C++ standard library, but I'll try not to rant about it). There was a "float std::pow(float x, int n)" overload until C++11, but it disappeared!!! (check cppreference). Now both x and n have to be converted to double before anything happens. Even the compilers which are smart enough to convert std::pow(x, 2.0) to x * x end up working with doubles instead of floats, lose time in the conversion, and kill the performance by a factor of 2 because they can load half as many numbers into the vector registers. So I strongly recommend not using it. Instead, I recommend defining

template <typename T, int n> T il::ipow(T x)

and specializing it for small values of n (up to 10 for instance); a sketch of one possible implementation follows.
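Here is a minimal sketch of what such an ipow could look like; it is not the actual il::ipow. The template parameters are reordered (n first) so that T can be deduced from the argument, and it uses C++17 `if constexpr` recursion instead of hand-written specializations, but the result is the same divide-and-conquer expansion.

====
namespace il {

// Compile-time exponentiation by squaring; this sketch handles n >= 0 only.
template <int n, typename T>
T ipow(T x) {
  static_assert(n >= 0, "negative exponents not handled in this sketch");
  if constexpr (n == 0) {
    return T(1);
  } else if constexpr (n == 1) {
    return x;
  } else {
    T half = ipow<n / 2>(x);   // divide and conquer: reuse x^(n/2)
    if constexpr (n % 2 == 0) {
      return half * half;      // x^n = (x^(n/2))^2 for even n
    } else {
      return half * half * x;  // one extra multiply for odd n
    }
  }
}

}  // namespace il
====

With this, il::ipow<5>(x) expands to u = x * x followed by u * u * x, which is exactly the sequence described in the next paragraph.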
If you like that game, you can even do template meta-programming to define il::ipow (I personally specialized it for values up to 10 and wrote a while loop for n larger). That way, you'll be sure that when you call il::ipow<5>(x), it will compute u = x * x and return u * u * x.

I think the right thing to do is to teach people about these problems, and to tell them how wrong they are when they compare two floating-point numbers with x == y or, even worse, write unit tests that expect an exact floating-point result without any accuracy tolerance. I have seen so many of these tests in different industries that it is scary. And when their program gives two completely different answers from one platform to another, most of the time (in my experience, always) it is because the algorithm is unstable, so both results are wrong.

So my two cents: if you care about performance, write your own ipow. If you think you need accuracy up to 0.5 ulp, I'll say that I am very reluctant to believe you, and I'll advise you to use specialized libraries. This presentation gives a nice review of them: https://indico.cern.ch/event/313684/contributions/1687762/attachments/600507/826484/FDD-2014-CERN1.pdf

Sent from my iPad
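For completeness, here is a sketch of the kind of "while loop for larger n" mentioned in the message above, i.e. run-time exponentiation by squaring; again this is an illustration under the same assumptions, not the author's actual code.

====
// Run-time exponentiation by squaring; this sketch handles n >= 0 only.
template <typename T>
T ipow_runtime(T x, int n) {
  T result = T(1);
  while (n > 0) {
    if (n % 2 == 1) result *= x;  // fold in the current bit of the exponent
    x *= x;                       // square the base
    n /= 2;
  }
  return result;
}
====

For example, ipow_runtime(x, 12) needs far fewer multiplications than the naive eleven of x*x*...*x.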