Wei Ding via llvm-dev
2017-Jan-09 17:43 UTC
[llvm-dev] The most efficient way to implement an integer based power function pow in LLVM
Hi, I want an efficient way to implement function pow in LLVM instead of invoking pow() math built-in. For algorithm part, I am clear for the logic. But I am not quite sure for which parts of LLVM should I replace built-in pow with another efficient pow implementation. Any comments and feedback are appreciated. Thanks! -- Wei Ding -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170109/69fc4143/attachment.html>
Friedman, Eli via llvm-dev
2017-Jan-09 18:19 UTC
[llvm-dev] The most efficient way to implement an integer based power function pow in LLVM
On 1/9/2017 9:43 AM, Wei Ding via llvm-dev wrote:> Hi, > > I want an efficient way to implement function pow in LLVM instead of > invoking pow() math built-in. For algorithm part, I am clear for the > logic. But I am not quite sure for which parts of LLVM should I > replace built-in pow with another efficient pow implementation. Any > comments and feedback are appreciated. Thanks!The llvm.pow() intrinsic just calls pow() from libm; if you want to replace it with a different implementation, you can just write a function called pow() and link it into your program. If you're looking for the part of LLVM which optimizes pow() calls, see LibCallSimplifier::optimizePow in lib/Transforms/Utils/SimplifyLibCalls.cpp. -Eli -- Employee of Qualcomm Innovation Center, Inc. Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Antoine Pitrou via llvm-dev
2017-Jan-12 13:03 UTC
[llvm-dev] The most efficient way to implement an integer based power function pow in LLVM
On Mon, 9 Jan 2017 11:43:17 -0600 Wei Ding via llvm-dev <llvm-dev at lists.llvm.org> wrote:> Hi, > > I want an efficient way to implement function pow in LLVM instead of > invoking pow() math built-in. For algorithm part, I am clear for the logic. > But I am not quite sure for which parts of LLVM should I replace built-in > pow with another efficient pow implementation. Any comments and feedback > are appreciated. Thanks!In Numba, we have decided to optimize some usages of the power function in our own front-end, so that LLVM IR gets an already optimized form, as we have found that otherwise LLVM may miss some optimization opportunities. YMMV. (e.g. we detect that the exponent is a compile-time constant and transform `x**3` into `x*x*x`) Note that not only `pow(x, 3)` can be slower than `x*x*x`, but it may also have some precision issues, as it goes through log() and exp() calls. Regards Antoine.
Mehdi Amini via llvm-dev
2017-Jan-12 17:33 UTC
[llvm-dev] The most efficient way to implement an integer based power function pow in LLVM
> On Jan 12, 2017, at 5:03 AM, Antoine Pitrou via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > On Mon, 9 Jan 2017 11:43:17 -0600 > Wei Ding via llvm-dev <llvm-dev at lists.llvm.org> wrote: >> Hi, >> >> I want an efficient way to implement function pow in LLVM instead of >> invoking pow() math built-in. For algorithm part, I am clear for the logic. >> But I am not quite sure for which parts of LLVM should I replace built-in >> pow with another efficient pow implementation. Any comments and feedback >> are appreciated. Thanks! > > In Numba, we have decided to optimize some usages of the power function > in our own front-end, so that LLVM IR gets an already optimized form, > as we have found that otherwise LLVM may miss some optimization > opportunities. YMMV.It seems to me that it would be more interesting to gather these misoptimization and fix LLVM to catch them.> > (e.g. we detect that the exponent is a compile-time constant and > transform `x**3` into `x*x*x`)This seems definitely in the scope of what LLVM could do, potentially with TTI. — Mehdi> > Note that not only `pow(x, 3)` can be slower than `x*x*x`, but it may > also have some precision issues, as it goes through log() and exp() > calls. > > Regards > > Antoine. > > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Francois Fayard via llvm-dev
2017-Jan-13 08:32 UTC
[llvm-dev] The most efficient way to implement an integer based power function pow in LLVM
Bare in mind that writing a very efficient x^n is not that straightforward. Even if it’s a corner cases, the example of x^7 is interesting. If you apply the divide and rule algorithm, you get the following algorithm: double pow_7(double x) { const double x2 = x * x; const double x3 = x * x2; const double x6 = x3 * x3; const double x7 = x * x6; return x7; } But you can also write double pow_7(double x) { const double x2 = x * x; const double x3 = x * x2; const double x4 = x2 * x2; const double x7 = x3 * x4; return x7; } which is a different algorithm using the same number of application. But, because of instruction level parallelism, they don’t get the same performance. In the first implementation every single instruction has to wait for the previous instruction to start. In the second one x3 and x4 could be computed using instruction level parallelism as they don’t depend upon each other. The performance difference in between the two are (compiled with -O3 and -mavx2): - clang 3.9.1: The first one is 25% slower than the second one - intel 17.0.1: The first one is 05% slower than the second one - gcc 6.2.0: The first one is 3% slower than the first one (but both are 2 times slower than clang and intel) Attached is the code asserting my claim. François Fayard ==== #include <chrono> #include <iostream> int main() { const int n = 1000000000; const int length = 10; double* x = new double[length]; for (int i = 0; i < length; ++i) { x[i] = 1.0; } asm volatile("" : : "g"(x) : "memory"); auto begin = std::chrono::high_resolution_clock::now(); for (int k = 0; k < n; ++k) { for (int i = 0; i < length; ++i) { const double x1 = x[i]; const double x2 = x1 * x1; const double x3 = x1 * x2; const double x4 = x2 * x2; const double x7 = x3 * x4; x[i] = x7; } } auto end = std::chrono::high_resolution_clock::now(); double time_fast 1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count(); begin = std::chrono::high_resolution_clock::now(); for (int k = 0; k < n; ++k) { for (int i = 0; i < length; ++i) { const double x1 = x[i]; const double x2 = x1 * x1; const double x3 = x1 * x2; const double x6 = x3 * x3; const double x7 = x1 * x6; x[i] = x7; } } end = std::chrono::high_resolution_clock::now(); double time_divide_and_rule 1.0e-9 * std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count(); std::cout << "Optimized x^7: " << 1.0e9 * time_fast / (static_cast<double>(n) * length) << " ns" << std::endl; std::cout << " Classic x^7: " << 1.0e9 * time_divide_and_rule / (static_cast<double>(n) * length) << " ns" << std::endl; delete[] x; return 0; } ====> On Jan 9, 2017, at 6:43 PM, Wei Ding via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > Hi, > > I want an efficient way to implement function pow in LLVM instead of invoking pow() math built-in. For algorithm part, I am clear for the logic. But I am not quite sure for which parts of LLVM should I replace built-in pow with another efficient pow implementation. Any comments and feedback are appreciated. Thanks!
mats petersson via llvm-dev
2017-Jan-13 11:39 UTC
[llvm-dev] The most efficient way to implement an integer based power function pow in LLVM
Is the original question here not about pow(int, int), rather than pow(float, int)? In which case the problem about rounding or precision becomes less of an issue (of course, for larger values, there will be overflows instead - but that's still a problem if you do `int a = pow(x, y) where x**y is out of range for the int value). In my experience, it's not at all unlikely to find code like `int a pow(2, 3);` in code, often with the resulting questions about "why is my result 7, not 8", because even with 0.5 ulp, it's not (always) EXACTLY the integer value that comes out of the function... -- Mats On 13 January 2017 at 08:32, Francois Fayard via llvm-dev < llvm-dev at lists.llvm.org> wrote:> Bare in mind that writing a very efficient x^n is not that > straightforward. Even if it’s a corner cases, the example of x^7 is > interesting. If you apply the divide and rule algorithm, you get the > following algorithm: > > double pow_7(double x) { > const double x2 = x * x; > const double x3 = x * x2; > const double x6 = x3 * x3; > const double x7 = x * x6; > > return x7; > } > > But you can also write > > double pow_7(double x) { > const double x2 = x * x; > const double x3 = x * x2; > const double x4 = x2 * x2; > const double x7 = x3 * x4; > > return x7; > } > > which is a different algorithm using the same number of application. But, > because of instruction level parallelism, they don’t get the same > performance. In the first implementation every single instruction has to > wait for the previous instruction to start. In the second one x3 and x4 > could be computed using instruction level parallelism as they don’t depend > upon each other. > > The performance difference in between the two are (compiled with -O3 and > -mavx2): > - clang 3.9.1: The first one is 25% slower than the second one > - intel 17.0.1: The first one is 05% slower than the second one > - gcc 6.2.0: The first one is 3% slower than the first one (but both are 2 > times slower than clang and intel) > > Attached is the code asserting my claim. > > François Fayard > > ====> > #include <chrono> > #include <iostream> > > int main() { > const int n = 1000000000; > const int length = 10; > double* x = new double[length]; > for (int i = 0; i < length; ++i) { > x[i] = 1.0; > } > > asm volatile("" : : "g"(x) : "memory"); > > auto begin = std::chrono::high_resolution_clock::now(); > for (int k = 0; k < n; ++k) { > for (int i = 0; i < length; ++i) { > const double x1 = x[i]; > const double x2 = x1 * x1; > const double x3 = x1 * x2; > const double x4 = x2 * x2; > const double x7 = x3 * x4; > x[i] = x7; > } > } > auto end = std::chrono::high_resolution_clock::now(); > double time_fast > 1.0e-9 * > std::chrono::duration_cast<std::chrono::nanoseconds>(end - > begin).count(); > > begin = std::chrono::high_resolution_clock::now(); > for (int k = 0; k < n; ++k) { > for (int i = 0; i < length; ++i) { > const double x1 = x[i]; > const double x2 = x1 * x1; > const double x3 = x1 * x2; > const double x6 = x3 * x3; > const double x7 = x1 * x6; > x[i] = x7; > } > } > end = std::chrono::high_resolution_clock::now(); > double time_divide_and_rule > 1.0e-9 * > std::chrono::duration_cast<std::chrono::nanoseconds>(end - > begin).count(); > > std::cout << "Optimized x^7: " > << 1.0e9 * time_fast / (static_cast<double>(n) * length) << " > ns" > << std::endl; > std::cout << " Classic x^7: " > << 1.0e9 * time_divide_and_rule / (static_cast<double>(n) * > length) > << " ns" << std::endl; > > delete[] x; > > return 0; > } > > ====> > > On Jan 9, 2017, at 6:43 PM, Wei Ding via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > > > Hi, > > > > I want an efficient way to implement function pow in LLVM instead of > invoking pow() math built-in. For algorithm part, I am clear for the logic. > But I am not quite sure for which parts of LLVM should I replace built-in > pow with another efficient pow implementation. Any comments and feedback > are appreciated. Thanks! > > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170113/51caf9be/attachment.html>