Mehdi Amini via llvm-dev
2017-Jan-03 17:13 UTC
[llvm-dev] Automatic Insertion of OpenACC/OpenMP directives
> On Jan 3, 2017, at 7:17 AM, Jonathan Roelofs <jonathan at codesourcery.com> wrote:
>
> On 12/31/16 12:37 PM, Fernando Magno Quintao Pereira via llvm-dev wrote:
>> Dear Mehdi,
>>
>> I've changed your example a little bit:
>>
>> float saxpy(float a, float *x, float *y, int n) {
>>   int j = 0;
>>   for (int i = 0; i < n; ++i) {
>>     y[j] = a*x[i] + y[I]; // Change 'I' into 'j'?
>>     ++j;
>>   }
>> }
>>
>> I get the code below once I replace 'I' with 'j'. We are copying n
>> positions of both arrays, 'x' and 'y':
>>
>> float saxpy(float a, float *x, float *y, int n) {
>>   int j = 0;
>>
>>   long long int AI1[6];
>>   AI1[0] = n + -1;
>>   AI1[1] = 4 * AI1[0];
>>   AI1[2] = AI1[1] + 4;
>>   AI1[3] = AI1[2] / 4;
>>   AI1[4] = (AI1[3] > 0);
>>   AI1[5] = (AI1[4] ? AI1[3] : 0);
>>   #pragma acc data pcopy(x[0:AI1[5]],y[0:AI1[5]])
>>   #pragma acc kernels
>>   for (int i = 0; i < n; ++i) {
>>     y[j] = a * x[i] + y[j];
>>     ++j;
>>   }
>> }
>
> I'm not familiar with OpenACC, but doesn't this still have a loop-carried
> dependence on j, and therefore isn't correctly parallelizable as written?

That was my original concern as well, but I had forgotten that OpenACC pragmas
do not necessarily tell the compiler that the loop is parallel:

  #pragma acc kernels

only tells the compiler to “try” to parallelize the loop if it can prove that
safe, whereas:

  #pragma acc parallel loop

bypasses the compiler's checks and forces parallelization. The tool takes care
of figuring out the sizes of the arrays, AFAIK (I haven't read the paper yet,
so I don't know what is novel in the approach here).

— 
Mehdi
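(For readers unfamiliar with OpenACC, a minimal sketch of the contrast Mehdi
describes; `scale_checked` and `scale_forced` are hypothetical names, not
part of the thread:)

/* Compile with an OpenACC compiler, e.g. pgcc -acc scale.c */

void scale_checked(float a, float *x, int n) {
  /* "kernels": the compiler parallelizes the loop only if its own
     analysis can prove the iterations independent; otherwise it
     emits sequential code. */
  #pragma acc kernels
  for (int i = 0; i < n; ++i)
    x[i] = a * x[i];
}

void scale_forced(float a, float *x, int n) {
  /* "parallel loop": the programmer asserts the iterations are
     independent; no safety analysis is performed, so an overlooked
     loop-carried dependence here would silently become a race. */
  #pragma acc parallel loop
  for (int i = 0; i < n; ++i)
    x[i] = a * x[i];
}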
Fernando Magno Quintao Pereira via llvm-dev
2017-Jan-03 17:19 UTC
[llvm-dev] Automatic Insertion of OpenACC/OpenMP directives
Dear Jonathan and Mehdi,

   you are right. To extend the ability of the OpenACC-compliant compiler, we
use a technique called restrictification to disambiguate pointers at runtime.
So, if you use the program below as input in http://cuda.dcc.ufmg.br/dawn:

float saxpy(float a, float *x, float *y, int n) {
  int j = 0;
  for (int i = 0; i < n; ++i) {
    y[j] = a*x[i] + y[j];
    ++j;
  }
}

then the final program that you obtain is this one:

float saxpy(float a, float *x, float *y, int n) {
  int j = 0;
  long long int AI1[6];
  AI1[0] = n + -1;
  AI1[1] = 4 * AI1[0];
  AI1[2] = AI1[1] + 4;
  AI1[3] = AI1[2] / 4;
  AI1[4] = (AI1[3] > 0);
  AI1[5] = (AI1[4] ? AI1[3] : 0);
  char RST_AI1 = 0;
  RST_AI1 |= !((x + 0 > y + AI1[5]) ||
               (y + 0 > x + AI1[5])); // Restrictification test!
  #pragma acc data pcopyin(x[0:AI1[5]]) pcopy(y[0:AI1[5]]) if(!RST_AI1)
  #pragma acc kernels if(!RST_AI1) // See the test being used here!
  for (int i = 0; i < n; ++i) {
    y[j] = a * x[i] + y[j];
    ++j;
  }
}

   The input kernel, saxpy, is a map in the absence of aliasing. Thus, it is
parallelizable, assuming no aliasing between arrays x and y. So, how does
DawnCC handle aliasing? Our tool does it via a restrictification check. In the
code above, this check is encoded in the variable RST_AI1, which holds the
result of a test ensuring that the extents of arrays x and y do not overlap.
LLVM itself uses a similar form of restrictification to carry out
vectorization. Our checks follow the technique described in this paper:

   Pericles Rafael Alves, Fabian Gruber, Johannes Doerfert, Alexandros
   Labrineas, Tobias Grosser, Fabrice Rastello and Fernando Magno Quintão
   Pereira. Runtime Pointer Disambiguation. OOPSLA 2015.

   In this example, aliasing does not occur as long as x + n <= y (or
vice-versa). DawnCC can create restrictification checks as long as all the
variables involved encode affine expressions of induction variables.

Regards,

Fernando
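(An aside on the generated bounds: for this kernel AI1[5] simplifies to
max(n, 0), since (4*(n-1) + 4) / 4 == n. The overlap test itself can also be
written by hand, independently of DawnCC's output. Below is a minimal sketch,
assuming contiguous float arrays and with j rewritten in terms of i;
`no_overlap` and `saxpy_dispatch` are hypothetical names, and OpenMP is used
only as one concrete spelling of the parallel version:)

#include <stdint.h>

/* Runtime pointer disambiguation: returns nonzero iff x[0:n] and
   y[0:n] occupy disjoint memory. */
static int no_overlap(const float *x, const float *y, int n) {
  uintptr_t xb = (uintptr_t)x, yb = (uintptr_t)y;
  uintptr_t len = (uintptr_t)n * sizeof(float);
  return (xb + len <= yb) || (yb + len <= xb);
}

void saxpy_dispatch(float a, float *x, float *y, int n) {
  if (no_overlap(x, y, n)) {
    /* Arrays proven disjoint at runtime: parallel version is safe. */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
      y[i] = a * x[i] + y[i];
  } else {
    /* Possible aliasing: fall back to the sequential loop. */
    for (int i = 0; i < n; ++i)
      y[i] = a * x[i] + y[i];
  }
}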
Jonathan Roelofs via llvm-dev
2017-Jan-03 18:02 UTC
[llvm-dev] Automatic Insertion of OpenACC/OpenMP directives
On 1/3/17 10:19 AM, Fernando Magno Quintao Pereira wrote:
> Dear Jonathan and Mehdi,
>
>    you are right. To extend the ability of the OpenACC-compliant compiler,
> we use a technique called restrictification to disambiguate pointers at
> runtime. So, if you use the program below as input in
> http://cuda.dcc.ufmg.br/dawn:
>
> float saxpy(float a, float *x, float *y, int n) {
>   int j = 0;
>   for (int i = 0; i < n; ++i) {
>     y[j] = a*x[i] + y[j];
>     ++j;
>   }
> }

I think I'm still not clearly getting my point across. Let me explain with a
simpler example. Suppose this were the input:

void ascend1(int *y, int n) {
  int j = 0;
  for (int i = 0; i < n; ++i) {
    y[j] = j;
    j++;
  }
}

One would expect the entries of the array y to contain: y[0] = 0, y[1] = 1,
y[2] = 2, etc. However, if one were to annotate it with (in OpenMP, as I
don't know the semantics of OpenACC):

void ascend2(int *y, int n) {
  int j = 0;
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    y[j] = j;
    j++;
  }
}

now there is a race, because the value of j in each iteration of the loop
depends on the previous iteration's value of it. Now compare that to:

void ascend3(int *y, int n) {
  for (int i = 0; i < n; ++i) {
    y[i] = i;
  }
}

This one does not have the loop-carried dependence on the variable used for
indexing, so it is indeed safe to parallelize.

Mehdi's comment was that the LLVM IR for `ascend1()` and `ascend2()` will
look almost the same (at least after optimization), so you'll need some way
to recover the information telling you that `ascend1()`'s source is *not*
parallelizable with the annotations, but that `ascend3()` *is*.

Jon

-- 
Jon Roelofs
jonathan at codesourcery.com
CodeSourcery / Mentor Embedded
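(A footnote on Jon's point: ascend3 is safe precisely because the induction
variable j of ascend1 has been rewritten in terms of i. A tool can restore
safety the same way before annotating; a minimal sketch, where
`ascend2_fixed` is a hypothetical name:)

void ascend2_fixed(int *y, int n) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    int j = i; /* j recomputed from i: the cross-iteration update is gone */
    y[j] = j;
  }
}

After this induction-variable substitution, each iteration touches only
y[i], so the parallel-for annotation no longer introduces a race.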