thr3ads.net - llvm dev - [llvm-dev] Matrix Multiplication not Vectorized using double pointers [Oct 2019]

If this information is useful, please help other people find it:
Share via:

hameeza ahmed via llvm-dev

2019-Oct-20 07:31 UTC

[llvm-dev] Matrix Multiplication not Vectorized using double pointers

Hello,
My matrix multiplication code has variables allocated via double pointers
on heap. The code is not getting vectorized. Following is the code.

 int **buffer_A = (int **)malloc(vecsize * sizeof(int *));
 int **buffer_B = (int **)malloc(vecsize * sizeof(int *));
for(p = 0; p < vecsize; p++)
    {
         buffer_A[p] = (int *)malloc(vecsize * sizeof(int));
     }
for(p = 0; p < vecsize; p++)
    {
         buffer_B[p] = (int *)malloc(vecsize * sizeof(int));
     }
    long i, j, k;
 int **result = (int **)malloc(vecsize * sizeof(int *));
    for(p = 0; p < vecsize; p++)
    {
         result[p] = (int *)malloc(vecsize * sizeof(int));
     }
    for(i = 0; i < vecsize; i++)
    {
        for(j = 0; j < vecsize; j++)
        {
         result[i][j]=0;
            for(k = 0; k < vecsize; k++)
            {
                result[i][j] += buffer_A[i][k] * buffer_B[k][j];
            }
        }
    }
what is the issue? how to resolve it?

Thank You

Regards
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191020/c200cca4/attachment.html>

Fabian Giesen via llvm-dev

2019-Oct-20 23:23 UTC

head link

[llvm-dev] Matrix Multiplication not Vectorized using double pointers

Compile with optimization remarks on ("-Rpass-analysis=.*", or 
"-Rpass-missed=.*") to get some optimization remarks for why the 
optimizer is or is not handling certain constructs. (Warning, this is a 
lot, and it will often be hard to decode if you don't already know how 
the relevant optimization passes work.)

There are several problems here, but zooming out, there are just way too 
many pointers here that could all potentially alias. Proving statically 
that none of these pointers alias is easier said than done, especially 
when variable-sized arrays are involved. Even if you could handle that 
within a function (not trivial), the next problem after that is crossing 
function boundaries (if there's code between allocation and use), and 
that gets messy really quick.

If you have 2D arrays, then if at all possible, allocating them 
contiguously and use a single base pointer to access all elements of a 
given array. That's a form the loop analysis passes know how to deal 
with and it's going to go better, although you will probably still need 
some annotations to promise the compiler there's no aliasing. (Easier 
said than done; you can flag the result pointer as __restrict, but let's 
just say that this path is fraught with peril.)

After writing the whole thing using single base pointers, manual 2D 
array access and, I can get Clang 9.0.0 to vectorize the loop (for x86 
at -march=skylake which has cheap enough gathers), but it's still a very 
direct translation of the original code, which is not what you actually 
want.

Efficient vectorization of matrix multiplies is a well-studied problem, 
but that doesn't make it easy. Turning the basic three nested loops into 
a form that achieves high performance requires _many_ different 
transform steps. (Among other things, almost certainly turning the three 
nested loops into at least five, to process the data in small blocks, 
getting rid of the gathers and turning them into in-register 
transpositions.)

Getting state-of-the-art loop optimizations into LLVM has been an 
ongoing project for many years (-> "Polly", though that seems to
have
gotten quiet recently?). I believe LLVM+Polly would do a lot better on 
this particular example, but it's not enabled (or compiled in) for 
normal compiler releases.

-Fabian

On 10/20/2019 12:31 AM, hameeza ahmed via llvm-dev
wrote:> Hello,
> My matrix multiplication code has variables allocated via double 
> pointers on heap. The code is not getting vectorized. Following is the
code.
> 
>   int **buffer_A = (int **)malloc(vecsize * sizeof(int *));
>   int **buffer_B = (int **)malloc(vecsize * sizeof(int *));
> for(p = 0; p < vecsize; p++)
>      {
>           buffer_A[p] = (int *)malloc(vecsize * sizeof(int));
>       }
> for(p = 0; p < vecsize; p++)
>      {
>           buffer_B[p] = (int *)malloc(vecsize * sizeof(int));
>       }
>      long i, j, k;
>   int **result = (int **)malloc(vecsize * sizeof(int *));
>      for(p = 0; p < vecsize; p++)
>      {
>           result[p] = (int *)malloc(vecsize * sizeof(int));
>       }
>      for(i = 0; i < vecsize; i++)
>      {
>          for(j = 0; j < vecsize; j++)
>          {
>           result[i][j]=0;
>              for(k = 0; k < vecsize; k++)
>              {
>                  result[i][j] += buffer_A[i][k] * buffer_B[k][j];
>              }
>          }
>      }
> what is the issue? how to resolve it?
> 
> Thank You
> 
> Regards
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

Johannes Doerfert via llvm-dev

2020-Apr-21 18:33 UTC

head link

[llvm-dev] Matrix Multiplication not Vectorized using double pointers

On 10/20/19 6:23 PM, Fabian Giesen via llvm-dev wrote:
 > Compile with optimization remarks on ("-Rpass-analysis=.*", or 
"-Rpass-missed=.*") to get some optimization remarks for why the 
optimizer is or is not handling certain constructs. (Warning, this is a 
lot, and it will often be hard to decode if you don't already know how 
the relevant optimization passes work.)
 >
 > There are several problems here, but zooming out, there are just way 
too many pointers here that could all potentially alias. Proving 
statically that none of these pointers alias is easier said than done, 
especially when variable-sized arrays are involved. Even if you could 
handle that within a function (not trivial), the next problem after that 
is crossing function boundaries (if there's code between allocation and 
use), and that gets messy really quick.
 >
 > If you have 2D arrays, then if at all possible, allocating them 
contiguously and use a single base pointer to access all elements of a 
given array. That's a form the loop analysis passes know how to deal 
with and it's going to go better, although you will probably still need 
some annotations to promise the compiler there's no aliasing. (Easier 
said than done; you can flag the result pointer as __restrict, but let's 
just say that this path is fraught with peril.)
 >
 > After writing the whole thing using single base pointers, manual 2D 
array access and, I can get Clang 9.0.0 to vectorize the loop (for x86 
at -march=skylake which has cheap enough gathers), but it's still a very 
direct translation of the original code, which is not what you actually 
want.
 >
 > Efficient vectorization of matrix multiplies is a well-studied 
problem, but that doesn't make it easy. Turning the basic three nested 
loops into a form that achieves high performance requires _many_ 
different transform steps. (Among other things, almost certainly turning 
the three nested loops into at least five, to process the data in small 
blocks, getting rid of the gathers and turning them into in-register 
transpositions.)
 >
 > Getting state-of-the-art loop optimizations into LLVM has been an 
ongoing project for many years (-> "Polly", though that seems to
have
gotten quiet recently?). I believe LLVM+Polly would do a lot better on 
this particular example, but it's not enabled (or compiled in) for 
normal compiler releases.

FWIW, Polly performs pattern matching for matrix multiplications (only)
and replaces them with pretty good code. Without that the tiling Polly
applies is usually the most beneficial factor for code like this. That
said, the code below, with the indirection and missing local restrict
keywords* in the 3D loop, will not be optimized by Polly.

* Support to actually use local restrict keywords is still under
   development in LLVM.

Cheers,
   Johannes

 > -Fabian
 >
 > On 10/20/2019 12:31 AM, hameeza ahmed via llvm-dev wrote:
 >> Hello,
 >> My matrix multiplication code has variables allocated via double 
pointers on heap. The code is not getting vectorized. Following is the code.
 >>
 >>   int **buffer_A = (int **)malloc(vecsize * sizeof(int *));
 >>   int **buffer_B = (int **)malloc(vecsize * sizeof(int *));
 >> for(p = 0; p < vecsize; p++)
 >>      {
 >>           buffer_A[p] = (int *)malloc(vecsize * sizeof(int));
 >>       }
 >> for(p = 0; p < vecsize; p++)
 >>      {
 >>           buffer_B[p] = (int *)malloc(vecsize * sizeof(int));
 >>       }
 >>      long i, j, k;
 >>   int **result = (int **)malloc(vecsize * sizeof(int *));
 >>      for(p = 0; p < vecsize; p++)
 >>      {
 >>           result[p] = (int *)malloc(vecsize * sizeof(int));
 >>       }
 >>      for(i = 0; i < vecsize; i++)
 >>      {
 >>          for(j = 0; j < vecsize; j++)
 >>          {
 >>           result[i][j]=0;
 >>              for(k = 0; k < vecsize; k++)
 >>              {
 >>                  result[i][j] += buffer_A[i][k] * buffer_B[k][j];
 >>              }
 >>          }
 >>      }
 >> what is the issue? how to resolve it?
 >>
 >> Thank You
 >>
 >> Regards
 >>
 >> _______________________________________________
 >> LLVM Developers mailing list
 >> llvm-dev at lists.llvm.org
 >> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
 >>
 > _______________________________________________
 > LLVM Developers mailing list
 > llvm-dev at lists.llvm.org
 > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

llvm dev - Oct 2019 - Matrix Multiplication not Vectorized using double pointers

[llvm-dev] Matrix Multiplication not Vectorized using double pointers

[llvm-dev] Matrix Multiplication not Vectorized using double pointers

[llvm-dev] Matrix Multiplication not Vectorized using double pointers