thr3ads.net - llvm dev - [llvm-dev] Determination of statements that contain only matrix multiplication [May 2016]

If this information is useful, please help other people find it:
Share via:

Roman Gareev via llvm-dev

2016-May-16 17:52 UTC

[llvm-dev] Determination of statements that contain only matrix multiplication

Hi Tobias,

could we use information about memory accesses of a SCoP statement and
def-use chains to determine statements, which don’t contain matrix
multiplication of the following form?

for (int i = 0; i < Upper Bound1; i++)
  for (int j = 0; j < Upper Bound2; j++)
    for (int k = 0; k < Upper Bound3; j++)
      C[i][j] += A[i][k] * B[k][j]

We could probably check that memory access relations have the following form:

"accesses" : [
  {
    "kind" : "read",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_1[i0, i2]
}"
  },
  {
    "kind" : "read",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_2[i2, i1]
}"
  },
  {
    "kind" : "read",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_3[i0, i1]
}"
  },
  {
    "kind" : "write",
    "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_4[i0, i1]
}"
  }
]

and there are the following relations between access instructions of
the memory accesses:

AccessInst1 --- def-use --- \
AccessInst2 --- def-use --- fmul
                                            |
                                       def-use
                                            |
AccessInst3 --- def-use --- fadd
                                            |
                                       def-use
                                            |
                                 store (AccessInst4)

(fmul is a user of AccessInst1 and AccessInst2. fadd is a user of the
fmul and AccessInst3. store (AccessInst4) is a user of fadd)

Maybe it could be a temporary solution. I think that if the checks are
successfully passed and the basic block of the statement has exactly
14 instructions, the statement contains matrix multiplication and can
be safely optimized with a generation of specific code, which takes
into account information about usage of SIMD registers.

I have one more question. Are memory accesses of MemAccs from the
ScopStmt class ordered by their sequence order?

-- 
                                    Cheers, Roman Gareev.

Michael Kruse via llvm-dev

2016-May-17 11:47 UTC

head link

[llvm-dev] Determination of statements that contain only matrix multiplication

2016-05-16 19:52 GMT+02:00 Roman Gareev <gareevroman at
gmail.com>:> Hi Tobias,
>
> could we use information about memory accesses of a SCoP statement and
> def-use chains to determine statements, which don’t contain matrix
> multiplication of the following form?
Assuming s/don't/do you want to pattern-match gemm kernels inside larger
scops.

> for (int i = 0; i < Upper Bound1; i++)
>   for (int j = 0; j < Upper Bound2; j++)
>     for (int k = 0; k < Upper Bound3; j++)
>       C[i][j] += A[i][k] * B[k][j]
>
> We could probably check that memory access relations have the following
form:
>
> "accesses" : [
>   {
>     "kind" : "read",
>     "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_1[i0,
i2] }"
>   },
>   {
>     "kind" : "read",
>     "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_2[i2,
i1] }"
>   },
>   {
>     "kind" : "read",
>     "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_3[i0,
i1] }"
>   },
>   {
>     "kind" : "write",
>     "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_4[i0,
i1] }"
>   }
> ]
>
> and there are the following relations between access instructions of
> the memory accesses:
>
> AccessInst1 --- def-use --- \
> AccessInst2 --- def-use --- fmul
>                                             |
>                                        def-use
>                                             |
> AccessInst3 --- def-use --- fadd
>                                             |
>                                        def-use
>                                             |
>                                  store (AccessInst4)
>
> (fmul is a user of AccessInst1 and AccessInst2. fadd is a user of the
> fmul and AccessInst3. store (AccessInst4) is a user of fadd)
You'd also have to check which access are not there. E.g. the value of
C[i][j] might be used in the inner loop for some other purpose as
well, which would make some transformations illegal.

Also notice that LICM will hoist the C[i][j] reduction out of the
k-loop, i.e. the pattern is different. DeLICM (WIP) is intended to
undo this again.

> Maybe it could be a temporary solution. I think that if the checks are
> successfully passed and the basic block of the statement has exactly
> 14 instructions, the statement contains matrix multiplication and can
> be safely optimized with a generation of specific code, which takes
> into account information about usage of SIMD registers.
What kind of transformation do you have in mind that cannot be done in
a general way? If you want to take register pressure into account, why
not also for other scops?

Like the LoopIdiomPass, one could replace a detected gemm kernel my a
call to a hand-optimized/external [sdcz]gemm function. However, I
think the goal of Polly is to optimize such kernels by itself.

> I have one more question. Are memory accesses of MemAccs from the
> ScopStmt class ordered by their sequence order?
Only the MK_Array accesses should be in order relative to each other.
All scalar reads happen when entering a statement and all scalar
writes at the end of it.

Michael

-- 
Tardyzentrismus verboten!

Tobias Grosser via llvm-dev

2016-May-17 14:37 UTC

head link

[llvm-dev] Determination of statements that contain only matrix multiplication

On 05/17/2016 01:47 PM, Michael Kruse wrote:> 2016-05-16 19:52 GMT+02:00 Roman Gareev <gareevroman at gmail.com>:
>> Hi Tobias,
>>
>> could we use information about memory accesses of a SCoP statement and
>> def-use chains to determine statements, which don’t contain matrix
>> multiplication of the following form?
> 
> Assuming s/don't/do you want to pattern-match gemm kernels inside
larger scops.
> 
> 
>> for (int i = 0; i < Upper Bound1; i++)
>>   for (int j = 0; j < Upper Bound2; j++)
>>     for (int k = 0; k < Upper Bound3; j++)
>>       C[i][j] += A[i][k] * B[k][j]
>>
>> We could probably check that memory access relations have the following
form:
>>
>> "accesses" : [
>>   {
>>     "kind" : "read",
>>     "relation" : "{ Stmt_14[i0, i1, i2] ->
MemRef_1[i0, i2] }"
>>   },
>>   {
>>     "kind" : "read",
>>     "relation" : "{ Stmt_14[i0, i1, i2] ->
MemRef_2[i2, i1] }"
>>   },
>>   {
>>     "kind" : "read",
>>     "relation" : "{ Stmt_14[i0, i1, i2] ->
MemRef_3[i0, i1] }"
>>   },
>>   {
>>     "kind" : "write",
>>     "relation" : "{ Stmt_14[i0, i1, i2] ->
MemRef_4[i0, i1] }"
>>   }
>> ]
>>
>> and there are the following relations between access instructions of
>> the memory accesses:
>>
>> AccessInst1 --- def-use --- \
>> AccessInst2 --- def-use --- fmul
>>                                             |
>>                                        def-use
>>                                             |
>> AccessInst3 --- def-use --- fadd
>>                                             |
>>                                        def-use
>>                                             |
>>                                  store (AccessInst4)
>>
>> (fmul is a user of AccessInst1 and AccessInst2. fadd is a user of the
>> fmul and AccessInst3. store (AccessInst4) is a user of fadd)
> 
> You'd also have to check which access are not there. E.g. the value of
> C[i][j] might be used in the inner loop for some other purpose as
> well, which would make some transformations illegal.
> 
> Also notice that LICM will hoist the C[i][j] reduction out of the
> k-loop, i.e. the pattern is different. DeLICM (WIP) is intended to
> undo this again.
Very right.
>> Maybe it could be a temporary solution. I think that if the checks are
>> successfully passed and the basic block of the statement has exactly
>> 14 instructions, the statement contains matrix multiplication and can
>> be safely optimized with a generation of specific code, which takes
>> into account information about usage of SIMD registers.
> 
> What kind of transformation do you have in mind that cannot be done in
> a general way? If you want to take register pressure into account, why
> not also for other scops?
Good questions.
> Like the LoopIdiomPass, one could replace a detected gemm kernel my a
> call to a hand-optimized/external [sdcz]gemm function. However, I
> think the goal of Polly is to optimize such kernels by itself.
Right. This does not seem to be too interesting.

I am personally mostly interested in getting the pieces in place that
allow us to get close-to-peak performance for gemm, but agree with
Michael that we clearly want to generalize this in the future.

However, going with a vertical implementation first might be a good idea
(assuming the pattern matching is simple). In this case we could detect
a gemm kernel and use this logic to get all important optimizations we
need in place.

Best,
Tobias

Reasonably Related Threads

Search for more seemingly similar threads

llvm dev - May 2016 - Determination of statements that contain only matrix multiplication

[llvm-dev] Determination of statements that contain only matrix multiplication

[llvm-dev] Determination of statements that contain only matrix multiplication

[llvm-dev] Determination of statements that contain only matrix multiplication

Reasonably Related Threads