thr3ads.net - llvm dev - [llvm-dev] Determination of statements that contain only matrix multiplication [May 2016]

If this information is useful, please help other people find it:
Share via:

Tobias Grosser via llvm-dev

2016-May-17 14:37 UTC

[llvm-dev] Determination of statements that contain only matrix multiplication

On 05/17/2016 01:47 PM, Michael Kruse wrote:> 2016-05-16 19:52 GMT+02:00 Roman Gareev <gareevroman at gmail.com>:
>> Hi Tobias,
>>
>> could we use information about memory accesses of a SCoP statement and
>> def-use chains to determine statements, which don’t contain matrix
>> multiplication of the following form?
> 
> Assuming s/don't/do you want to pattern-match gemm kernels inside
larger scops.
> 
> 
>> for (int i = 0; i < Upper Bound1; i++)
>>   for (int j = 0; j < Upper Bound2; j++)
>>     for (int k = 0; k < Upper Bound3; j++)
>>       C[i][j] += A[i][k] * B[k][j]
>>
>> We could probably check that memory access relations have the following
form:
>>
>> "accesses" : [
>>   {
>>     "kind" : "read",
>>     "relation" : "{ Stmt_14[i0, i1, i2] ->
MemRef_1[i0, i2] }"
>>   },
>>   {
>>     "kind" : "read",
>>     "relation" : "{ Stmt_14[i0, i1, i2] ->
MemRef_2[i2, i1] }"
>>   },
>>   {
>>     "kind" : "read",
>>     "relation" : "{ Stmt_14[i0, i1, i2] ->
MemRef_3[i0, i1] }"
>>   },
>>   {
>>     "kind" : "write",
>>     "relation" : "{ Stmt_14[i0, i1, i2] ->
MemRef_4[i0, i1] }"
>>   }
>> ]
>>
>> and there are the following relations between access instructions of
>> the memory accesses:
>>
>> AccessInst1 --- def-use --- \
>> AccessInst2 --- def-use --- fmul
>>                                             |
>>                                        def-use
>>                                             |
>> AccessInst3 --- def-use --- fadd
>>                                             |
>>                                        def-use
>>                                             |
>>                                  store (AccessInst4)
>>
>> (fmul is a user of AccessInst1 and AccessInst2. fadd is a user of the
>> fmul and AccessInst3. store (AccessInst4) is a user of fadd)
> 
> You'd also have to check which access are not there. E.g. the value of
> C[i][j] might be used in the inner loop for some other purpose as
> well, which would make some transformations illegal.
> 
> Also notice that LICM will hoist the C[i][j] reduction out of the
> k-loop, i.e. the pattern is different. DeLICM (WIP) is intended to
> undo this again.
Very right.
>> Maybe it could be a temporary solution. I think that if the checks are
>> successfully passed and the basic block of the statement has exactly
>> 14 instructions, the statement contains matrix multiplication and can
>> be safely optimized with a generation of specific code, which takes
>> into account information about usage of SIMD registers.
> 
> What kind of transformation do you have in mind that cannot be done in
> a general way? If you want to take register pressure into account, why
> not also for other scops?
Good questions.
> Like the LoopIdiomPass, one could replace a detected gemm kernel my a
> call to a hand-optimized/external [sdcz]gemm function. However, I
> think the goal of Polly is to optimize such kernels by itself.
Right. This does not seem to be too interesting.

I am personally mostly interested in getting the pieces in place that
allow us to get close-to-peak performance for gemm, but agree with
Michael that we clearly want to generalize this in the future.

However, going with a vertical implementation first might be a good idea
(assuming the pattern matching is simple). In this case we could detect
a gemm kernel and use this logic to get all important optimizations we
need in place.

Best,
Tobias

Roman Gareev via llvm-dev

2016-May-19 14:09 UTC

head link

[llvm-dev] Determination of statements that contain only matrix multiplication

Thank you for the comments and answers!

2016-05-17 19:37 GMT+05:00 Tobias Grosser <tobias at
grosser.es>:> On 05/17/2016 01:47 PM, Michael Kruse wrote:
>> 2016-05-16 19:52 GMT+02:00 Roman Gareev <gareevroman at
gmail.com>:
>>> Hi Tobias,
>>>
>>> could we use information about memory accesses of a SCoP statement
and
>>> def-use chains to determine statements, which don’t contain matrix
>>> multiplication of the following form?
>>
>> Assuming s/don't/do you want to pattern-match gemm kernels inside
larger scops.
>>
>>
>>> for (int i = 0; i < Upper Bound1; i++)
>>>   for (int j = 0; j < Upper Bound2; j++)
>>>     for (int k = 0; k < Upper Bound3; j++)
>>>       C[i][j] += A[i][k] * B[k][j]
>>>
>>> We could probably check that memory access relations have the
following form:
>>>
>>> "accesses" : [
>>>   {
>>>     "kind" : "read",
>>>     "relation" : "{ Stmt_14[i0, i1, i2] ->
MemRef_1[i0, i2] }"
>>>   },
>>>   {
>>>     "kind" : "read",
>>>     "relation" : "{ Stmt_14[i0, i1, i2] ->
MemRef_2[i2, i1] }"
>>>   },
>>>   {
>>>     "kind" : "read",
>>>     "relation" : "{ Stmt_14[i0, i1, i2] ->
MemRef_3[i0, i1] }"
>>>   },
>>>   {
>>>     "kind" : "write",
>>>     "relation" : "{ Stmt_14[i0, i1, i2] ->
MemRef_4[i0, i1] }"
>>>   }
>>> ]
>>>
>>> and there are the following relations between access instructions
of
>>> the memory accesses:
>>>
>>> AccessInst1 --- def-use --- \
>>> AccessInst2 --- def-use --- fmul
>>>                                             |
>>>                                        def-use
>>>                                             |
>>> AccessInst3 --- def-use --- fadd
>>>                                             |
>>>                                        def-use
>>>                                             |
>>>                                  store (AccessInst4)
>>>
>>> (fmul is a user of AccessInst1 and AccessInst2. fadd is a user of
the
>>> fmul and AccessInst3. store (AccessInst4) is a user of fadd)
>>
>> You'd also have to check which access are not there. E.g. the value
of
>> C[i][j] might be used in the inner loop for some other purpose as
>> well, which would make some transformations illegal.
>>
>> Also notice that LICM will hoist the C[i][j] reduction out of the
>> k-loop, i.e. the pattern is different. DeLICM (WIP) is intended to
>> undo this again.
>
> Very right.
I agree with it. However, it could probably be used, if we were able
to apply only transformations, which can be done in general way.
>>> Maybe it could be a temporary solution. I think that if the checks
are
>>> successfully passed and the basic block of the statement has
exactly
>>> 14 instructions, the statement contains matrix multiplication and
can
>>> be safely optimized with a generation of specific code, which takes
>>> into account information about usage of SIMD registers.
>>
> What kind of transformation do you have in mind that cannot be done in
> a general way? If you want to take register pressure into account, why
> not also for other stops?
I wasn’t sure whether llvm is able to hoist loads into SIMD registers
and sink stores from them. However, yesterday I found out that in some
cases llvm can perform such transformations.

For example, let’s consider the following code:

Loop 1 for ic = 0, ..., m−1 in steps of mc
Loop 2   for pc = 0, ..., k−1 in steps of kc
                A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac // Pack into Ac
Loop 3     for jc = 0, ..., n−1 in steps of nc
                  B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc // Pack into Bc
Loop 4       for ir = 0, ..., mc −1 in steps of mr
Loop 5         for jr = 0, ..., nc −1 in steps of nr
Loop 6           for pr = 0, ..., kc −1 in steps of 1
Loop 7             for ir' = 0, ..., mr −1 in steps of 1
Loop 8               for jr' = 0, ..., nr −1 in steps of 1
                            C[ic * mr + i * mc + ir'][jc * nr + j * nc
+ jr'] += Ac[mr * (pr + kc * ic) + ir'] * Bc[(pr + jc * kc) * nr +
jr'];
                            endfor
                          endfor
                        endfor
                      endfor
                    endfor
                  endfor
                endfor
              endfor

To get closer to an implementation of the algorithm from [1] for
matrices stored in row-major order, we can unroll loop 7 and loop 8
and perform vectorization with llvm (a corresponding code can be found
attached). According to the attached IR, llvm sinks and hoists stores
and loads related to matrix C.

Nevertheless, llvm can’t do it, if, for example, we call prefetch
within loop 6 or apply packing transformations in a way that is
similar to the one mentioned in [2] (corresponding implementations are
attached to the email). I haven’t found the reason yet.

Refs:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[2] - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm
>> Like the LoopIdiomPass, one could replace a detected gemm kernel my a
>> call to a hand-optimized/external [sdcz]gemm function. However, I
>> think the goal of Polly is to optimize such kernels by itself.
>
> Right. This does not seem to be too interesting.
>
> I am personally mostly interested in getting the pieces in place that
> allow us to get close-to-peak performance for gemm, but agree with
> Michael that we clearly want to generalize this in the future.
>
> However, going with a vertical implementation first might be a good idea
> (assuming the pattern matching is simple). In this case we could detect
> a gemm kernel and use this logic to get all important optimizations we
> need in place.
Yes, it is planned for the future. Sorry, I should probably have
mentioned it in the proposal.

-- 
                                    Cheers, Roman Gareev.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gemm_SIMD.c
Type: text/x-csrc
Size: 8387 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160519/ce7cd76c/attachment.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: IR
Type: application/octet-stream
Size: 17577 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160519/ce7cd76c/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gemm_SIMD_with_prefetch.c
Type: text/x-csrc
Size: 8492 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160519/ce7cd76c/attachment-0001.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gemm_C_SIMD_another_packing.c
Type: text/x-csrc
Size: 7900 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160519/ce7cd76c/attachment-0002.c>

Michael Kruse via llvm-dev

2016-May-19 14:58 UTC

head link

[llvm-dev] Determination of statements that contain only matrix multiplication

Thank you for the elaborate explanation, although I don't have time to
go through all of them.

2016-05-19 16:09 GMT+02:00 Roman Gareev <gareevroman at
gmail.com>:> To get closer to an implementation of the algorithm from [1] for
> matrices stored in row-major order, we can unroll loop 7 and loop 8
> and perform vectorization with llvm (a corresponding code can be found
> attached). According to the attached IR, llvm sinks and hoists stores
> and loads related to matrix C.
>
> Nevertheless, llvm can’t do it, if, for example, we call prefetch
> within loop 6 or apply packing transformations in a way that is
> similar to the one mentioned in [2] (corresponding implementations are
> attached to the email). I haven’t found the reason yet.
>
> Refs:
>
> [1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
> [2] - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm
It would be great if you find a general way to handle such concerns,
maybe even modifying LLVM's passes. Since Tobias is your advisor, you
might discuss with him the details for which it is handling cases in
more general ways vs. only in detected gemm kernels.

AFAIK prefetch instructions are not automatically inserted anywhere in
LLVM. It would be nice to insert them X instructions/accesses in
advance, where X is determined somehow. For gemm only, one could do
some measurement for different architectures, but we don't get around
to have have it derived from some command-line argument in either
case.

Michael

-- 
Tardyzentrismus verboten!

Roman Gareev via llvm-dev

2016-May-20 13:05 UTC

head link

[llvm-dev] Determination of statements that contain only matrix multiplication

2016-05-19 21:45 GMT+05:00 4lbert C0hen <4lbert.h.c0hen at
gmail.com>:> One short note. I would advise against spending time on prefetching for
x86.
> Recent hardware prefetchers are amazingly good at strided accesses in
> single-threaded code. Caution: this is not based on objective/published
> data, but on personal experience.
>
> There are open challenges in multiprocessor prefecthing, even for regularly
> strided data, but these are probabably too ambitious to be tackled
> effectively in the time frame of a SoC. There are lots of papers on this
> however.
>
> Now, if you are targeting lower power processors, including most ARM v7a/v8
> implementations, prefetching may be much more important. There are much
> fewer publications on BLAS optimizations for ARM, but they exist. Let me
> know if you need pointers.
>
> In any case, if you have to go beyond loop transformations, unrolling,
> register blocking, I would advise to look into data layout transformations,
> compacting strided blocks, which is one of the key optimizations in [2], or
> transposition. These primarily help with TLB misses, enabling the full
power
> of 3D tiling (they have less impact on 2D tiling). And by the way, they
will
> in turn improve the effectiveness of hardware prefetching.
Thank you very much for the advices! I could probably try to avoid
using of nonhardware prefetching in the project, if Tobias doesn’t
disagree with it. My understanding is that prefetching isn’t used
explicitly in [1] and, according to [2], in some cases 90% of the
turbo boost peak of the processor can be attained without it.

I started to consider prefetching, because it’s used in
implementations of gemm micro-kernels of BLIS framework [3]. If I’m
not mistaken, it’s applied to try to make sure that micro-panel Br is
loaded after micro-panel Ar (as required in [1] p. 11). For example,
its using helps to reduce the execution time of the attached
implementation.

Refs:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[2] - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm
[3] -
https://github.com/flame/blis/blob/master/kernels/x86_64/sandybridge/3/bli_gemm_int_d8x4.c

-- 
                                    Cheers, Roman Gareev.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gemm_C_SIMD.c
Type: text/x-csrc
Size: 5697 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160520/16edb8dc/attachment.c>

4lbert C0hen via llvm-dev

2016-May-28 14:48 UTC

head link

[llvm-dev] Determination of statements that contain only matrix multiplication

Sorry for not responding earlier.

On 05/20/2016 03:05 PM, Roman Gareev wrote:> Thank you very much for the advices! I could probably try to avoid
> using of nonhardware prefetching in the project, if Tobias doesn’t
> disagree with it. My understanding is that prefetching isn’t used
> explicitly in [1] and, according to [2], in some cases 90% of the
> turbo boost peak of the processor can be attained without it.
Too many negations :-) I'm not sure I followed exactly what you wanted 
to say, but I understand that this is not the priority since you can get 
90% of the performance without worrying about prefetching.
> I started to consider prefetching, because it’s used in
> implementations of gemm micro-kernels of BLIS framework [3]. If I’m
> not mistaken, it’s applied to try to make sure that micro-panel Br is
> loaded after micro-panel Ar (as required in [1] p. 11). For example,
> its using helps to reduce the execution time of the attached
> implementation.
Interesting. The BLIS implementation prefetches only the first cache 
line, before traversing a given interval of memory. This clearly 
confirms the implementation relies on hardware preteching to prefetch 
the subsequent lines. This makes a lot of sense. Yet surprisingly, the 
BLIS implementation does not attempt at anticipating the fetch. It 
schedules the prefetch instruction right before the first load of a 
given interval.
> Refs:
>
> [1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
> [2] - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm
> [3] -
https://github.com/flame/blis/blob/master/kernels/x86_64/sandybridge/3/bli_gemm_int_d8x4.c
>

Maybe Matching Threads

Search for more apparently analagous threads

llvm dev - May 2016 - Determination of statements that contain only matrix multiplication

[llvm-dev] Determination of statements that contain only matrix multiplication

[llvm-dev] Determination of statements that contain only matrix multiplication

[llvm-dev] Determination of statements that contain only matrix multiplication

[llvm-dev] Determination of statements that contain only matrix multiplication

[llvm-dev] Determination of statements that contain only matrix multiplication

Maybe Matching Threads