thr3ads.net - llvm dev - [llvm-dev] Determination of statements that contain only matrix multiplication [May 2016]

If this information is useful, please help other people find it:
Share via:

Tobias Grosser via llvm-dev

2016-May-20 09:17 UTC

[llvm-dev] Determination of statements that contain only matrix multiplication

>>>> Maybe it could be a temporary solution. I think that if the
checks are
>>>> successfully passed and the basic block of the statement has
exactly
>>>> 14 instructions, the statement contains matrix multiplication
and can
>>>> be safely optimized with a generation of specific code, which
takes
>>>> into account information about usage of SIMD registers.
>>>
>> What kind of transformation do you have in mind that cannot be done in
>> a general way? If you want to take register pressure into account, why
>> not also for other stops?
> 
> I wasn’t sure whether llvm is able to hoist loads into SIMD registers
> and sink stores from them. However, yesterday I found out that in some
> cases llvm can perform such transformations.
> 
> For example, let’s consider the following code:
> 
> Loop 1 for ic = 0, ..., m−1 in steps of mc
> Loop 2   for pc = 0, ..., k−1 in steps of kc
>                 A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac // Pack into Ac
> Loop 3     for jc = 0, ..., n−1 in steps of nc
>                   B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc // Pack into
Bc
> Loop 4       for ir = 0, ..., mc −1 in steps of mr
> Loop 5         for jr = 0, ..., nc −1 in steps of nr
> Loop 6           for pr = 0, ..., kc −1 in steps of 1
> Loop 7             for ir' = 0, ..., mr −1 in steps of 1
> Loop 8               for jr' = 0, ..., nr −1 in steps of 1
>                             C[ic * mr + i * mc + ir'][jc * nr + j * nc
> + jr'] += Ac[mr * (pr + kc * ic) + ir'] * Bc[(pr + jc * kc) * nr +
> jr'];
>                             endfor
>                           endfor
>                         endfor
>                       endfor
>                     endfor
>                   endfor
>                 endfor
>               endfor
> 
> To get closer to an implementation of the algorithm from [1] for
> matrices stored in row-major order, we can unroll loop 7 and loop 8
> and perform vectorization with llvm (a corresponding code can be found
> attached). According to the attached IR, llvm sinks and hoists stores
> and loads related to matrix C.
> 
> Nevertheless, llvm can’t do it, if, for example, we call prefetch
> within loop 6
This is likely because LLVM does not understand the side-effects
of the prefetch instruction.
> or apply packing transformations in a way that is
> similar to the one mentioned in [2] (corresponding implementations are
> attached to the email). I haven’t found the reason yet.
I assume this is also due to the unknown side-effects of the packing
implementation.
(LLVM does not understand that the packing function-call does not modify
the C array).

Both can likely be addressed by improving aliasing-modref information,
adding relevant annotations or just hacking the code to ignore this
issue. Otherwise, we can probably also generate IR that makes it easier
for LLVM to perform this register promotion.

I am not fully sure where you are heading at the moment. Do you have
identified a set of individual tasks/changes that you could implement?

Best,
Tobias

Roman Gareev via llvm-dev

2016-May-20 13:07 UTC

head link

[llvm-dev] Determination of statements that contain only matrix multiplication

2016-05-20 14:17 GMT+05:00 Tobias Grosser <tobias at
grosser.es>:>>>>> Maybe it could be a temporary solution. I think that if the
checks are
>>>>> successfully passed and the basic block of the statement
has exactly
>>>>> 14 instructions, the statement contains matrix
multiplication and can
>>>>> be safely optimized with a generation of specific code,
which takes
>>>>> into account information about usage of SIMD registers.
>>>>
>>> What kind of transformation do you have in mind that cannot be done
in
>>> a general way? If you want to take register pressure into account,
why
>>> not also for other stops?
>>
>> I wasn’t sure whether llvm is able to hoist loads into SIMD registers
>> and sink stores from them. However, yesterday I found out that in some
>> cases llvm can perform such transformations.
>>
>> For example, let’s consider the following code:
>>
>> Loop 1 for ic = 0, ..., m−1 in steps of mc
>> Loop 2   for pc = 0, ..., k−1 in steps of kc
>>                 A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac // Pack into
Ac
>> Loop 3     for jc = 0, ..., n−1 in steps of nc
>>                   B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc // Pack
into Bc
>> Loop 4       for ir = 0, ..., mc −1 in steps of mr
>> Loop 5         for jr = 0, ..., nc −1 in steps of nr
>> Loop 6           for pr = 0, ..., kc −1 in steps of 1
>> Loop 7             for ir' = 0, ..., mr −1 in steps of 1
>> Loop 8               for jr' = 0, ..., nr −1 in steps of 1
>>                             C[ic * mr + i * mc + ir'][jc * nr + j *
nc
>> + jr'] += Ac[mr * (pr + kc * ic) + ir'] * Bc[(pr + jc * kc) *
nr +
>> jr'];
>>                             endfor
>>                           endfor
>>                         endfor
>>                       endfor
>>                     endfor
>>                   endfor
>>                 endfor
>>               endfor
>>
>> To get closer to an implementation of the algorithm from [1] for
>> matrices stored in row-major order, we can unroll loop 7 and loop 8
>> and perform vectorization with llvm (a corresponding code can be found
>> attached). According to the attached IR, llvm sinks and hoists stores
>> and loads related to matrix C.
>>
>> Nevertheless, llvm can’t do it, if, for example, we call prefetch
>> within loop 6
>
> This is likely because LLVM does not understand the side-effects
> of the prefetch instruction.
>
>> or apply packing transformations in a way that is
>> similar to the one mentioned in [2] (corresponding implementations are
>> attached to the email). I haven’t found the reason yet.
>
> I assume this is also due to the unknown side-effects of the packing
> implementation.
> (LLVM does not understand that the packing function-call does not modify
> the C array).
>
> Both can likely be addressed by improving aliasing-modref information,
> adding relevant annotations or just hacking the code to ignore this
> issue. Otherwise, we can probably also generate IR that makes it easier
> for LLVM to perform this register promotion.
I think that this is right. Thank you for the explanation!
> I am not fully sure where you are heading at the moment. Do you have
> identified a set of individual tasks/changes that you could implement?
If I’m not mistaken, we decided that the pattern matching could be a
good starting point. That’s why I’m planning to try to implement it in
the near future. Implementation of tiling, interchanging and unrolling
of specific loops based on the algorithm presented in [1] could be the
next step. I could start working on the packing transformation, if the
previous tasks are implemented.

What do you think about it?

Refs:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf

-- 
                                    Cheers, Roman Gareev.

Tobias Grosser via llvm-dev

2016-May-20 14:11 UTC

head link

[llvm-dev] Determination of statements that contain only matrix multiplication

On 05/20/2016 03:07 PM, Roman Gareev wrote:> 2016-05-20 14:17 GMT+05:00 Tobias Grosser <tobias at grosser.es>:
>>>>>> Maybe it could be a temporary solution. I think that if
the checks are
>>>>>> successfully passed and the basic block of the
statement has exactly
>>>>>> 14 instructions, the statement contains matrix
multiplication and can
>>>>>> be safely optimized with a generation of specific code,
which takes
>>>>>> into account information about usage of SIMD registers.
>>>>>
>>>> What kind of transformation do you have in mind that cannot be
done in
>>>> a general way? If you want to take register pressure into
account, why
>>>> not also for other stops?
>>>
>>> I wasn’t sure whether llvm is able to hoist loads into SIMD
registers
>>> and sink stores from them. However, yesterday I found out that in
some
>>> cases llvm can perform such transformations.
>>>
>>> For example, let’s consider the following code:
>>>
>>> Loop 1 for ic = 0, ..., m−1 in steps of mc
>>> Loop 2   for pc = 0, ..., k−1 in steps of kc
>>>                 A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac // Pack
into Ac
>>> Loop 3     for jc = 0, ..., n−1 in steps of nc
>>>                   B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc //
Pack into Bc
>>> Loop 4       for ir = 0, ..., mc −1 in steps of mr
>>> Loop 5         for jr = 0, ..., nc −1 in steps of nr
>>> Loop 6           for pr = 0, ..., kc −1 in steps of 1
>>> Loop 7             for ir' = 0, ..., mr −1 in steps of 1
>>> Loop 8               for jr' = 0, ..., nr −1 in steps of 1
>>>                             C[ic * mr + i * mc + ir'][jc * nr +
j * nc
>>> + jr'] += Ac[mr * (pr + kc * ic) + ir'] * Bc[(pr + jc * kc)
* nr +
>>> jr'];
>>>                             endfor
>>>                           endfor
>>>                         endfor
>>>                       endfor
>>>                     endfor
>>>                   endfor
>>>                 endfor
>>>               endfor
>>>
>>> To get closer to an implementation of the algorithm from [1] for
>>> matrices stored in row-major order, we can unroll loop 7 and loop 8
>>> and perform vectorization with llvm (a corresponding code can be
found
>>> attached). According to the attached IR, llvm sinks and hoists
stores
>>> and loads related to matrix C.
>>>
>>> Nevertheless, llvm can’t do it, if, for example, we call prefetch
>>> within loop 6
>>
>> This is likely because LLVM does not understand the side-effects
>> of the prefetch instruction.
>>
>>> or apply packing transformations in a way that is
>>> similar to the one mentioned in [2] (corresponding implementations
are
>>> attached to the email). I haven’t found the reason yet.
>>
>> I assume this is also due to the unknown side-effects of the packing
>> implementation.
>> (LLVM does not understand that the packing function-call does not
modify
>> the C array).
>>
>> Both can likely be addressed by improving aliasing-modref information,
>> adding relevant annotations or just hacking the code to ignore this
>> issue. Otherwise, we can probably also generate IR that makes it easier
>> for LLVM to perform this register promotion.
> 
> I think that this is right. Thank you for the explanation!
> 
>> I am not fully sure where you are heading at the moment. Do you have
>> identified a set of individual tasks/changes that you could implement?
> 
> If I’m not mistaken, we decided that the pattern matching could be a
> good starting point. That’s why I’m planning to try to implement it in
> the near future. Implementation of tiling, interchanging and unrolling
> of specific loops based on the algorithm presented in [1] could be the
> next step. I could start working on the packing transformation, if the
> previous tasks are implemented.
> 
> What do you think about it?
Sounds great.

Best,
Tobias

Seemingly Similar Threads

Search for more apparently analagous threads

llvm dev - May 2016 - Determination of statements that contain only matrix multiplication

[llvm-dev] Determination of statements that contain only matrix multiplication

[llvm-dev] Determination of statements that contain only matrix multiplication

[llvm-dev] Determination of statements that contain only matrix multiplication

Seemingly Similar Threads