Tobias Grosser via llvm-dev
2016-May-20 09:17 UTC
[llvm-dev] Determination of statements that contain only matrix multiplication
>>>> Maybe it could be a temporary solution. I think that if the checks are >>>> successfully passed and the basic block of the statement has exactly >>>> 14 instructions, the statement contains matrix multiplication and can >>>> be safely optimized with a generation of specific code, which takes >>>> into account information about usage of SIMD registers. >>> >> What kind of transformation do you have in mind that cannot be done in >> a general way? If you want to take register pressure into account, why >> not also for other stops? > > I wasn’t sure whether llvm is able to hoist loads into SIMD registers > and sink stores from them. However, yesterday I found out that in some > cases llvm can perform such transformations. > > For example, let’s consider the following code: > > Loop 1 for ic = 0, ..., m−1 in steps of mc > Loop 2 for pc = 0, ..., k−1 in steps of kc > A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac // Pack into Ac > Loop 3 for jc = 0, ..., n−1 in steps of nc > B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc // Pack into Bc > Loop 4 for ir = 0, ..., mc −1 in steps of mr > Loop 5 for jr = 0, ..., nc −1 in steps of nr > Loop 6 for pr = 0, ..., kc −1 in steps of 1 > Loop 7 for ir' = 0, ..., mr −1 in steps of 1 > Loop 8 for jr' = 0, ..., nr −1 in steps of 1 > C[ic * mr + i * mc + ir'][jc * nr + j * nc > + jr'] += Ac[mr * (pr + kc * ic) + ir'] * Bc[(pr + jc * kc) * nr + > jr']; > endfor > endfor > endfor > endfor > endfor > endfor > endfor > endfor > > To get closer to an implementation of the algorithm from [1] for > matrices stored in row-major order, we can unroll loop 7 and loop 8 > and perform vectorization with llvm (a corresponding code can be found > attached). According to the attached IR, llvm sinks and hoists stores > and loads related to matrix C. > > Nevertheless, llvm can’t do it, if, for example, we call prefetch > within loop 6This is likely because LLVM does not understand the side-effects of the prefetch instruction.> or apply packing transformations in a way that is > similar to the one mentioned in [2] (corresponding implementations are > attached to the email). I haven’t found the reason yet.I assume this is also due to the unknown side-effects of the packing implementation. (LLVM does not understand that the packing function-call does not modify the C array). Both can likely be addressed by improving aliasing-modref information, adding relevant annotations or just hacking the code to ignore this issue. Otherwise, we can probably also generate IR that makes it easier for LLVM to perform this register promotion. I am not fully sure where you are heading at the moment. Do you have identified a set of individual tasks/changes that you could implement? Best, Tobias
Roman Gareev via llvm-dev
2016-May-20 13:07 UTC
[llvm-dev] Determination of statements that contain only matrix multiplication
2016-05-20 14:17 GMT+05:00 Tobias Grosser <tobias at grosser.es>:>>>>> Maybe it could be a temporary solution. I think that if the checks are >>>>> successfully passed and the basic block of the statement has exactly >>>>> 14 instructions, the statement contains matrix multiplication and can >>>>> be safely optimized with a generation of specific code, which takes >>>>> into account information about usage of SIMD registers. >>>> >>> What kind of transformation do you have in mind that cannot be done in >>> a general way? If you want to take register pressure into account, why >>> not also for other stops? >> >> I wasn’t sure whether llvm is able to hoist loads into SIMD registers >> and sink stores from them. However, yesterday I found out that in some >> cases llvm can perform such transformations. >> >> For example, let’s consider the following code: >> >> Loop 1 for ic = 0, ..., m−1 in steps of mc >> Loop 2 for pc = 0, ..., k−1 in steps of kc >> A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac // Pack into Ac >> Loop 3 for jc = 0, ..., n−1 in steps of nc >> B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc // Pack into Bc >> Loop 4 for ir = 0, ..., mc −1 in steps of mr >> Loop 5 for jr = 0, ..., nc −1 in steps of nr >> Loop 6 for pr = 0, ..., kc −1 in steps of 1 >> Loop 7 for ir' = 0, ..., mr −1 in steps of 1 >> Loop 8 for jr' = 0, ..., nr −1 in steps of 1 >> C[ic * mr + i * mc + ir'][jc * nr + j * nc >> + jr'] += Ac[mr * (pr + kc * ic) + ir'] * Bc[(pr + jc * kc) * nr + >> jr']; >> endfor >> endfor >> endfor >> endfor >> endfor >> endfor >> endfor >> endfor >> >> To get closer to an implementation of the algorithm from [1] for >> matrices stored in row-major order, we can unroll loop 7 and loop 8 >> and perform vectorization with llvm (a corresponding code can be found >> attached). According to the attached IR, llvm sinks and hoists stores >> and loads related to matrix C. >> >> Nevertheless, llvm can’t do it, if, for example, we call prefetch >> within loop 6 > > This is likely because LLVM does not understand the side-effects > of the prefetch instruction. > >> or apply packing transformations in a way that is >> similar to the one mentioned in [2] (corresponding implementations are >> attached to the email). I haven’t found the reason yet. > > I assume this is also due to the unknown side-effects of the packing > implementation. > (LLVM does not understand that the packing function-call does not modify > the C array). > > Both can likely be addressed by improving aliasing-modref information, > adding relevant annotations or just hacking the code to ignore this > issue. Otherwise, we can probably also generate IR that makes it easier > for LLVM to perform this register promotion.I think that this is right. Thank you for the explanation!> I am not fully sure where you are heading at the moment. Do you have > identified a set of individual tasks/changes that you could implement?If I’m not mistaken, we decided that the pattern matching could be a good starting point. That’s why I’m planning to try to implement it in the near future. Implementation of tiling, interchanging and unrolling of specific loops based on the algorithm presented in [1] could be the next step. I could start working on the packing transformation, if the previous tasks are implemented. What do you think about it? Refs: [1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf -- Cheers, Roman Gareev.
Tobias Grosser via llvm-dev
2016-May-20 14:11 UTC
[llvm-dev] Determination of statements that contain only matrix multiplication
On 05/20/2016 03:07 PM, Roman Gareev wrote:> 2016-05-20 14:17 GMT+05:00 Tobias Grosser <tobias at grosser.es>: >>>>>> Maybe it could be a temporary solution. I think that if the checks are >>>>>> successfully passed and the basic block of the statement has exactly >>>>>> 14 instructions, the statement contains matrix multiplication and can >>>>>> be safely optimized with a generation of specific code, which takes >>>>>> into account information about usage of SIMD registers. >>>>> >>>> What kind of transformation do you have in mind that cannot be done in >>>> a general way? If you want to take register pressure into account, why >>>> not also for other stops? >>> >>> I wasn’t sure whether llvm is able to hoist loads into SIMD registers >>> and sink stores from them. However, yesterday I found out that in some >>> cases llvm can perform such transformations. >>> >>> For example, let’s consider the following code: >>> >>> Loop 1 for ic = 0, ..., m−1 in steps of mc >>> Loop 2 for pc = 0, ..., k−1 in steps of kc >>> A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac // Pack into Ac >>> Loop 3 for jc = 0, ..., n−1 in steps of nc >>> B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc // Pack into Bc >>> Loop 4 for ir = 0, ..., mc −1 in steps of mr >>> Loop 5 for jr = 0, ..., nc −1 in steps of nr >>> Loop 6 for pr = 0, ..., kc −1 in steps of 1 >>> Loop 7 for ir' = 0, ..., mr −1 in steps of 1 >>> Loop 8 for jr' = 0, ..., nr −1 in steps of 1 >>> C[ic * mr + i * mc + ir'][jc * nr + j * nc >>> + jr'] += Ac[mr * (pr + kc * ic) + ir'] * Bc[(pr + jc * kc) * nr + >>> jr']; >>> endfor >>> endfor >>> endfor >>> endfor >>> endfor >>> endfor >>> endfor >>> endfor >>> >>> To get closer to an implementation of the algorithm from [1] for >>> matrices stored in row-major order, we can unroll loop 7 and loop 8 >>> and perform vectorization with llvm (a corresponding code can be found >>> attached). According to the attached IR, llvm sinks and hoists stores >>> and loads related to matrix C. >>> >>> Nevertheless, llvm can’t do it, if, for example, we call prefetch >>> within loop 6 >> >> This is likely because LLVM does not understand the side-effects >> of the prefetch instruction. >> >>> or apply packing transformations in a way that is >>> similar to the one mentioned in [2] (corresponding implementations are >>> attached to the email). I haven’t found the reason yet. >> >> I assume this is also due to the unknown side-effects of the packing >> implementation. >> (LLVM does not understand that the packing function-call does not modify >> the C array). >> >> Both can likely be addressed by improving aliasing-modref information, >> adding relevant annotations or just hacking the code to ignore this >> issue. Otherwise, we can probably also generate IR that makes it easier >> for LLVM to perform this register promotion. > > I think that this is right. Thank you for the explanation! > >> I am not fully sure where you are heading at the moment. Do you have >> identified a set of individual tasks/changes that you could implement? > > If I’m not mistaken, we decided that the pattern matching could be a > good starting point. That’s why I’m planning to try to implement it in > the near future. Implementation of tiling, interchanging and unrolling > of specific loops based on the algorithm presented in [1] could be the > next step. I could start working on the packing transformation, if the > previous tasks are implemented. > > What do you think about it?Sounds great. Best, Tobias