Tobias Grosser via llvm-dev
2016-May-17 14:37 UTC
[llvm-dev] Determination of statements that contain only matrix multiplication
On 05/17/2016 01:47 PM, Michael Kruse wrote:
> 2016-05-16 19:52 GMT+02:00 Roman Gareev <gareevroman at gmail.com>:
>> Hi Tobias,
>>
>> could we use information about memory accesses of a SCoP statement and
>> def-use chains to determine statements, which don’t contain matrix
>> multiplication of the following form?
>
> Assuming s/don't/do/, you want to pattern-match gemm kernels inside
> larger scops.
>
>> for (int i = 0; i < UpperBound1; i++)
>>   for (int j = 0; j < UpperBound2; j++)
>>     for (int k = 0; k < UpperBound3; k++)
>>       C[i][j] += A[i][k] * B[k][j]
>>
>> We could probably check that memory access relations have the
>> following form:
>>
>> "accesses" : [
>>    {
>>       "kind" : "read",
>>       "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_1[i0, i2] }"
>>    },
>>    {
>>       "kind" : "read",
>>       "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_2[i2, i1] }"
>>    },
>>    {
>>       "kind" : "read",
>>       "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_3[i0, i1] }"
>>    },
>>    {
>>       "kind" : "write",
>>       "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_4[i0, i1] }"
>>    }
>> ]
>>
>> and that there are the following relations between the access
>> instructions of the memory accesses:
>>
>> AccessInst1 --- def-use --- \
>> AccessInst2 --- def-use --- fmul
>>                               |
>>                            def-use
>>                               |
>> AccessInst3 --- def-use --- fadd
>>                               |
>>                            def-use
>>                               |
>>                     store (AccessInst4)
>>
>> (fmul is a user of AccessInst1 and AccessInst2. fadd is a user of the
>> fmul and AccessInst3. store (AccessInst4) is a user of fadd.)
>
> You'd also have to check which accesses are not there. E.g. the value
> of C[i][j] might be used in the inner loop for some other purpose as
> well, which would make some transformations illegal.
>
> Also notice that LICM will hoist the C[i][j] reduction out of the
> k-loop, i.e. the pattern is different. DeLICM (WIP) is intended to
> undo this again.

Very right.

>> Maybe it could be a temporary solution. I think that if the checks are
>> successfully passed and the basic block of the statement has exactly
>> 14 instructions, the statement contains matrix multiplication and can
>> be safely optimized by generating specific code, which takes into
>> account information about the usage of SIMD registers.
>
> What kind of transformation do you have in mind that cannot be done in
> a general way? If you want to take register pressure into account, why
> not also for other scops?

Good questions.

> Like the LoopIdiomPass, one could replace a detected gemm kernel by a
> call to a hand-optimized/external [sdcz]gemm function. However, I
> think the goal of Polly is to optimize such kernels by itself.

Right. This does not seem to be too interesting.

I am personally mostly interested in getting the pieces in place that
allow us to get close-to-peak performance for gemm, but I agree with
Michael that we clearly want to generalize this in the future.

However, going with a vertical implementation first might be a good
idea (assuming the pattern matching is simple). In this case we could
detect a gemm kernel and use this logic to get all the important
optimizations we need in place.

Best,
Tobias
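For concreteness, the def-use part of such a check can be expressed with
plain LLVM APIs. The sketch below is a hypothetical helper, not Polly's
actual matcher; the access-relation checks from the JSON above are
assumed to happen separately:

    #include "llvm/IR/Instructions.h"
    #include "llvm/Support/Casting.h"
    #include <utility>
    using namespace llvm;

    // Hypothetical helper: match the chain from the diagram above,
    // i.e. store(fadd(fmul(load, load), load)).
    static bool isGemmLikeStore(StoreInst *Store) {
      auto *Add = dyn_cast<BinaryOperator>(Store->getValueOperand());
      if (!Add || Add->getOpcode() != Instruction::FAdd)
        return false;
      Value *Mul = Add->getOperand(0);
      Value *CVal = Add->getOperand(1);
      auto *MulOp = dyn_cast<BinaryOperator>(Mul);
      if (!MulOp || MulOp->getOpcode() != Instruction::FMul) {
        std::swap(Mul, CVal); // fadd is commutative; try the other order.
        MulOp = dyn_cast<BinaryOperator>(Mul);
      }
      if (!MulOp || MulOp->getOpcode() != Instruction::FMul)
        return false;
      // Reject extra users of the intermediate values (part of Michael's
      // point about accesses that must *not* be there).
      if (!MulOp->hasOneUse() || !Add->hasOneUse())
        return false;
      // AccessInst1-3 in the diagram: both fmul operands and the remaining
      // fadd operand must be loads; the store itself is AccessInst4.
      return isa<LoadInst>(MulOp->getOperand(0)) &&
             isa<LoadInst>(MulOp->getOperand(1)) &&
             isa<LoadInst>(CVal);
    }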
Roman Gareev via llvm-dev
2016-May-19 14:09 UTC
[llvm-dev] Determination of statements that contain only matrix multiplication
Thank you for the comments and answers!

2016-05-17 19:37 GMT+05:00 Tobias Grosser <tobias at grosser.es>:
> On 05/17/2016 01:47 PM, Michael Kruse wrote:
>> 2016-05-16 19:52 GMT+02:00 Roman Gareev <gareevroman at gmail.com>:
>>> Hi Tobias,
>>>
>>> could we use information about memory accesses of a SCoP statement and
>>> def-use chains to determine statements, which don’t contain matrix
>>> multiplication of the following form?
>>
>> Assuming s/don't/do/, you want to pattern-match gemm kernels inside
>> larger scops.
>>
>>> for (int i = 0; i < UpperBound1; i++)
>>>   for (int j = 0; j < UpperBound2; j++)
>>>     for (int k = 0; k < UpperBound3; k++)
>>>       C[i][j] += A[i][k] * B[k][j]
>>>
>>> We could probably check that memory access relations have the
>>> following form:
>>>
>>> "accesses" : [
>>>    {
>>>       "kind" : "read",
>>>       "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_1[i0, i2] }"
>>>    },
>>>    {
>>>       "kind" : "read",
>>>       "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_2[i2, i1] }"
>>>    },
>>>    {
>>>       "kind" : "read",
>>>       "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_3[i0, i1] }"
>>>    },
>>>    {
>>>       "kind" : "write",
>>>       "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_4[i0, i1] }"
>>>    }
>>> ]
>>>
>>> and that there are the following relations between the access
>>> instructions of the memory accesses:
>>>
>>> AccessInst1 --- def-use --- \
>>> AccessInst2 --- def-use --- fmul
>>>                               |
>>>                            def-use
>>>                               |
>>> AccessInst3 --- def-use --- fadd
>>>                               |
>>>                            def-use
>>>                               |
>>>                     store (AccessInst4)
>>>
>>> (fmul is a user of AccessInst1 and AccessInst2. fadd is a user of the
>>> fmul and AccessInst3. store (AccessInst4) is a user of fadd.)
>>
>> You'd also have to check which accesses are not there. E.g. the value
>> of C[i][j] might be used in the inner loop for some other purpose as
>> well, which would make some transformations illegal.
>>
>> Also notice that LICM will hoist the C[i][j] reduction out of the
>> k-loop, i.e. the pattern is different. DeLICM (WIP) is intended to
>> undo this again.
>
> Very right.

I agree. However, it could probably still be used if we were able to
apply only transformations that can be done in a general way.

>>> Maybe it could be a temporary solution. I think that if the checks are
>>> successfully passed and the basic block of the statement has exactly
>>> 14 instructions, the statement contains matrix multiplication and can
>>> be safely optimized by generating specific code, which takes into
>>> account information about the usage of SIMD registers.
>>
>> What kind of transformation do you have in mind that cannot be done in
>> a general way? If you want to take register pressure into account, why
>> not also for other scops?

I wasn’t sure whether LLVM is able to hoist loads into SIMD registers
and sink stores from them. However, yesterday I found out that in some
cases LLVM can perform such transformations.
For example, let’s consider the following code:

Loop 1  for ic = 0, ..., m − 1 in steps of mc
Loop 2    for pc = 0, ..., k − 1 in steps of kc
            A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac    // Pack into Ac
Loop 3      for jc = 0, ..., n − 1 in steps of nc
              B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc  // Pack into Bc
Loop 4        for ir = 0, ..., mc − 1 in steps of mr
Loop 5          for jr = 0, ..., nc − 1 in steps of nr
Loop 6            for pr = 0, ..., kc − 1 in steps of 1
Loop 7              for ir' = 0, ..., mr − 1 in steps of 1
Loop 8                for jr' = 0, ..., nr − 1 in steps of 1
                        C[ic * mc + ir * mr + ir'][jc * nc + jr * nr + jr'] +=
                            Ac[mr * (pr + kc * ic) + ir'] * Bc[(pr + jc * kc) * nr + jr'];
                      endfor
                    endfor
                  endfor
                endfor
              endfor
            endfor
          endfor
        endfor

To get closer to an implementation of the algorithm from [1] for
matrices stored in row-major order, we can unroll loop 7 and loop 8 and
perform vectorization with LLVM (the corresponding code is attached).
According to the attached IR, LLVM sinks and hoists the stores and loads
related to matrix C.

Nevertheless, LLVM can’t do it if, for example, we call prefetch within
loop 6 or apply the packing transformations in a way similar to the one
mentioned in [2] (corresponding implementations are attached to this
email). I haven’t found the reason yet.

Refs:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[2] - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm

>> Like the LoopIdiomPass, one could replace a detected gemm kernel by a
>> call to a hand-optimized/external [sdcz]gemm function. However, I
>> think the goal of Polly is to optimize such kernels by itself.
>
> Right. This does not seem to be too interesting.
>
> I am personally mostly interested in getting the pieces in place that
> allow us to get close-to-peak performance for gemm, but I agree with
> Michael that we clearly want to generalize this in the future.
>
> However, going with a vertical implementation first might be a good
> idea (assuming the pattern matching is simple). In this case we could
> detect a gemm kernel and use this logic to get all the important
> optimizations we need in place.

Yes, it is planned for the future. Sorry, I should probably have
mentioned it in the proposal.

--
Cheers, Roman Gareev.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: gemm_SIMD.c
Type: text/x-csrc
Size: 8387 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160519/ce7cd76c/attachment.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: IR
Type: application/octet-stream
Size: 17577 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160519/ce7cd76c/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gemm_SIMD_with_prefetch.c
Type: text/x-csrc
Size: 8492 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160519/ce7cd76c/attachment-0001.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gemm_C_SIMD_another_packing.c
Type: text/x-csrc
Size: 7900 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160519/ce7cd76c/attachment-0002.c>
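To make the hoist/sink behavior concrete, here is a minimal sketch (not
the attached code) of what loops 6-8 boil down to for one micro-tile, in
a shape where LLVM can promote the C accumulators to SIMD registers
across the pr-loop; mr = 2, nr = 4 and the packed-panel layout are
illustrative assumptions:

    // Illustrative micro-kernel with mr = 2, nr = 4. The accumulators
    // carry across the pr-loop, so LLVM can keep them in (vector)
    // registers and sink the stores to C past the loop, as observed in
    // the attached IR.
    enum { MR = 2, NR = 4 };

    void micro_kernel(int kc, const double *Ac, const double *Bc,
                      double *C, long ldc) {
      double c[MR][NR] = {};           // register-promotable accumulators
      for (int pr = 0; pr < kc; ++pr)
        for (int i = 0; i < MR; ++i)   // loop 7: known trip count, unrollable
          for (int j = 0; j < NR; ++j) // loop 8: vectorizable
            c[i][j] += Ac[pr * MR + i] * Bc[pr * NR + j];
      for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
          C[i * ldc + j] += c[i][j];   // one load/store of C per element
    }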
Michael Kruse via llvm-dev
2016-May-19 14:58 UTC
[llvm-dev] Determination of statements that contain only matrix multiplication
Thank you for the elaborate explanations, although I don't have time to
go through all of them.

2016-05-19 16:09 GMT+02:00 Roman Gareev <gareevroman at gmail.com>:
> To get closer to an implementation of the algorithm from [1] for
> matrices stored in row-major order, we can unroll loop 7 and loop 8 and
> perform vectorization with LLVM (the corresponding code is attached).
> According to the attached IR, LLVM sinks and hoists the stores and
> loads related to matrix C.
>
> Nevertheless, LLVM can’t do it if, for example, we call prefetch within
> loop 6 or apply the packing transformations in a way similar to the one
> mentioned in [2] (corresponding implementations are attached to this
> email). I haven’t found the reason yet.
>
> Refs:
>
> [1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
> [2] - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm

It would be great if you find a general way to handle such concerns,
maybe even by modifying LLVM's passes. Since Tobias is your advisor, you
might discuss the details with him, i.e. for which cases it is worth
handling things in a more general way vs. only in detected gemm kernels.

AFAIK, prefetch instructions are not automatically inserted anywhere in
LLVM. It would be nice to insert them X instructions/accesses in
advance, where X is determined somehow. For gemm only, one could do some
measurements for different architectures, but we probably won't get
around having X derived from some command-line argument in either case.

Michael

--
Tardyzentrismus verboten!
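As a rough illustration of the "X in advance" idea, a software prefetch
at the source level could look like the sketch below; PF_DIST is a
hypothetical tuning knob standing in for the X above, which a pass would
derive per target or take from a command-line option:

    #include <cstddef>

    #ifndef PF_DIST
    #define PF_DIST 16 // elements ahead; purely illustrative
    #endif

    // Stream through y, prefetching a fixed distance ahead of the
    // current access so the cache line arrives before it is needed.
    void axpy(double *x, const double *y, std::size_t n, double a) {
      for (std::size_t i = 0; i < n; ++i) {
        if (i + PF_DIST < n)
          __builtin_prefetch(&y[i + PF_DIST], /*rw=*/0, /*locality=*/1);
        x[i] += a * y[i];
      }
    }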
Roman Gareev via llvm-dev
2016-May-20 13:05 UTC
[llvm-dev] Determination of statements that contain only matrix multiplication
2016-05-19 21:45 GMT+05:00 Albert Cohen <albert.h.cohen at gmail.com>:
> One short note. I would advise against spending time on prefetching for
> x86. Recent hardware prefetchers are amazingly good at strided accesses
> in single-threaded code. Caution: this is not based on
> objective/published data, but on personal experience.
>
> There are open challenges in multiprocessor prefetching, even for
> regularly strided data, but these are probably too ambitious to be
> tackled effectively in the time frame of a GSoC. There are lots of
> papers on this, however.
>
> Now, if you are targeting lower-power processors, including most ARM
> v7a/v8 implementations, prefetching may be much more important. There
> are far fewer publications on BLAS optimizations for ARM, but they
> exist. Let me know if you need pointers.
>
> In any case, if you have to go beyond loop transformations, unrolling,
> and register blocking, I would advise looking into data layout
> transformations, compacting strided blocks, which is one of the key
> optimizations in [2], or transposition. These primarily help with TLB
> misses, enabling the full power of 3D tiling (they have less impact on
> 2D tiling). And by the way, they will in turn improve the effectiveness
> of hardware prefetching.

Thank you very much for the advice! I could probably try to avoid using
software prefetching in the project, if Tobias doesn’t object. My
understanding is that prefetching isn’t used explicitly in [1] and,
according to [2], in some cases 90% of the turbo boost peak of the
processor can be attained without it.

I started to consider prefetching because it’s used in implementations
of gemm micro-kernels in the BLIS framework [3]. If I’m not mistaken,
it’s applied to make sure that micro-panel Br is loaded after
micro-panel Ar (as required in [1], p. 11). For example, using it helps
to reduce the execution time of the attached implementation.

Refs:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[2] - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm
[3] - https://github.com/flame/blis/blob/master/kernels/x86_64/sandybridge/3/bli_gemm_int_d8x4.c

--
Cheers, Roman Gareev.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: gemm_C_SIMD.c
Type: text/x-csrc
Size: 5697 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160520/16edb8dc/attachment.c>
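For reference, the BLIS-style placement described above amounts to
something like the following sketch (illustrative only, not the BLIS
kernel itself): touch the first cache line of each packed micro-panel up
front, so that Br's stream is established right after Ar's, and let the
hardware prefetcher follow the unit strides:

    // Illustrative only, not the BLIS kernel: prefetch the first line of
    // each packed micro-panel before streaming through it; hardware
    // prefetch picks up the subsequent lines.
    enum { MR = 2, NR = 4 };

    void micro_kernel_pf(int kc, const double *Ac, const double *Bc,
                         double *C, long ldc) {
      __builtin_prefetch(Ac, /*rw=*/0, /*locality=*/3); // first line of Ar
      __builtin_prefetch(Bc, /*rw=*/0, /*locality=*/3); // then first line of Br
      double c[MR][NR] = {};
      for (int pr = 0; pr < kc; ++pr)
        for (int i = 0; i < MR; ++i)
          for (int j = 0; j < NR; ++j)
            c[i][j] += Ac[pr * MR + i] * Bc[pr * NR + j];
      for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
          C[i * ldc + j] += c[i][j];
    }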
Albert Cohen via llvm-dev
2016-May-28 14:48 UTC
[llvm-dev] Determination of statements that contain only matrix multiplication
Sorry for not responding earlier.

On 05/20/2016 03:05 PM, Roman Gareev wrote:
> Thank you very much for the advice! I could probably try to avoid using
> software prefetching in the project, if Tobias doesn’t object. My
> understanding is that prefetching isn’t used explicitly in [1] and,
> according to [2], in some cases 90% of the turbo boost peak of the
> processor can be attained without it.

Too many negations :-) I'm not sure I followed exactly what you wanted
to say, but I understand that this is not the priority, since you can
get 90% of the performance without worrying about prefetching.

> I started to consider prefetching because it’s used in implementations
> of gemm micro-kernels in the BLIS framework [3]. If I’m not mistaken,
> it’s applied to make sure that micro-panel Br is loaded after
> micro-panel Ar (as required in [1], p. 11). For example, using it helps
> to reduce the execution time of the attached implementation.

Interesting. The BLIS implementation prefetches only the first cache
line before traversing a given interval of memory. This clearly confirms
that the implementation relies on hardware prefetching to fetch the
subsequent lines. This makes a lot of sense.

Yet, surprisingly, the BLIS implementation does not attempt to
anticipate the fetch: it schedules the prefetch instruction right before
the first load of a given interval.

> Refs:
>
> [1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
> [2] - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm
> [3] - https://github.com/flame/blis/blob/master/kernels/x86_64/sandybridge/3/bli_gemm_int_d8x4.c
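An anticipated variant, i.e. the alternative to what BLIS does as
described above, would issue the prefetch for the next interval while
still working on the current one. A minimal sketch, with a hypothetical
packed-panel layout of KC * MR doubles per micro-panel:

    // Sketch: anticipating the fetch rather than prefetching right
    // before the first load. While streaming micro-panel p, touch the
    // first cache line of micro-panel p + 1. Layout is illustrative.
    enum { MR = 2, KC = 256 };

    void process_panels(const double *Ac, int npanels, double *sink) {
      for (int p = 0; p < npanels; ++p) {
        if (p + 1 < npanels)
          __builtin_prefetch(&Ac[(p + 1) * KC * MR], /*rw=*/0, /*locality=*/1);
        double s = 0.0;
        for (int q = 0; q < KC * MR; ++q) // stream the current panel
          s += Ac[p * KC * MR + q];
        sink[p] = s;
      }
    }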