Roman Gareev via llvm-dev
2016-May-16 17:52 UTC
[llvm-dev] Determination of statements that contain only matrix multiplication
Hi Tobias, could we use information about memory accesses of a SCoP statement and def-use chains to determine statements, which don’t contain matrix multiplication of the following form? for (int i = 0; i < Upper Bound1; i++) for (int j = 0; j < Upper Bound2; j++) for (int k = 0; k < Upper Bound3; j++) C[i][j] += A[i][k] * B[k][j] We could probably check that memory access relations have the following form: "accesses" : [ { "kind" : "read", "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_1[i0, i2] }" }, { "kind" : "read", "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_2[i2, i1] }" }, { "kind" : "read", "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_3[i0, i1] }" }, { "kind" : "write", "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_4[i0, i1] }" } ] and there are the following relations between access instructions of the memory accesses: AccessInst1 --- def-use --- \ AccessInst2 --- def-use --- fmul | def-use | AccessInst3 --- def-use --- fadd | def-use | store (AccessInst4) (fmul is a user of AccessInst1 and AccessInst2. fadd is a user of the fmul and AccessInst3. store (AccessInst4) is a user of fadd) Maybe it could be a temporary solution. I think that if the checks are successfully passed and the basic block of the statement has exactly 14 instructions, the statement contains matrix multiplication and can be safely optimized with a generation of specific code, which takes into account information about usage of SIMD registers. I have one more question. Are memory accesses of MemAccs from the ScopStmt class ordered by their sequence order? -- Cheers, Roman Gareev.
Michael Kruse via llvm-dev
2016-May-17 11:47 UTC
[llvm-dev] Determination of statements that contain only matrix multiplication
2016-05-16 19:52 GMT+02:00 Roman Gareev <gareevroman at gmail.com>:> Hi Tobias, > > could we use information about memory accesses of a SCoP statement and > def-use chains to determine statements, which don’t contain matrix > multiplication of the following form?Assuming s/don't/do you want to pattern-match gemm kernels inside larger scops.> for (int i = 0; i < Upper Bound1; i++) > for (int j = 0; j < Upper Bound2; j++) > for (int k = 0; k < Upper Bound3; j++) > C[i][j] += A[i][k] * B[k][j] > > We could probably check that memory access relations have the following form: > > "accesses" : [ > { > "kind" : "read", > "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_1[i0, i2] }" > }, > { > "kind" : "read", > "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_2[i2, i1] }" > }, > { > "kind" : "read", > "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_3[i0, i1] }" > }, > { > "kind" : "write", > "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_4[i0, i1] }" > } > ] > > and there are the following relations between access instructions of > the memory accesses: > > AccessInst1 --- def-use --- \ > AccessInst2 --- def-use --- fmul > | > def-use > | > AccessInst3 --- def-use --- fadd > | > def-use > | > store (AccessInst4) > > (fmul is a user of AccessInst1 and AccessInst2. fadd is a user of the > fmul and AccessInst3. store (AccessInst4) is a user of fadd)You'd also have to check which access are not there. E.g. the value of C[i][j] might be used in the inner loop for some other purpose as well, which would make some transformations illegal. Also notice that LICM will hoist the C[i][j] reduction out of the k-loop, i.e. the pattern is different. DeLICM (WIP) is intended to undo this again.> Maybe it could be a temporary solution. I think that if the checks are > successfully passed and the basic block of the statement has exactly > 14 instructions, the statement contains matrix multiplication and can > be safely optimized with a generation of specific code, which takes > into account information about usage of SIMD registers.What kind of transformation do you have in mind that cannot be done in a general way? If you want to take register pressure into account, why not also for other scops? Like the LoopIdiomPass, one could replace a detected gemm kernel my a call to a hand-optimized/external [sdcz]gemm function. However, I think the goal of Polly is to optimize such kernels by itself.> I have one more question. Are memory accesses of MemAccs from the > ScopStmt class ordered by their sequence order?Only the MK_Array accesses should be in order relative to each other. All scalar reads happen when entering a statement and all scalar writes at the end of it. Michael -- Tardyzentrismus verboten!
Tobias Grosser via llvm-dev
2016-May-17 14:37 UTC
[llvm-dev] Determination of statements that contain only matrix multiplication
On 05/17/2016 01:47 PM, Michael Kruse wrote:> 2016-05-16 19:52 GMT+02:00 Roman Gareev <gareevroman at gmail.com>: >> Hi Tobias, >> >> could we use information about memory accesses of a SCoP statement and >> def-use chains to determine statements, which don’t contain matrix >> multiplication of the following form? > > Assuming s/don't/do you want to pattern-match gemm kernels inside larger scops. > > >> for (int i = 0; i < Upper Bound1; i++) >> for (int j = 0; j < Upper Bound2; j++) >> for (int k = 0; k < Upper Bound3; j++) >> C[i][j] += A[i][k] * B[k][j] >> >> We could probably check that memory access relations have the following form: >> >> "accesses" : [ >> { >> "kind" : "read", >> "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_1[i0, i2] }" >> }, >> { >> "kind" : "read", >> "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_2[i2, i1] }" >> }, >> { >> "kind" : "read", >> "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_3[i0, i1] }" >> }, >> { >> "kind" : "write", >> "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_4[i0, i1] }" >> } >> ] >> >> and there are the following relations between access instructions of >> the memory accesses: >> >> AccessInst1 --- def-use --- \ >> AccessInst2 --- def-use --- fmul >> | >> def-use >> | >> AccessInst3 --- def-use --- fadd >> | >> def-use >> | >> store (AccessInst4) >> >> (fmul is a user of AccessInst1 and AccessInst2. fadd is a user of the >> fmul and AccessInst3. store (AccessInst4) is a user of fadd) > > You'd also have to check which access are not there. E.g. the value of > C[i][j] might be used in the inner loop for some other purpose as > well, which would make some transformations illegal. > > Also notice that LICM will hoist the C[i][j] reduction out of the > k-loop, i.e. the pattern is different. DeLICM (WIP) is intended to > undo this again.Very right.>> Maybe it could be a temporary solution. I think that if the checks are >> successfully passed and the basic block of the statement has exactly >> 14 instructions, the statement contains matrix multiplication and can >> be safely optimized with a generation of specific code, which takes >> into account information about usage of SIMD registers. > > What kind of transformation do you have in mind that cannot be done in > a general way? If you want to take register pressure into account, why > not also for other scops?Good questions.> Like the LoopIdiomPass, one could replace a detected gemm kernel my a > call to a hand-optimized/external [sdcz]gemm function. However, I > think the goal of Polly is to optimize such kernels by itself.Right. This does not seem to be too interesting. I am personally mostly interested in getting the pieces in place that allow us to get close-to-peak performance for gemm, but agree with Michael that we clearly want to generalize this in the future. However, going with a vertical implementation first might be a good idea (assuming the pattern matching is simple). In this case we could detect a gemm kernel and use this logic to get all important optimizations we need in place. Best, Tobias