Hi, I implemented a pattern matching of the dot product for arm64 and it seemed to work well for the basic case, i.e., class mulB<SDPatternOperator ldop> : PatFrag<(ops node:$Rn, node:$Rm, node:$offset), (mul (ldop (add node:$Rn, node:$offset)), (ldop (add node:$Rm, node:$offset)))>; class mulBz<SDPatternOperator ldop> : PatFrag<(ops node:$Rn, node:$Rm), (mul (ldop node:$Rn), (ldop node:$Rm))>; class DotProductI32<Instruction DOT, SDPatternOperator ldop> : Pat<(i32 (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 3)), (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 2)), (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 1)), (mulBz<ldop> GPR64sp:$Rn, GPR64sp:$Rm))))), (EXTRACT_SUBREG (i64 (DOT (DUPv2i32gpr WZR), (v8i8 (LD1Onev8b GPR64sp:$Rn)), (v8i8 (LD1Onev8b GPR64sp:$Rm)))), sub_32)>, Requires<[HasDotProd]>; def : DotProductI32<SDOTv8i8, sextloadi8>; def : DotProductI32<UDOTv8i8, zextloadi8>; Then when I extended it to 8 element vectors, the time spent by tblgen exploded: from under 7 seconds (on A-72) on the AArch64 td files and the above patch to more than half an hour when I decided to terminate the processes. Here are the additional def'pats that produce the exponential behavior: def VADDV_32 : OutPatFrag<(ops node:$R), (ADDPv2i32 node:$R, node:$R)>; class DotProduct2I32<Instruction DOT, SDPatternOperator ldop> : Pat<(i32 (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 7)), (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 6)), (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 5)), (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 4)), (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 3)), (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 2)), (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 1)), (mulBz<ldop> GPR64sp:$Rn, GPR64sp:$Rm))))))))), (EXTRACT_SUBREG (VADDV_32 (i64 (DOT (DUPv2i32gpr WZR), (v8i8 (LD1Onev8b GPR64sp:$Rn)), (v8i8 (LD1Onev8b GPR64sp:$Rm))))), sub_32)>, Requires<[HasDotProd]>; def : DotProduct2I32<SDOTv8i8, sextloadi8>; def : DotProduct2I32<UDOTv8i8, zextloadi8>; linux-perf profile for the first minute executing llvm-tblgen shows that most of the time is spent in isIsomorphicTo: 28.25% llvm-tblgen llvm-tblgen [.] llvm::TreePatternNode::isIsomorphicTo 21.62% llvm-tblgen llvm-tblgen [.] llvm::TypeSetByHwMode::operator= 15.25% llvm-tblgen libc-2.27.so [.] memcmp 14.61% llvm-tblgen llvm-tblgen [.] std::__shared_ptr<llvm::TreePatternNode, (__gnu_cxx::_Lock_policy)2>::__shared_ptr In call-graph mode `perf -g` points to GenerateVariants that generates most of the calls to isIsomorphicTo: + 100.00% 0.00% llvm-tblgen llvm-tblgen [.] main + 100.00% 0.00% llvm-tblgen llvm-tblgen [.] llvm::TableGenMain + 99.85% 0.00% llvm-tblgen llvm-tblgen [.] (anonymous namespace)::LLVMTableGenMain + 99.85% 0.00% llvm-tblgen llvm-tblgen [.] llvm::EmitDAGISel + 99.85% 0.00% llvm-tblgen llvm-tblgen [.] llvm::CodeGenDAGPatterns::CodeGenDAGPatterns + 99.46% 98.01% llvm-tblgen llvm-tblgen [.] llvm::CodeGenDAGPatterns::GenerateVariants 0.38% 0.00% llvm-tblgen llvm-tblgen [.] GenerateVariantsOf Sebastian
add and mul are both commutable and tablegen tries to create patterns with every possible permutation of commutable operations. ~Craig On Mon, Sep 9, 2019 at 9:08 PM Sebastian Pop via llvm-dev < llvm-dev at lists.llvm.org> wrote:> Hi, > I implemented a pattern matching of the dot product for arm64 > and it seemed to work well for the basic case, i.e., > > class mulB<SDPatternOperator ldop> : > PatFrag<(ops node:$Rn, node:$Rm, node:$offset), > (mul (ldop (add node:$Rn, node:$offset)), > (ldop (add node:$Rm, node:$offset)))>; > class mulBz<SDPatternOperator ldop> : > PatFrag<(ops node:$Rn, node:$Rm), > (mul (ldop node:$Rn), (ldop node:$Rm))>; > > class DotProductI32<Instruction DOT, SDPatternOperator ldop> : > Pat<(i32 (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 3)), > (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 2)), > (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 1)), > (mulBz<ldop> GPR64sp:$Rn, GPR64sp:$Rm))))), > (EXTRACT_SUBREG > (i64 (DOT (DUPv2i32gpr WZR), > (v8i8 (LD1Onev8b GPR64sp:$Rn)), > (v8i8 (LD1Onev8b GPR64sp:$Rm)))), > sub_32)>, Requires<[HasDotProd]>; > > def : DotProductI32<SDOTv8i8, sextloadi8>; > def : DotProductI32<UDOTv8i8, zextloadi8>; > > Then when I extended it to 8 element vectors, the time spent by tblgen > exploded: > from under 7 seconds (on A-72) on the AArch64 td files and the above patch > to more than half an hour when I decided to terminate the processes. > > Here are the additional def'pats that produce the exponential behavior: > > def VADDV_32 : OutPatFrag<(ops node:$R), (ADDPv2i32 node:$R, node:$R)>; > > class DotProduct2I32<Instruction DOT, SDPatternOperator ldop> : > Pat<(i32 (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 7)), > (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 6)), > (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 5)), > (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 4)), > (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 3)), > (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 2)), > (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 1)), > (mulBz<ldop> GPR64sp:$Rn, GPR64sp:$Rm))))))))), > (EXTRACT_SUBREG > (VADDV_32 > (i64 (DOT (DUPv2i32gpr WZR), > (v8i8 (LD1Onev8b GPR64sp:$Rn)), > (v8i8 (LD1Onev8b GPR64sp:$Rm))))), > sub_32)>, Requires<[HasDotProd]>; > > def : DotProduct2I32<SDOTv8i8, sextloadi8>; > def : DotProduct2I32<UDOTv8i8, zextloadi8>; > > linux-perf profile for the first minute executing llvm-tblgen shows > that most of the time is spent in isIsomorphicTo: > > 28.25% llvm-tblgen llvm-tblgen [.] > llvm::TreePatternNode::isIsomorphicTo > 21.62% llvm-tblgen llvm-tblgen [.] > llvm::TypeSetByHwMode::operator=> 15.25% llvm-tblgen libc-2.27.so [.] memcmp > 14.61% llvm-tblgen llvm-tblgen [.] > std::__shared_ptr<llvm::TreePatternNode, > (__gnu_cxx::_Lock_policy)2>::__shared_ptr > > In call-graph mode `perf -g` points to GenerateVariants that generates > most of the calls to isIsomorphicTo: > > + 100.00% 0.00% llvm-tblgen llvm-tblgen [.] main > + 100.00% 0.00% llvm-tblgen llvm-tblgen [.] > llvm::TableGenMain > + 99.85% 0.00% llvm-tblgen llvm-tblgen [.] (anonymous > namespace)::LLVMTableGenMain > + 99.85% 0.00% llvm-tblgen llvm-tblgen [.] > llvm::EmitDAGISel > + 99.85% 0.00% llvm-tblgen llvm-tblgen [.] > llvm::CodeGenDAGPatterns::CodeGenDAGPatterns > + 99.46% 98.01% llvm-tblgen llvm-tblgen [.] > llvm::CodeGenDAGPatterns::GenerateVariants > 0.38% 0.00% llvm-tblgen llvm-tblgen [.] > GenerateVariantsOf > > Sebastian > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190909/123f3b27/attachment.html>
Thanks Craig! I looked at the implementation of tablegen and it is not easy to come up with some limiting factor. The best I have found is to compute the height of the TreePatternNode and do not generate permutations if the height is larger than 4 or 5. Would there be interest in cleaning up that patch and submitting for review? I have also found out that (on the codes that I care for) I don't need a long def-pat matching dot products of 8 by 8 bytes, as the SLP vectorizer transforms that code into vector add/mul that only needs a simple pattern. Thanks, Sebastian On Tue, Sep 10, 2019 at 12:43 AM Craig Topper <craig.topper at gmail.com> wrote:> > add and mul are both commutable and tablegen tries to create patterns with every possible permutation of commutable operations. > > ~Craig > > > On Mon, Sep 9, 2019 at 9:08 PM Sebastian Pop via llvm-dev <llvm-dev at lists.llvm.org> wrote: >> >> Hi, >> I implemented a pattern matching of the dot product for arm64 >> and it seemed to work well for the basic case, i.e., >> >> class mulB<SDPatternOperator ldop> : >> PatFrag<(ops node:$Rn, node:$Rm, node:$offset), >> (mul (ldop (add node:$Rn, node:$offset)), >> (ldop (add node:$Rm, node:$offset)))>; >> class mulBz<SDPatternOperator ldop> : >> PatFrag<(ops node:$Rn, node:$Rm), >> (mul (ldop node:$Rn), (ldop node:$Rm))>; >> >> class DotProductI32<Instruction DOT, SDPatternOperator ldop> : >> Pat<(i32 (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 3)), >> (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 2)), >> (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 1)), >> (mulBz<ldop> GPR64sp:$Rn, GPR64sp:$Rm))))), >> (EXTRACT_SUBREG >> (i64 (DOT (DUPv2i32gpr WZR), >> (v8i8 (LD1Onev8b GPR64sp:$Rn)), >> (v8i8 (LD1Onev8b GPR64sp:$Rm)))), >> sub_32)>, Requires<[HasDotProd]>; >> >> def : DotProductI32<SDOTv8i8, sextloadi8>; >> def : DotProductI32<UDOTv8i8, zextloadi8>; >> >> Then when I extended it to 8 element vectors, the time spent by tblgen exploded: >> from under 7 seconds (on A-72) on the AArch64 td files and the above patch >> to more than half an hour when I decided to terminate the processes. >> >> Here are the additional def'pats that produce the exponential behavior: >> >> def VADDV_32 : OutPatFrag<(ops node:$R), (ADDPv2i32 node:$R, node:$R)>; >> >> class DotProduct2I32<Instruction DOT, SDPatternOperator ldop> : >> Pat<(i32 (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 7)), >> (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 6)), >> (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 5)), >> (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 4)), >> (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 3)), >> (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 2)), >> (add (mulB<ldop> GPR64sp:$Rn, GPR64sp:$Rm, (i64 1)), >> (mulBz<ldop> GPR64sp:$Rn, GPR64sp:$Rm))))))))), >> (EXTRACT_SUBREG >> (VADDV_32 >> (i64 (DOT (DUPv2i32gpr WZR), >> (v8i8 (LD1Onev8b GPR64sp:$Rn)), >> (v8i8 (LD1Onev8b GPR64sp:$Rm))))), >> sub_32)>, Requires<[HasDotProd]>; >> >> def : DotProduct2I32<SDOTv8i8, sextloadi8>; >> def : DotProduct2I32<UDOTv8i8, zextloadi8>; >> >> linux-perf profile for the first minute executing llvm-tblgen shows >> that most of the time is spent in isIsomorphicTo: >> >> 28.25% llvm-tblgen llvm-tblgen [.] >> llvm::TreePatternNode::isIsomorphicTo >> 21.62% llvm-tblgen llvm-tblgen [.] >> llvm::TypeSetByHwMode::operator=>> 15.25% llvm-tblgen libc-2.27.so [.] memcmp >> 14.61% llvm-tblgen llvm-tblgen [.] >> std::__shared_ptr<llvm::TreePatternNode, >> (__gnu_cxx::_Lock_policy)2>::__shared_ptr >> >> In call-graph mode `perf -g` points to GenerateVariants that generates >> most of the calls to isIsomorphicTo: >> >> + 100.00% 0.00% llvm-tblgen llvm-tblgen [.] main >> + 100.00% 0.00% llvm-tblgen llvm-tblgen [.] llvm::TableGenMain >> + 99.85% 0.00% llvm-tblgen llvm-tblgen [.] (anonymous >> namespace)::LLVMTableGenMain >> + 99.85% 0.00% llvm-tblgen llvm-tblgen [.] llvm::EmitDAGISel >> + 99.85% 0.00% llvm-tblgen llvm-tblgen [.] >> llvm::CodeGenDAGPatterns::CodeGenDAGPatterns >> + 99.46% 98.01% llvm-tblgen llvm-tblgen [.] >> llvm::CodeGenDAGPatterns::GenerateVariants >> 0.38% 0.00% llvm-tblgen llvm-tblgen [.] GenerateVariantsOf >> >> Sebastian >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Maybe Matching Threads
- Strange regalloc behaviour: one more available register causes much worse allocation
- Strange regalloc behaviour: one more available register causes much worse allocation
- [LLVMdev] x86 backend assembly - mov esp->reg
- [LLVMdev] Pseudo load and store instructions for AArch64
- [LLVMdev] [PATCH] Add new phase to legalization to handle vector operations