Roman Lebedev via llvm-dev
2021-Apr-14 12:44 UTC
[llvm-dev] Does middle-end pass need to consider some special type when doing optimization? Or letting back-end to revert the optimization accordingly?
I think I commented about this already, but isn't it a problem that you will still be doing non-sequential loads/stores via plain load/store IR instructions? If they would just natively take the underlying <256 x i32> or whatever, will you even need all this x86_amx special handling, and x86_amx itself?

Roman

On Wed, Apr 14, 2021 at 3:39 PM Luo, Yuanke via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Hi,
>
> I discussed a solution with Florian at https://reviews.llvm.org/D99152. Florian suggested introducing a specific intrinsic to replace the bitcast in the front-end; the backend then needs extra effort to optimize or eliminate the intrinsic. This idea looks good to me. Here is my plan:
>
> 1. Specify x86_amx in LangRef and verify the IR. Patches were uploaded at https://reviews.llvm.org/D100032 and https://reviews.llvm.org/D100472.
> 2. Add an llvm.x86.tile.cast intrinsic in LLVM.
> 3. Optimize some of the llvm.x86.tile.cast code as bitcast does, and transform llvm.x86.tile.cast to an AMX intrinsic if it can't be eliminated.
> 4. After the above three items are finished, replace bitcast with llvm.x86.tile.cast in the front-end when generating IR for the AMX builtins.
> 5. After some time for stabilization, remove the bitcast transform code from LLVM.
>
> Thanks
> Yuanke
>
> From: Florian Hahn <florian_hahn at apple.com>
> Sent: Tuesday, March 23, 2021 6:16 PM
> To: Luo, Yuanke <yuanke.luo at intel.com>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>; Zhang, Xiang1 <xiang1.zhang at intel.com>; James Y Knight <jyknight at google.com>
> Subject: Re: [llvm-dev] Does middle-end pass need to consider some special type when doing optimization? Or letting back-end to revert the optimization accordingly?
>
> On Mar 23, 2021, at 08:21, Luo, Yuanke <yuanke.luo at intel.com> wrote:
>
> > I prototyped approach 1 at https://reviews.llvm.org/D99152, and I realized that sometimes the bitcast optimization in the middle-end is helpful. For the test case of inner_product(), we need extra effort to eliminate llvm.x86.vector.amx.cast.x86amx.v256i32 by ourselves.
>
> I think that's expected; you might need to add some optimizations for the conversion intrinsic. But that can easily be limited to the AMX-specific passes, and all existing LLVM transformations should remain correct without changes.
>
> Cheers,
> Florian
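For concreteness, Roman's suggestion would amount to IR along these lines. This is a hypothetical sketch: no <256 x i32> overloads of the AMX intrinsics exist, and the @llvm.x86.tdpbssd.v256i32 name below is invented purely for illustration.

  ; Hypothetical: the AMX dot-product intrinsic takes the plain vector
  ; type directly, so no x86_amx values and no casts appear in the IR.
  %a = load <256 x i32>, <256 x i32>* %pa, align 64
  %b = load <256 x i32>, <256 x i32>* %pb, align 64
  %c = load <256 x i32>, <256 x i32>* %pc, align 64
  %d = call <256 x i32> @llvm.x86.tdpbssd.v256i32(i16 16, i16 16, i16 16, <256 x i32> %c, <256 x i32> %a, <256 x i32> %b)
  store <256 x i32> %d, <256 x i32>* %pc, align 64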
Luo, Yuanke via llvm-dev
2021-Apr-14 13:11 UTC
[llvm-dev] Does middle-end pass need to consider some special type when doing optimization? Or letting back-end to revert the optimization accordingly?
Hi Roman,

I don't know if I understand your question correctly. If we specify the stride as a particular value, the tile load/store can be sequential. For example, suppose we have an int32 array[16][16]: when the stride is specified as 64 bytes, the tile load is sequential. If the array is int32 array[32][32] and the stride is 128 bytes, then the tile load is not sequential.

Take the code below as an example. The special x86_amx handling is "@llvm.tile.cast"; we can have a separate pass to handle "@llvm.tile.cast".

  ; the tile operands are loaded as plain vectors and cast to x86_amx
  %10 = load <256 x i32>, <256 x i32>* %3, align 64, !dbg !36
  %11 = call x86_amx @llvm.tile.cast(<256 x i32> %10)
  %12 = load <256 x i32>, <256 x i32>* %1, align 64, !dbg !38
  %13 = call x86_amx @llvm.tile.cast(<256 x i32> %12)
  %14 = load <256 x i32>, <256 x i32>* %2, align 64, !dbg !39
  %15 = call x86_amx @llvm.tile.cast(<256 x i32> %14)
  ; the tile dot-product works on x86_amx; the result is cast back to a
  ; plain vector before the store
  %16 = call x86_amx @llvm.x86.tdpbssd.internal(i16 16, i16 16, i16 16, x86_amx %11, x86_amx %13, x86_amx %15), !dbg !37
  %17 = call <256 x i32> @llvm.tile.cast(x86_amx %16)
  store <256 x i32> %17, <256 x i32>* %3, align 64, !dbg !40

Thanks
Yuanke
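To put numbers on Yuanke's stride example: a 16x16 tile of int32 is 16 rows of 64 bytes, so it occupies one contiguous 1 KiB block exactly when the parent matrix's row stride is also 64 bytes. Below is a rough sketch using the existing tile-load intrinsic; the i16 shape operands are illustrative, and only the base/stride operands matter for this point.

  ; int32 array[16][16]: tile rows are 64 bytes apart, matching the
  ; tile's own row width, so the load reads one contiguous 1 KiB block.
  %seq = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* %a, i64 64)

  ; int32 array[32][32]: the same 16x16 tile now has rows 128 bytes
  ; apart, so each 64-byte row is followed by a 64-byte gap and the
  ; load is not sequential.
  %gap = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* %b, i64 128)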
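As for the separate pass handling "@llvm.tile.cast": the most basic fold such a pass would need, mirroring what instcombine gets for free with bitcast, is cancelling a round trip. The sketch below is an assumption about the planned pass, not committed code.

  %t = call x86_amx @llvm.tile.cast(<256 x i32> %v)
  %w = call <256 x i32> @llvm.tile.cast(x86_amx %t)
  ; fold: replace all uses of %w with %v; both casts then become dead
  ; and can be erased, just like a cancelling bitcast pair.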