Luo, Yuanke via llvm-dev
2021-Mar-23 08:21 UTC
[llvm-dev] Does middle-end pass need to consider some special type when doing optimization? Or letting back-end to revert the optimization accordingly?
I prototyped approach 1 at https://reviews.llvm.org/D99152 and I realized that
sometimes the bitcast optimization in the middle-end is helpful. For the test case
of inner_product(), we need extra effort to eliminate
llvm.x86.vector.amx.cast.x86amx.v256i32 ourselves.
Thanks
Yuanke
From: Luo, Yuanke
Sent: Tuesday, March 23, 2021 11:37 AM
To: Florian Hahn <florian_hahn at apple.com>; llvm-dev <llvm-dev at lists.llvm.org>
Cc: Zhang, Xiang1 <Xiang1.Zhang at intel.com>; James Y Knight <jyknight at google.com>
Subject: RE: [llvm-dev] Does middle-end pass need to consider some special type when doing optimization? Or letting back-end to revert the optimization accordingly?
Hi Florian,
Yes, we use `bitcast` in the frontend to convert between regular vectors and AMX values.
Approach 1 looks elegant to me. Thank you for the good idea. We will do a prototype
of approach 1; hopefully it can solve all the issues in the middle-end.
Thanks
Yuanke
From: Florian Hahn <florian_hahn at apple.com>
Sent: Monday, March 22, 2021 11:04 PM
To: Luo, Yuanke <yuanke.luo at intel.com>; llvm-dev <llvm-dev at lists.llvm.org>
Cc: Zhang, Xiang1 <xiang1.zhang at intel.com>; James Y Knight <jyknight at google.com>
Subject: Re: [llvm-dev] Does middle-end pass need to consider some special type when doing optimization? Or letting back-end to revert the optimization accordingly?
On Mar 22, 2021, at 14:02, Luo, Yuanke <yuanke.luo at intel.com> wrote:
Yes, the bitcasts introduced by the frontend feed the AMX intrinsics. We use a vector
type to represent a 2D AMX tile in the C language; on the other hand, we don't want to
mix AMX tiles with other vector operations, so x86_amx was introduced to isolate the AMX
intrinsics from normal vector operations. The bitcast marks the point where a normal
vector is passed to an AMX intrinsic. In the example below, we need to transform the
bitcast into a vector store plus an AMX load intrinsic. A pointer to x86_amx was not
expected originally, but the InstCombine pass in the middle-end generates such an
x86_amx pointer.
define dso_local void @test_src_add(<256 x i32> %x, <256 x i32> %y, i16 %r, i16 %c, i8* %buf, i64 %s) {
; CHECK-LABEL: @test_src_add(
; CHECK-NEXT:  entry:
; CHECK-NEXT:    [[TMP0:%.*]] = alloca <256 x i32>, align 64
; CHECK-NEXT:    [[ADD:%.*]] = add <256 x i32> [[Y:%.*]], [[X:%.*]]
; CHECK-NEXT:    [[TMP1:%.*]] = bitcast <256 x i32>* [[TMP0]] to i8*
; CHECK-NEXT:    store <256 x i32> [[ADD]], <256 x i32>* [[TMP0]], align 1024
; CHECK-NEXT:    [[TMP2:%.*]] = call x86_amx @llvm.x86.tileloadd64.internal(i16 [[R:%.*]], i16 [[C:%.*]], i8* [[TMP1]], i64 64)
; CHECK-NEXT:    call void @llvm.x86.tilestored64.internal(i16 [[R]], i16 [[C]], i8* [[BUF:%.*]], i64 [[S:%.*]], x86_amx [[TMP2]])
; CHECK-NEXT:    ret void
;
entry:
  %add = add <256 x i32> %y, %x
  %t = bitcast <256 x i32> %add to x86_amx
  call void @llvm.x86.tilestored64.internal(i16 %r, i16 %c, i8* %buf, i64 %s, x86_amx %t)
  ret void
}
OK, I think I understand the issue better now. IIUC you use `bitcast` in the
frontend to convert between regular vectors and AMX values?
This doesn't really match the way `bitcast` is defined (as discussed earlier),
and this mismatch seems to be the source of the issues. I don't think you should
use `bitcast`s that way; instead, adjust the frontend to emit different code
for the conversion between vector and AMX values (e.g. use an intrinsic to
convert between them; the intrinsic can be directly lowered to the conversion
code).
I think there are at least two ways forward:
1. Avoid using bitcasts for the conversion in the frontend.
2. Try & define the semantics of bitcast/load for AMX types, such that the
transformations you want to exclude in instcombine are illegal.
If you decide to go with 2., you will probably have to make a convincing
argument for why this is the right thing to do and why the alternatives do not
work, because it means that certain general transformations that are legal at
the moment become illegal for certain types (as illustrated by the
InstCombine patches you mentioned).
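For illustration, here is a minimal sketch of what option 1 might look like for the test case above. It assumes a dedicated conversion intrinsic; the name and signature below are borrowed from the D99152 prototype and are an assumption, not a fixed design:

declare x86_amx @llvm.x86.vector.amx.cast.x86amx.v256i32(<256 x i32>)
declare void @llvm.x86.tilestored64.internal(i16, i16, i8*, i64, x86_amx)

define void @test_src_add(<256 x i32> %x, <256 x i32> %y, i16 %r, i16 %c, i8* %buf, i64 %s) {
entry:
  ; The conversion is an intrinsic call rather than a bitcast, so generic
  ; middle-end passes have no defined fold for it and leave it alone.
  %add = add <256 x i32> %y, %x
  %t = call x86_amx @llvm.x86.vector.amx.cast.x86amx.v256i32(<256 x i32> %add)
  call void @llvm.x86.tilestored64.internal(i16 %r, i16 %c, i8* %buf, i64 %s, x86_amx %t)
  ret void
}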
Cheers.
Florian
Florian Hahn via llvm-dev
2021-Mar-23 10:16 UTC
[llvm-dev] Does middle-end pass need to consider some special type when doing optimization? Or letting back-end to revert the optimization accordingly?
> On Mar 23, 2021, at 08:21, Luo, Yuanke <yuanke.luo at intel.com> wrote:
>
> I prototyped approach 1 at https://reviews.llvm.org/D99152 and I realized that sometimes the bitcast optimization in the middle-end is helpful. For the test case of inner_product(), we need extra effort to eliminate llvm.x86.vector.amx.cast.x86amx.v256i32 ourselves.

I think that's expected; you might need to add some optimizations for the conversion intrinsic. But that can easily be limited to the AMX-specific passes, and all existing LLVM transformations should remain correct without changes.

Cheers,
Florian
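As one hypothetical example of such an AMX-specific optimization, a cast that round-trips a tile through a vector could be folded away by the AMX pass. The reverse-direction intrinsic name below is a placeholder invented for the sketch, not an existing intrinsic:

  ; Before: a tile is cast to a vector and immediately cast back.
  %v  = call <256 x i32> @llvm.x86.vector.amx.cast.v256i32.x86amx(x86_amx %t)
  %t2 = call x86_amx @llvm.x86.vector.amx.cast.x86amx.v256i32(<256 x i32> %v)
  call void @llvm.x86.tilestored64.internal(i16 %r, i16 %c, i8* %buf, i64 %s, x86_amx %t2)

  ; After the AMX-specific fold: both casts are removed and %t is used directly.
  call void @llvm.x86.tilestored64.internal(i16 %r, i16 %c, i8* %buf, i64 %s, x86_amx %t)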
Luo, Yuanke via llvm-dev
2021-Apr-14 12:39 UTC
[llvm-dev] Does middle-end pass need to consider some special type when doing optimization? Or letting back-end to revert the optimization accordingly?
Hi,

I discussed a solution with Florian at https://reviews.llvm.org/D99152. Florian suggested introducing a specific intrinsic to replace the bitcast in the front-end; the back-end then needs extra effort to optimize or eliminate the intrinsic. This idea looks good to me. Here is my plan.

1. Specify x86_amx in the LangRef and verify the IR. Patches were uploaded at https://reviews.llvm.org/D100032 and https://reviews.llvm.org/D100472.
2. Add an llvm.x86.tile.cast intrinsic in LLVM.
3. Optimize some of the llvm.x86.tile.cast code the way bitcast is optimized today, and transform llvm.x86.tile.cast to AMX intrinsics if it can't be eliminated (a sketch follows below).
4. After the above three items are finished, replace bitcast with llvm.x86.tile.cast in the front-end when generating IR for the AMX builtins.
5. After some time for stabilization, remove the bitcast transformation code from LLVM.

Thanks
Yuanke

From: Florian Hahn <florian_hahn at apple.com>
Sent: Tuesday, March 23, 2021 6:16 PM
To: Luo, Yuanke <yuanke.luo at intel.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>; Zhang, Xiang1 <xiang1.zhang at intel.com>; James Y Knight <jyknight at google.com>
Subject: Re: [llvm-dev] Does middle-end pass need to consider some special type when doing optimization? Or letting back-end to revert the optimization accordingly?

On Mar 23, 2021, at 08:21, Luo, Yuanke <yuanke.luo at intel.com> wrote:

I prototyped approach 1 at https://reviews.llvm.org/D99152 and I realized that sometimes the bitcast optimization in the middle-end is helpful. For the test case of inner_product(), we need extra effort to eliminate llvm.x86.vector.amx.cast.x86amx.v256i32 ourselves.

I think that's expected; you might need to add some optimizations for the conversion intrinsic. But that can easily be limited to the AMX-specific passes, and all existing LLVM transformations should remain correct without changes.

Cheers,
Florian
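To illustrate step 3 of the plan above, here is a minimal sketch of how an llvm.x86.tile.cast from a plain vector to a tile might be lowered when it cannot be eliminated, mirroring the existing bitcast lowering (spill to a stack slot, then reload as a tile). The intrinsic signature, the stride of 64, and the %row/%col operands are assumptions made for the sketch:

  ; Before lowering: a vector-to-tile cast the AMX pass could not fold away.
  ;   %t = call x86_amx @llvm.x86.tile.cast(<256 x i32> %v)

  ; After lowering (sketch): store the vector to a stack slot and reload it as
  ; a tile with the AMX load intrinsic.
  %slot = alloca <256 x i32>, align 64
  store <256 x i32> %v, <256 x i32>* %slot, align 64
  %p = bitcast <256 x i32>* %slot to i8*
  %t = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %p, i64 64)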