Luo, Yuanke via llvm-dev
2021-Mar-19 01:58 UTC
[llvm-dev] Does middle-end pass need to consider some special type when doing optimization? Or letting back-end to revert the optimization accordingly?
Hi James,
Thank you for taking the time to dive deep into the issue; it is very constructive. I
agree we can transform “load x86_amx*” to the AMX load intrinsic, but it seems to
take more effort than preventing “load x86_amx*” from being generated in the first
place. I can support transforming “load x86_amx*” to the AMX load intrinsic if people
prefer this approach.
I also think Florian raises a good question: what are the semantics of “load
x86_amx*”? Are they different from those of regular LLVM pointer types? What are
your opinions on it?
Thanks
Yuanke
From: James Y Knight <jyknight at google.com>
Sent: Friday, March 19, 2021 9:28 AM
To: Luo, Yuanke <yuanke.luo at intel.com>
Cc: Florian Hahn <florian_hahn at apple.com>; Wang, Pengfei
<pengfei.wang at intel.com>; llvm-dev <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] Does middle-end pass need to consider some special type
when doing optimization? Or letting back-end to revert the optimization
accordingly?
Why is that harder than lowering a `load <256 x i32>` followed by a bitcast to
x86_amx?
E.g., I see there is a transform in llvm/lib/Target/X86/X86LowerAMXType.cpp:
%src = load <256 x i32>, <256 x i32>* %addr, align 64
%2 = bitcast <256 x i32> %src to x86_amx
-->
%2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %addr,
i64 %stride64)
Isn't it equivalent, then, to do:
%2 = load x86_amx, x86_amx* %addr, align 64
-->
%2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %addr,
i64 %stride64)
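For illustration only, a minimal sketch of what such a rewrite could look like with
IRBuilder (the helper name and the assumption that the shape operands %row/%col have
already been deduced from the tile's users are mine, mirroring what the existing
bitcast transform in X86LowerAMXType.cpp does):

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/IntrinsicsX86.h"
#include "llvm/IR/Module.h"
using namespace llvm;

// Sketch: replace `%v = load x86_amx, x86_amx* %p` with a call to
// @llvm.x86.tileloadd64.internal, assuming Row/Col were deduced elsewhere.
static void lowerAMXLoad(LoadInst *LD, Value *Row, Value *Col) {
  IRBuilder<> Builder(LD);
  Module *M = LD->getModule();
  // The intrinsic wants an i8* base pointer and an i64 stride; 64 is the
  // dense row size in bytes that the existing <256 x i32> transform uses.
  Value *Ptr = Builder.CreateBitCast(LD->getPointerOperand(),
                                     Builder.getInt8PtrTy());
  Value *Stride = Builder.getInt64(64);
  Function *TileLoad =
      Intrinsic::getDeclaration(M, Intrinsic::x86_tileloadd64_internal);
  Value *Tile = Builder.CreateCall(TileLoad, {Row, Col, Ptr, Stride});
  LD->replaceAllUsesWith(Tile);
  LD->eraseFromParent();
}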
On Thu, Mar 18, 2021 at 9:29 AM Luo, Yuanke <yuanke.luo at intel.com> wrote:
But x86_amx represents a tile. The semantics of the hardware instruction tileloadd are
something like ‘llvm.matrix.row.major.load’. How do we lower `%v = load x86_amx,
x86_amx* %ptr` to tileloadd?
From: James Y Knight <jyknight at google.com>
Sent: Thursday, March 18, 2021 9:09 PM
To: Luo, Yuanke <yuanke.luo at intel.com>
Cc: Florian Hahn <florian_hahn at apple.com>; Wang, Pengfei
<pengfei.wang at intel.com>; llvm-dev <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] Does middle-end pass need to consider some special type
when doing optimization? Or letting back-end to revert the optimization
accordingly?
Since the x86_amx type has a fixed size of 1024 bytes, I would expect `%v = load
x86_amx, x86_amx* %ptr` to load 1024 bytes of contiguous memory starting at %ptr
-- I don't see why this should be invalid?
On Thu, Mar 18, 2021 at 8:53 AM Luo, Yuanke <yuanke.luo at intel.com> wrote:
I mean that transforming “load <256 x i32>*” to “load x86_amx*” is not valid,
because x86_amx represents a tile and “load x86_amx*” doesn’t express its
semantics without a stride. Now it looks to me like “load x86_amx*” is invalid.
From: James Y Knight <jyknight at google.com>
Sent: Thursday, March 18, 2021 8:41 PM
To: Luo, Yuanke <yuanke.luo at intel.com>
Cc: Florian Hahn <florian_hahn at apple.com>; Wang, Pengfei
<pengfei.wang at intel.com>; llvm-dev <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] Does middle-end pass need to consider some special type
when doing optimization? Or letting back-end to revert the optimization
accordingly?
Err... are you saying this is the expected semantics of a "load x86_amx"
operation today? That doesn't make much sense... Surely a strided-load operation
should be spelled `llvm.matrix.column.major.load` in the IR, not `load`?
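For comparison, a minimal sketch of how a strided load of a 16x16 i32 tile might be
spelled with the matrix intrinsic (illustrative only; the exact name mangling and the
operand values here are assumptions, see the LangRef for the authoritative signature):

; <Ptr>, <Stride>, <IsVolatile>, <Rows>, <Cols>: loads 16 columns of 16 i32
; elements each, with consecutive columns 64 elements apart in memory.
%tile = call <256 x i32> @llvm.matrix.column.major.load.v256i32.i64(
            i32* %ptr, i64 64, i1 false, i32 16, i32 16)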
On Thu, Mar 18, 2021 at 8:17 AM Luo, Yuanke via llvm-dev <llvm-dev at lists.llvm.org> wrote:
Thanks, Florian. I agree with you that pointers to `x86_amx` have different
semantics than regular LLVM pointer types. First, an x86_amx pointer points to a
2D tile of a larger matrix. The data within each row is contiguous, but the data in
consecutive rows is not contiguous in memory. The picture below shows the x86_amx
load semantics: we need an extra stride operand to describe the distance between
rows. So the semantics of “load <256 x i32>*” and “load x86_amx*” are different,
because “load <256 x i32>*” assumes the memory is contiguous and loads a flat
vector.
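To make the difference concrete, compare a flat vector load with the AMX intrinsic
the backend actually needs (the operand names here are placeholders):

; contiguous: reads 1024 consecutive bytes starting at %addr
%flat = load <256 x i32>, <256 x i32>* %addr, align 64

; strided: reads %row rows of the tile, with consecutive rows
; %stride bytes apart in memory
%tile = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col,
                                                    i8* %addr, i64 %stride)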
You also mentioned that there is no documentation of x86_amx in the LangRef. I’d
like to add x86_amx to the document. Is there any process for documenting a
type?
[Figure: x86_amx tile load semantics -- each row of the tile is contiguous in memory, and consecutive rows are separated by a stride.]
Thanks
Yuanke
From: Florian Hahn <florian_hahn at apple.com>
Sent: Thursday, March 18, 2021 6:03 PM
To: Wang, Pengfei <pengfei.wang at intel.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>; Luo, Yuanke <yuanke.luo at intel.com>
Subject: Re: [llvm-dev] Does middle-end pass need to consider some special type
when doing optimization? Or letting back-end to revert the optimization
accordingly?
On Mar 17, 2021, at 10:11, Wang, Pengfei via llvm-dev <llvm-dev at lists.llvm.org> wrote:
Hi,
We are developing prototypes for the Intel Advanced Matrix Extensions (AMX) [1]
programming model in Clang and LLVM [2].
We have met several cases where the type we added is optimized unexpectedly
in the middle-end, e.g. optimizing phi + bitcast + load:
From
%a = load <256 x i32>, <256 x i32>* %mem, align 64
… …
%b = phi <256 x i32> [ %a, %label1 ], [%someother, %label2]
%c = bitcast <256 x i32> %b to x86_amx
To
%a = bitcast <256 x i32>* %mem to x86_amx*
%b = load x86_amx, x86_amx* %a, align 64
… …
%c = phi x86_amx [ %b, %label1 ], [%someother, %label2]
To prevent such unexpected transforms, we added a type check at each point of
those optimizations.
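The checks are essentially of the following shape (a simplified sketch, not the exact
code we committed):

#include "llvm/IR/Type.h"
using namespace llvm;

// A combine bails out whenever the transform would create a new value or
// pointer of the AMX tile type.
static bool shouldSkipForAMX(Type *SrcTy, Type *DestTy) {
  return SrcTy->isX86_AMXTy() || DestTy->isX86_AMXTy();
}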
Roman pointed out that the changes are not the right direction [3] and considered it a
bug in the backend. While we agree the backend might be able to handle it for
functionality, we think it is better to handle it in the middle-end, since these are
negative optimizations for AMX.
First, let me put some background here:
1. x86_amx* is different from regular pointers.
The AMX load instruction is quite different from other load instructions: it needs
not only the memory address but also the shape/stride of the tile register. We did
some extra work in the backend to deduce the shape information from the context.
We don’t want passes to add new x86_amx-related uses, because that makes the
deduction more difficult. That said, bitcasting other pointer types to x86_amx* is
not as trivial as assumed here.
The problem appears to be that this difference is not modeled or specified in
LLVM IR AFAICT. The current LangRef does not appear to specify `x86_amx` to
start with. If pointers to `x86_amx` have different semantics than regular LLVM
pointer types, using regular LLVM pointer types for pointers to `x86_amx` may
not be appropriate. I’ve not followed the previous AMX discussions closely, but
it sounds like it may be good to reconsider how x86_amx pointers are modeled in
LLVM IR.
Also note that `bitcast` is specified as a no-op
(https://llvm.org/docs/LangRef.html#id293) (except for pointers with different
address spaces), but from what you mentioned above this does not match the
semantics for `x86_amx*`. It sounds like this is the underlying problem that
should be addressed, because trying to update various middle-end optimizations
to try to enforce the special semantics does not seem to be a scalable solution.
As Nuno mentioned, you could try to use a separate address space for `x86_amx`
pointers to avoid pointer optimizations.
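For example (the address space number 7 is purely hypothetical), with a dedicated
address space the problematic rewrite would no longer be expressible with a plain
bitcast:

; invalid: bitcast cannot change the address space of a pointer
%p = bitcast <256 x i32>* %mem to x86_amx addrspace(7)*

; the conversion would have to be an explicit addrspacecast instead,
; which passes treat much more conservatively
%q = addrspacecast <256 x i32>* %mem to x86_amx addrspace(7)*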
2. The physical tile registers have more limitations.
    * There is no copy instruction between tile registers.
    * Spilling/reloading a tile register is expensive, given that its size is
1024 bytes.
    * The shapes of tile registers need to be pre-configured before use, and
all data in tile registers becomes invalid once they are re-configured. That means
we need to dominate as many tile registers as possible to configure their shapes
with one configure instruction; otherwise we need to spill and reload the live
registers whenever we have to re-configure.
    * The number of tile registers is rather small (only 8) and different
shapes cannot be reused.
Based on these limitations, we need to reduce the use and live ranges of tile
registers, but optimizations may increase their use. So even if we can handle some
combined operations for the AMX type, we still prefer to prevent them from the
beginning, unless we can totally roll back the optimization, which is also not a
good solution in my opinion.
3. For more information, please refer to the discussion in [3].
For other optimization points, please refer to [4][5].
I think the main controversy raised by Roman is whether middle-end passes should
consider special types when doing optimization. I tend to let the middle-end do the
type check on account of the peculiarity of the AMX type, but I’m not sure whether
we have precedent for handling a similar issue in other targets. I’m open and glad
to do it either way, as long as we have an elegant solution.
Any suggestions are welcome.
IIUC the main problem is not that middle-end passes perform or do not perform
optimizations based on certain types. To me it sounds like the actual problem is
that pointers to `x86_amx` do not behave like regular LLVM IR pointers and you
are trying to enforce extra restrictions for bitcasts.
Cheers,
Florian
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Luo, Yuanke via llvm-dev
2021-Mar-20 12:51 UTC
[llvm-dev] Does middle-end pass need to consider some special type when doing optimization? Or letting back-end to revert the optimization accordingly?
I wrote a patch (https://reviews.llvm.org/D93788) to transform load/store of
x86_amx* into AMX intrinsics. The effort is much more than disabling the bitcast
from load/store of <256 x i32>* to load/store of x86_amx*.
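In rough terms the patch rewrites plain tile loads/stores into the internal
intrinsics, along the lines of the sketch below (simplified; %row/%col/%stride are
deduced as before, and %p8/%q8 stand for the pointers cast to i8*):

%v = load x86_amx, x86_amx* %p, align 64
store x86_amx %v, x86_amx* %q, align 64
-->
%v = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col,
                                                 i8* %p8, i64 %stride)
call void @llvm.x86.tilestored64.internal(i16 %row, i16 %col, i8* %q8,
                                          i64 %stride, x86_amx %v)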
Florian Hahn via llvm-dev
2021-Mar-22 13:29 UTC
[llvm-dev] Does middle-end pass need to consider some special type when doing optimization? Or letting back-end to revert the optimization accordingly?
> On Mar 19, 2021, at 01:58, Luo, Yuanke <yuanke.luo at intel.com> wrote:
>
> Hi James,
>
> Thank you for taking the time to dive deep into the issue; it is very constructive. I agree we can transform “load x86_amx*” to the AMX load intrinsic, but it seems to take more effort than preventing “load x86_amx*” from being generated in the first place. I can support transforming “load x86_amx*” to the AMX load intrinsic if people prefer this approach.
>
> I also think Florian raises a good question: what are the semantics of “load x86_amx*”? Are they different from those of regular LLVM pointer types? What are your opinions on it?

From the points earlier, it sounds like you’d need to change the `load` semantics for `x86_amx` to load blocks of data with gaps in between them? I am not sure that’s a good idea, as there are plenty of places in LLVM that make use of the contiguous-load assumption, I think (e.g. the code reasoning about memory locations). I’d expect lots of places would need updating, and until everything is updated there will be plenty of places that get this subtly wrong. This doesn’t sound scalable.
Cheers,
Florian