Zhang, Xiang1 via llvm-dev
2021-Apr-13 06:01 UTC
[llvm-dev] RFC [X86 AMX] O0: Support AMX fast register allocation
[X86 AMX] O0: Support AMX fast register allocation
The AMX programming model was discussed on llvm-dev
(http://lists.llvm.org/pipermail/llvm-dev/2020-August/144302.html).
Some features of building AMX at O0:
1. The shapes of tiles are very hard to compare.
2. The live ranges of tile registers are usually very short.
3. AMX memory instructions (e.g. TILELOADD/TILESTORED) use an index
register as the stride, which is troublesome for fast register allocation.
(Detail: when we spill tmm registers in fast register allocation, just as
with other registers, we need to generate a tilestore/tileload pair for each
spilled tmm register.
For example:
TILESTORED %stack.2, 1, %16:gr64_nosp, 0, $noreg, killed $tmm0
.....
$tmm0 = TILELOADD %stack.2, 1, %16:gr64_nosp, 0, $noreg :: (load 1024 from %stack.2)
We need to make sure there is a usable index register for the tile memory
operand, and set that index register to the stride,
but by that point all registers have already been allocated!)
1>
At O0, customers usually build with clang -O0 -S/-c (front end and back
end both compile at O0):
the tile data of an AMX intrinsic must be loaded before each use, and stored to
memory after each definition of a tile register.
Something like:
----------------------------------------------------------------------
%t1 = call x86_amx @llvm.x86.tileloadd64.internal(m, k, ...)
%t2 = call x86_amx @llvm.x86.tileloadd64.internal(k, n, ...)
%t3 = call x86_amx @llvm.x86.tileloadd64.internal(m, n, ...)
%td = call x86_amx @llvm.x86.tdpbssd.internal(m, n, k, t1, t2, t3) // key amx intrinsic
call void @llvm.x86.tilestored64.internal(... td)
----------------------------------------------------------------------
Because the live range of a tile register is very short (from tileload to
tilestore, so it is impossible to spill), we let fast register allocation
directly allocate tile registers for them.
As the AMX programming model above shows, we need an ldtilecfg for each tile
register before using it.
So we insert an ldtilecfg for every key AMX intrinsic. (There are 2 reasons to
do this: 1. we don't care much about performance at O0; 2. the shapes are very
hard to compare at O0.)
e.g.
----------------------------------------------------------------------
%cfgmem = alloca <16 x i32>, align 4
store <16 x i32> zeroinitializer, <16 x i32>* %cfgmem
call void @llvm.x86.ldtilecfg.internal(i8* %cfgmem)
---------------------------------------------------------------------
%t1 = call x86_amx @llvm.x86.tileloadd64.internal(m, k, ...)
%t2 = call x86_amx @llvm.x86.tileloadd64.internal(k, n, ...)
%t3 = call x86_amx @llvm.x86.tileloadd64.internal(m, n, ...)
%td = call x86_amx @llvm.x86.tdpbssd.internal(m, n, k, t1, t2, t3) // key amx intrinsic
call void @llvm.x86.tilestored64.internal(... td)
-------------------------------------------------------------------------
But ldtilecfg needs the shapes of the tile registers written into its config
memory, so we write the shapes before fast register allocation. (It is
troublesome to do it after register allocation, because the shape registers
allocated for the AMX intrinsics may not be live at the writing position.)
But at that point we don't yet know which physical tile register each virtual
shape register is written for (because it is before register allocation).
So we just write these shapes into the config memory in order:
e.g.
----------------------------------------------------------------------
%cfgmem = alloca <16 x i32>, align 4                       * allocate mem
store <16 x i32> zeroinitializer, <16 x i32>* %cfgmem      * zero init
...
// pre-config shape of %t1
store volatile i8 %m, i8* %amx.tmm.0.shape.row, align 1    *
store volatile i16 %k, i16* %amx.tmm.0.shape.col, align 2  *
// pre-config shape of %t2                                 * pre-config shapes
store volatile i8 %k, i8* %amx.tmm.1.shape.row, align 1    *
store volatile i16 %n, i16* %amx.tmm.1.shape.col, align 2  *
// pre-config shape of %t3, %td
....
call void @llvm.x86.ldtilecfg.internal(i8* %cfgmem)        * tile config
-------------------------------------------------------------------------
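For reference, in the 64-byte tile-configuration memory documented by Intel
(palette 1 layout), the 16-bit per-tile column widths in bytes start at byte
16 (two bytes per tile) and the 8-bit per-tile row counts start at byte 48
(one byte per tile). A minimal Python sketch of the slot offsets the
pre-config stores target (illustrative only, not LLVM code):

```python
# Byte offsets into the 64-byte AMX tile-configuration memory (palette 1):
COLSB_BASE = 16  # tile N's 16-bit column width (bytes) lives at 16 + 2*N
ROWS_BASE = 48   # tile N's 8-bit row count lives at 48 + N

def shape_offsets(tile_index):
    """Return (row_offset, colsb_offset) into the config memory for tmmN."""
    return ROWS_BASE + tile_index, COLSB_BASE + 2 * tile_index

print(shape_offsets(0))  # tmm0: rows at byte 48, colsb at bytes 16-17
print(shape_offsets(1))  # tmm1: rows at byte 49, colsb at bytes 18-19
```

So "base + 48" below is exactly the row slot of tmm0, and "base + 48 + 1" the
row slot of tmm1.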
And then we adjust them after fast register allocation.
e.g.
We originally wrote the first shape into %amx.tmm.0.shape.row (base + 48), but
if after fast register allocation we find that the first shape does not
correspond to the first tile register (tmm0) but to the 2nd tile register
(tmm1), we adjust the written memory to %amx.tmm.1.shape.row (base + 48 + 1).
---------------------------------------------------------------------------
MOV8mi %stack.5, 1, $noreg, 49, $noreg, 8 :: (volatile store 1 into
%ir.amx.tmm.0.shape.row)
MOV16mr %stack.5, 1, $noreg, 18, $noreg, renamable $cx :: (volatile store 2 into
%ir.amx.tmm.0.shape.col)
...
PLDTILECFGV killed renamable $rsi, 1, $noreg, 0, $noreg
--------------------------------------------------------------------------
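The adjustment step can be sketched as follows (a minimal Python sketch with
hypothetical helper names, not the actual pass code): given the order in which
the shapes were pre-written and the physical tmm register that fast register
allocation actually assigned to each one, remap every shape store's offset
from the slot of the assumed tile to the slot of the allocated tile.

```python
COLSB_BASE = 16  # 16-bit column widths start at byte 16 (2 bytes per tile)
ROWS_BASE = 48   # 8-bit row counts start at byte 48 (1 byte per tile)

def adjust_shape_stores(stores, assigned_tmm):
    """Remap pre-config shape stores after fast register allocation.

    stores: list of (kind, offset), kind is 'row' or 'col'; offsets were
        computed assuming shapes were written in order (virtual tile 0, 1...).
    assigned_tmm: assigned_tmm[i] is the physical tmm index fast RA gave
        to the i-th pre-written virtual tile.
    """
    adjusted = []
    for kind, offset in stores:
        if kind == 'row':
            virt = offset - ROWS_BASE            # recover virtual tile index
            adjusted.append((kind, ROWS_BASE + assigned_tmm[virt]))
        else:
            virt = (offset - COLSB_BASE) // 2    # colsb entries are 2 bytes
            adjusted.append((kind, COLSB_BASE + 2 * assigned_tmm[virt]))
    return adjusted

# The first virtual tile actually landed in tmm1, as in the MIR above:
# row offset 48 -> 49, colsb offset 16 -> 18.
print(adjust_shape_stores([('row', 48), ('col', 16)], [1]))
```

This matches the MIR above, where the row store ends up at offset 49 and the
column store at offset 18 (i.e. the tmm1 slots).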
2>
Customers usually use clang -O0 -S/-c (front end and back end both
compile at O0).
But LLVM developers may let the front end build at O1/2/... and the back end
at O0 (e.g. clang -O0 -S -emit-llvm + llc -O0).
Considering that this is not the main way of building a program, and to let
the algorithm above work there too, I "volatile" the tile data of the key AMX
intrinsics in the pass "Lower AMX type for load/store", making the IR look as
it does from clang -O0: all tile data of key AMX intrinsics must be loaded
before each use, and stored to memory after each definition of a tile
register. Because the back end builds at O0, we don't consider performance
here, only correctness.
A first implementation of the design is at https://reviews.llvm.org/D100026
BR!
Thank you!
Xiang