Zhang, Xiang1 via llvm-dev
2021-Apr-13 06:01 UTC
[llvm-dev] RFC [X86 AMX] O0: Support AMX fast register allocation
[X86 AMX] O0: Support AMX fast register allocation

The AMX programming model was discussed on llvm-dev
(http://lists.llvm.org/pipermail/llvm-dev/2020-August/144302.html).

Some features of building AMX at the O0 level:
1. The shapes of tiles are very hard to compare.
2. The live range of a tile register is usually very short.
3. AMX memory instructions (e.g. TileLoad/Store) use an index register as
   the stride, which causes trouble in fast register allocation.
   (Detail: when we spill tmm registers in fast register allocation, as
   with other registers, we need to generate tilestore/tileload for the
   spilled tmm registers, for example:

   TILESTORED %stack.2, 1, %16:gr64_nosp, 0, $noreg, killed $tmm0
   .....
   $tmm0 = TILELOADD %stack.2, 1, %16:gr64_nosp, 0, $noreg :: (load 1024 from %stack.2)

   We need to make sure there is a usable index register for the tile
   memory operation, and set that index register to the stride (but the
   registers have already been allocated!))

1> At the O0 level, which for customers usually means clang -O0 -S/-c
(front end and back end both compile at O0):

The tile data of an AMX intrinsic must be loaded before it is used, and
stored into memory after a tile register is defined. Something like:
----------------------------------------------------------------------
%t1 = call x86_amx @llvm.x86.tileloadd64.internal(m, k, ...)
%t2 = call x86_amx @llvm.x86.tileloadd64.internal(k, n, ...)
%t3 = call x86_amx @llvm.x86.tileloadd64.internal(m, n, ...)
%td = call x86_amx @llvm.x86.tdpbssd.internal(m, n, k, t1, t2, t3) // key amx intrinsic
call void @llvm.x86.tilestored64.internal(... td)
----------------------------------------------------------------------
Because the live range of a tile register is very short (from tileload to
tilestore, so no spill can occur), we let fast register allocation
directly allocate tile registers for them.

As the AMX programming model above shows, we need ldtilecfg for each tile
register before using it.
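For reference, the 64-byte config memory that ldtilecfg reads can be pictured as the C struct below. This is only a sketch, assuming the palette 1 layout from the Intel ISA reference; the struct and helper names are hypothetical and are not LLVM code. It shows at which byte offsets from the start of the config memory the per-tile row and column fields sit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct name. Field layout assumes the 64-byte tile
   configuration format of palette 1, where only entries 0..7 of
   colsb[] and rows[] describe tiles and the rest are reserved (zero). */
struct amx_tilecfg {
    uint8_t  palette_id;    /* byte 0 */
    uint8_t  start_row;     /* byte 1 */
    uint8_t  reserved[14];  /* bytes 2..15, must be zero */
    uint16_t colsb[16];     /* bytes 16..47: bytes per row, one entry per tile */
    uint8_t  rows[16];      /* bytes 48..63: number of rows, one entry per tile */
};

/* Compile-time checks that the struct matches the assumed layout. */
static_assert(sizeof(struct amx_tilecfg) == 64, "config memory is 64 bytes");
static_assert(offsetof(struct amx_tilecfg, colsb) == 16, "colsb starts at byte 16");
static_assert(offsetof(struct amx_tilecfg, rows) == 48, "rows starts at byte 48");

/* Byte offset of the row-count field for tile register tmm<N>. */
static inline size_t amx_row_offset(unsigned tmm) {
    return offsetof(struct amx_tilecfg, rows) + tmm;
}

/* Byte offset of the column-bytes field for tile register tmm<N>. */
static inline size_t amx_col_offset(unsigned tmm) {
    return offsetof(struct amx_tilecfg, colsb) + 2u * tmm;
}
```

Under these assumptions, amx_row_offset(1) is 49 and amx_col_offset(1) is 18, so moving a shape from one tile register to the next shifts the row field by one byte and the column field by two bytes.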
So we insert ldtilecfg for every key AMX intrinsic. (There are two
reasons to do this: 1. we do not care much about performance at O0;
2. the shapes are very hard to compare at the O0 level.) For example:
----------------------------------------------------------------------
%cfgmem = alloca <16 x i32>, align 4
store <16 x i32> zeroinitializer, <16 x i32>* %cfgmem
call void @llvm.x86.ldtilecfg.internal(i8* %cfgmem)
----------------------------------------------------------------------
%t1 = call x86_amx @llvm.x86.tileloadd64.internal(m, k, ...)
%t2 = call x86_amx @llvm.x86.tileloadd64.internal(k, n, ...)
%t3 = call x86_amx @llvm.x86.tileloadd64.internal(m, n, ...)
%td = call x86_amx @llvm.x86.tdpbssd.internal(m, n, k, t1, t2, t3) // key amx intrinsic
call void @llvm.x86.tilestored64.internal(... td)
----------------------------------------------------------------------
But ldtilecfg needs the shapes of the tile registers to be written into
its config memory, so we write the shapes before fast register
allocation. (It is troublesome to do this after register allocation,
because the registers holding the shapes for the AMX intrinsics may not
be live at the point of the write.)

At that point, however, we do not know which physical tile register each
virtual register's shape should be written for (because this is before
register allocation). So we simply write these shapes into the config
memory in order, for example:
----------------------------------------------------------------------
%cfgmem = alloca <16 x i32>, align 4                        * allocate mem
store <16 x i32> zeroinitializer, <16 x i32>* %cfgmem       * zero init
...
// pre-config shape of %t1                                  *
store volatile i8 %m, i8* %amx.tmm.0.shape.row, align 1     *
store volatile i16 %k, i16* %amx.tmm.0.shape.col, align 2   * pre-config
// pre-config shape of %t2                                  * shapes
store volatile i8 %k, i8* %amx.tmm.1.shape.row, align 1     *
store volatile i16 %n, i16* %amx.tmm.1.shape.col, align 2   *
// pre-config shape of %t3, %td                             *
....
call void @llvm.x86.ldtilecfg.internal(i8* %cfgmem)         * tile config
----------------------------------------------------------------------
We then adjust them after fast register allocation. For example, suppose
we wrote the first shape into %amx.tmm.0.shape.row (base + 48). If, after
fast register allocation, we find that the first shape corresponds not to
the first tile register (tmm0) but to the second tile register (tmm1), we
adjust the written memory to %amx.tmm.1.shape.row (base + 48 + 1).
----------------------------------------------------------------------
MOV8mi %stack.5, 1, $noreg, 49, $noreg, 8 :: (volatile store 1 into %ir.amx.tmm.0.shape.row)
MOV16mr %stack.5, 1, $noreg, 18, $noreg, renamable $cx :: (volatile store 2 into %ir.amx.tmm.0.shape.col)
...
PLDTILECFGV killed renamable $rsi, 1, $noreg, 0, $noreg
----------------------------------------------------------------------

2> Customers usually use clang -O0 -S/-c (front end and back end both
compile at O0). But LLVM developers may often build the front end with
O1/2/... and the back end with O0 (e.g.: clang -O0 -S -emit-llvm + llc -O0).

Since this is not the main way of building programs, and so that the
algorithm above works here too, I "volatile" the tile data of key AMX
intrinsics in the pass "Lower AMX type for load/store". This makes it
behave as with clang -O0: all tile data of a key AMX intrinsic must be
loaded before use and stored into memory after a tile register is
defined.

Because the back end builds at O0, we do not consider performance here,
only correctness.

I have a first implementation of the design at
https://reviews.llvm.org/D100026

BR!
Thank you!
Xiang