林政宗 via llvm-dev
2020-Jun-25 06:11 UTC
[llvm-dev] How to implement load/store for vector predicate register
Hi there,

I am writing a backend and have run into a problem. We don't have load/store instructions for vector predicate registers (vpr for short).

The hardware has 64 vector registers (vr for short) and 8 vector predicate registers, and there are no move instructions between vr and vpr. vr supports many operations; vpr supports only the vpror, vprxor, vprand and vprinv operations.

A vr has 512 bits and a vpr has 128 bits. vr is used for v16i32, v32i16 and v64i8. A scalar register has 32 bits. If we compare or add two v16i32 vectors, an element in the vpr has 8 bits. If we compare or add two v64i8 vectors, an element in the vpr has 2 bits (one bit for the compare flag and one bit for the carry flag). An element in a vpr always contains a carry flag and a compare flag.

We have defined registers and a new type (vpr) for the vector predicate registers in the backend. Although there is no direct instruction to move vpr to vr or vr to vpr, there is a workaround, and we do have load/store instructions for vr.

Move vpr to vr for v32i16 (from vpr0 to vr1):

 1  vclr vr0                        // clear vr0
 2  ldi r5, 0x00010001              // load immediate (compare bit mask for v32i16) into scalar register r5
 3  movr2vr.dup vr2, r5             // duplicate the content of r5 into vr2
 4  vadd.t.s16 vr1, vr0, vr2, vpr0  // vector add if the element's compare bit is set, element type is
                                    // 16-bit signed integer; the compare bits are now moved from vpr0 to vr1
 5  ldi r5, 0x00020002              // load immediate (carry bit mask for v32i16) into scalar register r5
 6  movr2vr.dup vr2, r5             // duplicate the content of r5 into vr2
 7  vadd.c.s16 vr1, vr1, vr2, vpr0  // vr1 = vr1 + vr2, vector add if the element's carry bit is set, element
                                    // type is 16-bit signed integer; the carry bits are now moved to vr1 too

Move vr to vpr for v32i16 (from vr1 to vpr0):

 8  vclr vr0                        // clear vr0
 9  ldi r5, 0x00010001              // load immediate (compare bit mask for v32i16) into r5
10  movr2vr.dup vr2, r5             // duplicate the content of r5 into vr2
11  vand.u16 vr2, vr1, vr2          // vector and, element type is 16-bit unsigned integer,
                                    // vr2 = vr1 & vr2; vr2 now holds the compare bits from vr1
12  vslt.s16 vpr0, vr0, vr2         // vector set when less than, element type is 16-bit signed integer;
                                    // the compare bits are now moved from vr1 to vpr0
13  ldi r5, 0x00020002              // load immediate (carry bit mask for v32i16) into r5
14  movr2vr.dup vr2, r5             // duplicate the content of r5 into vr2
15  vand.u16 vr2, vr1, vr2          // vector and, element type is 16-bit unsigned integer;
                                    // vr2 now holds the carry bits
16  ldi r5, 0x7FFF7FFF              // maximum value of a 16-bit signed integer
17  movr2vr.dup vr3, r5             // duplicate r5 into vr3
18  vadd.s16 vr1, vr2, vr3, vpr0    // vpr0 now has the carry bits set

Each vector type needs a different instruction sequence, because the bit mask and the element type are different.

I have tried to lower load/store for vpr in XXXISelLowering.cpp, but there is no guarantee that line 12 and line 18 will be assigned the same register for vpr0. vpr0 in line 18 is an output, not an input, and the vpr0 operands in line 12 and line 18 are parallel in the SelectionDAG graph: both are outputs.

As a next step, I am thinking of defining three pseudo instructions, one per vector type, and expanding each pseudo instruction into its instruction sequence before register allocation. But I'm not sure it will work.

What should I do?

Thanks and best regards,
Jerry
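To make the two sequences above concrete, here is a small C++ model of the v32i16 round trip. It is only a sketch under assumptions that the message does not spell out: that a predicated vadd leaves a lane equal to its first source operand when the flag is clear, that vslt writes only the compare flag, that vadd.s16 writes only the carry flag, and that this flag is set when the signed 16-bit add overflows. The names `PredLane`, `vprToVr`, `vrToVpr` and `sameFlags` are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy model of the v32i16 case: 32 lanes, each with a compare flag and a
// carry flag. Flag layout and instruction semantics are assumptions.
constexpr int kLanes = 32;

struct PredLane {
  bool cmp = false;    // compare flag
  bool carry = false;  // carry flag
};

using Vpr = std::array<PredLane, kLanes>;
using Vr = std::array<uint16_t, kLanes>;

// Lines 1-7: vpr -> vr. Each lane ends up as cmp | (carry << 1).
Vr vprToVr(const Vpr &p) {
  Vr vr1{};  // starts from the cleared vr0 of line 1
  for (int i = 0; i < kLanes; ++i) {
    if (p[i].cmp) vr1[i] += 0x0001;    // lines 2-4: add compare mask where cmp is set
    if (p[i].carry) vr1[i] += 0x0002;  // lines 5-7: add carry mask where carry is set
    assert(vr1[i] <= 3);               // only the two low bits can be set
  }
  return vr1;
}

// Lines 8-18: vr -> vpr.
Vpr vrToVpr(const Vr &vr1) {
  Vpr p{};
  for (int i = 0; i < kLanes; ++i) {
    int16_t cmpBits = static_cast<int16_t>(vr1[i] & 0x0001);  // lines 9-11
    p[i].cmp = 0 < cmpBits;                                   // line 12: 0 < vr2
    int32_t sum = int32_t(vr1[i] & 0x0002) + 0x7FFF;          // lines 13-18
    p[i].carry = sum > INT16_MAX;  // assumed: flag set on signed overflow
  }
  return p;
}

// Lane-by-lane comparison of two predicate registers.
bool sameFlags(const Vpr &a, const Vpr &b) {
  for (int i = 0; i < kLanes; ++i)
    if (a[i].cmp != b[i].cmp || a[i].carry != b[i].carry) return false;
  return true;
}
```

In this model a store of a vpr would be `vprToVr` followed by the ordinary vr store, and a load would be the vr load followed by `vrToVpr`. The v16i32 and v64i8 cases would differ only in lane count and mask values.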
Hal Finkel via llvm-dev
2020-Jun-25 12:29 UTC
[llvm-dev] How to implement load/store for vector predicate register
On 6/25/20 1:11 AM, 林政宗 via llvm-dev wrote:
> [...]
> I think I would try to define three pseudo instructions for three
> vector type, and expand the pseudo instruction into instruction
> sequence before register allocation at next step. But I'm not sure it
> will work.
> What should I do?

This somewhat depends on how you're modeling things, but late-expanded pseudo-instructions seem like a workable approach. If the pseudo-instruction needs temporary registers (and it looks like it does), then the pseudo-instruction should take them as register operands (so that RA will allocate them for you and you don't need to worry about scavenging them later). You might, however, need to mark such operands as "early clobber" to prevent RA from assigning the same register as an input and output (sometimes, depending on how the expanded code uses the registers, this is necessary).

 -Hal

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
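The hazard that "early clobber" guards against can be shown with a toy register file in plain C++. This is not LLVM code: `RegFile`, `expandPseudo` and `run` are hypothetical names, and the three-step expansion is a simplified stand-in for the sequences in the original message. The point is only that an expansion which writes a temporary before it has finished reading its input gives a wrong result if the allocator puts both in the same register.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy register file. Indices stand in for whatever physical registers the
// allocator picked; everything here is hypothetical.
using RegFile = std::array<uint16_t, 8>;

// Expansion of a made-up pseudo "dst = (src & 1) ? 1 : 0" that needs one
// temporary. Like the sequences in the thread, it writes the temporary
// (the vclr step) before the source has been read.
void expandPseudo(RegFile &rf, int dst, int src, int tmp) {
  assert(dst >= 0 && dst < 8 && src >= 0 && src < 8 && tmp >= 0 && tmp < 8);
  rf[tmp] = 0;                      // vclr tmp
  uint16_t bit = rf[src] & 1;       // vand: source is read after tmp was written
  rf[dst] = rf[tmp] < bit ? 1 : 0;  // vslt: compare against the cleared temporary
}

// Runs the expansion with `input` placed in register `src` and returns dst.
uint16_t run(int dst, int src, int tmp, uint16_t input) {
  RegFile rf{};
  rf[src] = input;
  expandPseudo(rf, dst, src, tmp);
  return rf[dst];
}
```

With distinct registers, `run(3, 1, 2, 1)` yields 1 as intended. If the temporary shares a register with the input, as in `run(3, 1, 1, 1)`, the clear destroys the input and the result is 0. Marking the temporary's def operand as early clobber (in TableGen this is usually written as an `@earlyclobber` constraint on the operand) tells the allocator not to make that assignment.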
林政宗 via llvm-dev
2020-Jun-26 06:58 UTC
[llvm-dev] How to implement load/store for vector predicate register
Hi,

I am planning to expand the pseudo instructions in XXXTargetLowering::EmitInstrWithCustomInserter(), using temporary virtual registers as operands. If I use virtual registers, do I need to mark them as "early clobber"? I saw that the MIPS backend sometimes marks virtual registers as "early clobber" in EmitInstrWithCustomInserter(). What is the effect of marking a virtual register as "early clobber" before RA?

Thanks,
Jerry

At 2020-06-25 20:29:30, "Hal Finkel" <hfinkel at anl.gov> wrote:
> [...]
> You might, however, need to mark such operands as "early clobber" to
> prevent RA from assigning the same register as an input and output
> (sometimes, depending on how the expanded code uses the registers,
> this is necessary).
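One way to picture what the flag changes before RA is in terms of live ranges. As far as I understand it, an early-clobber def is treated as taking effect before the instruction's inputs are read, so its live range overlaps the live ranges of those inputs and the allocator cannot give them the same register. The C++ below is a toy model of that idea only. The slot numbering is illustrative rather than LLVM's actual SlotIndex encoding, and `Point`, `LiveRange`, `inputKilledAt` and `defAt` are hypothetical names.

```cpp
#include <cassert>

// Two points inside one instruction: an early-clobber def takes effect at
// the EarlyClobber point, uses are read and normal defs written at the
// Register point. Illustrative numbering only.
enum Slot { EarlyClobber = 0, Register = 1 };

struct Point {
  int instr;
  Slot slot;
};

bool before(Point a, Point b) {
  return a.instr != b.instr ? a.instr < b.instr : a.slot < b.slot;
}

// Half-open live range [start, end).
struct LiveRange {
  Point start, end;
};

bool interferes(LiveRange a, LiveRange b) {
  return before(a.start, b.end) && before(b.start, a.end);
}

// Range of an input defined by `defInstr` and last read by `instr`.
LiveRange inputKilledAt(int defInstr, int instr) {
  return {{defInstr, Register}, {instr, Register}};
}

// Range of a value defined by `instr` and last read by `lastUse`.
LiveRange defAt(int instr, bool earlyClobber, int lastUse) {
  assert(lastUse > instr);
  return {{instr, earlyClobber ? EarlyClobber : Register}, {lastUse, Register}};
}
```

In this model, an input killed at instruction 5 and a normal def at instruction 5 do not interfere, so they may share a register. With the early-clobber flag the def's range starts one point earlier, `interferes` returns true, and the two values must live in different registers.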