James R. Byerly via llvm-dev
2015-Nov-12 16:43 UTC
[llvm-dev] Help making 'narrow instruct microcode' Backend
I've been experimenting with llvm/clang as a user for a while now, but now I'm interested in writing my own backend. I'm also developing the target architecture (maybe to go in an fpga eventually) and I'm intentionally making it extremely simple. I think of it as a narrow microcode, because (for example) performing an add requires a sequence of instructions like: set aluin1 = r1 set aluin2 = r2 aluop add set r3 = aluout I've started implementing the backend in clang, and I got this basic example working by modifying my backend's implementation of SelectionDAGISel::Select to handle ISD::ADD and transform it into the above 4-instruction sequence. However, I'd like to expand the architecture to have more than 1 alu, with one set of alu registers for each alu (ex aluAin1, aluAin2, aluAout, and aluBin1, aluBin2, aluBout). I'd also like to have llvm's register allocation and instruction scheduling select which alu to use at what time, but I don't think I can acomplish this with my current approach. One thought I had was making the aluin1/2 and aluout subregisters of a larger register, and then making multiple instances of that register type and a class for the larger registers, but I'm not sure how to tell llvm that it should use the multi-instruction sequence for ADD. Is there a way to do it in the Select function like I did for 1, but with virtual registers so the allocator/scheduler can pick which alu to use? Any information on how I might implement the super/subregister solution or any alternative approaches to implementing a backend for this architecture are greatly appreciated.
escha via llvm-dev
2015-Nov-12 20:21 UTC
[llvm-dev] Help making 'narrow instruct microcode' Backend
> On Nov 12, 2015, at 8:43 AM, James R. Byerly via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > I've been experimenting with llvm/clang as a user for a while now, but now > I'm interested in writing my own backend. I'm also developing the target > architecture (maybe to go in an fpga eventually) and I'm intentionally > making it extremely simple. I think of it as a narrow microcode, because > (for example) performing an add requires a sequence of instructions like: > set aluin1 = r1 > set aluin2 = r2 > aluop add > set r3 = aluoutThis sounds very similar to some sorts of existing VLIW architectures, which may look something like this: // START BUNDLE { // input phase s0 = <some input> s1 = <some input> […] sN = <some input> // phase 0 t0 = <some operation using s registers> t1 = <some operation using s registers> […] tN = <some operation using s registers> // phase 1 // stuff using t registers […] // phase N // stuff using t registers // output phase rX = t0 rY = t1 rZ = t2 } // END BUNDLE Of course, the bundles can be as complex or simple as you like; they could have just one phase and one operation, for example. Either way, this sort of thing where it defines inputs, then defines an instruction, then defines outputs, sounds very much like a problem for VLIW bundles. VLIW tends to be very simple/“close to the metal” insomuch as the bits in the instruction encoding directly map to muxes between the phases, i.e. selecting which register is passed from where to where. —escha
James R. Byerly via llvm-dev
2015-Nov-24 17:40 UTC
[llvm-dev] Help making 'narrow instruct microcode' Backend
> >> On Nov 12, 2015, at 8:43 AM, James R. Byerly via llvm-dev >> <llvm-dev at lists.llvm.org> wrote: >> >> I've been experimenting with llvm/clang as a user for a while now, but >> now >> I'm interested in writing my own backend. I'm also developing the target >> architecture (maybe to go in an fpga eventually) and I'm intentionally >> making it extremely simple. I think of it as a narrow microcode, because >> (for example) performing an add requires a sequence of instructions >> like: >> set aluin1 = r1 >> set aluin2 = r2 >> aluop add >> set r3 = aluout > > This sounds very similar to some sorts of existing VLIW architectures, > which may look something like this: > > // START BUNDLE > { > // input phase > s0 = <some input> > s1 = <some input> > [â¦] > sN = <some input> > // phase 0 > t0 = <some operation using s registers> > t1 = <some operation using s registers> > [â¦] > tN = <some operation using s registers> > // phase 1 > // stuff using t registers > [â¦] > // phase N > // stuff using t registers > // output phase > rX = t0 > rY = t1 > rZ = t2 > } > // END BUNDLE > > Of course, the bundles can be as complex or simple as you like; they could > have just one phase and one operation, for example. Either way, this sort > of thing where it defines inputs, then defines an instruction, then > defines outputs, sounds very much like a problem for VLIW bundles. VLIW > tends to be very simple/âclose to the metalâ insomuch as the bits in > the instruction encoding directly map to muxes between the phases, i.e. > selecting which register is passed from where to where. > > âeschaMy target has a lot in common with VLIW architectures, except for the 'VL' part. I'd like to essentually have similar low-level semantics without bundling instructions in order to take advantage of the llvm scheduling and register allocation capabilities (and also make the target simpler, not having to perform any but the simplest operations). Currently, I have it partially working by defining an ALU as a v4i32 register with 4 subregisters (in1, in2, out, flags), and then in my SelectionDAGISel::Select, I replace a node like 'ISD::ADD' like so: r3 = add r1, r2 into AluReg = temporary from ALURegClass Copy To AluReg, (IMPLICIT_DEF) AluIn1 = ALUIN1 (Copy From AluReg), r1 AluIn2 = ALUIN2 AluIn1, r2 AluResult = ADD AluIn2 res = temporary from GPRRegClass Copy To res (ALUOUT AluResult) r3 = res Where every value/register named Alu* is the same v4i32 alu register in various stages of update. I found the copy of implicit def through an alu seems to help llvm reason about the subreg liveness (using just implicit def, llvm seems to think the regs live forever, and using just a temp register, llvm complains it is used before it is defined), and the copy of the result through a gpr helps with register pressure (presumably because GPRs are more plentiful than alu registers). However, I still have an issue with register allocation. If, in my subclass of TargetLowering, I set the scheduling preference to Sched::RegPressure, things work ok most of the time, but I get extremely poor scheduling because every alu operation uses aluA and computes everything in order exactly as described above. If I use any other scheduling preference, I get much better scheduled code, but llvm moves the ALUIN1r/ALUIN2r instruction pair far from the ADD/ALUOUTr instruction pair and thus creates an excessive number of long live ranges. It then tries to spill the ALU registers, which isn't possible, and everything fails. Any advice on getting any 'real' scheduling to work without introducing unneeded spilling (unneeded because it IS possible to keep the 4 instructions grouped and thus never have an alu register live outside an actual alu operation, and up to N groups can be interleaved when I have N alu registers) would be greatly appreciated. I know I can bundle the instructions, but that (afaik) eliminates the possibility of the instruction scheduler interleaving multiple alu instructions, or mixing alu instructions with non-alu instructions.