Ricky Taylor via llvm-dev
2021-Dec-07 10:07 UTC
[llvm-dev] [RFC] A new CodeEmitterGen infrastructure for variable-length instructions
>From an M68k point-of-view, I like this. I'm a little uneasy about the`operand` node though. Is it possible that there will ever be an operand type which needs multiple encodings? I also wonder whether $dec mode and slicing inputs might be better handled by separate nodes. Would something like `(seq (flip 0b0101010) (slice my_operand.Base, 4, 8))`, be possible? As you say, I suspect that implementing an interpreter-style disassembler generator like the fixed length one would be fairly straight-forward (and much better than the M68k disassembler implementation I provided). Ricky, On Mon, 6 Dec 2021 at 03:34, Min-Yih Hsu <minyihh at uci.edu> wrote:> (This is a long proposal. If you prefer, here is the web version: > https://gist.github.com/mshockwave/66e98d099256deefc062633909bb7b5b) > > ## Background > CodeEmitterGen is a TableGen backend that generates instruction encoder > functions for `MCCodeEmitter` from a concise TableGen syntax. It is, > however, almost exclusively designed for targets that use fixed-length > instructions. It's nearly impossible to use this infrastructure to describe > instruction encoding scheme for ISAs with variable-length instructions, > like X86 and M68k. > > To have a better understanding on this problem, let's look at an example. > For a fixed-length instruction ISA, developers write the following TableGen > syntax to describe an instruction encoding: > ``` > class MyInst<(outs GR64:$dst), (ins GR64, i16imm:$imm)> : Instruction { > bits<32> Inst; > > bits<4> dst; > bits<16> imm; > let Inst{31-28} = 0b1011; > ... > let Inst{19-16} = dst; > let Inst{15-0} = imm; > } > ``` > The `Inst` field tells us the length of an instruction -- 32 bits in this > case. Each bit in this field describes the encoded value, which is either a > concrete value or symbolic one like `dst` and `imm` in the example above. > The `dst` and `imm` variables correspond to the output operand (`$dst`) and > the second input operand (`$imm`), respectively. Meaning, the encoder > function (generated by CodeEmitterGen) will eventually insert the encoding > for these two operands into the right bit ranges (bit 19\~16 for `dst` and > 15\~0 for `imm`). > > Though this TableGen syntax fits well for fixed-length instructions, it > imposes some difficulties to instructions with variable length and memory > poerands with complex addressing modes: > 1. The bit width of the `Inst` field is fixed. Though we can declare the > field with maximum instruction size in the ISA, it requires extra code to > adjust the final instruction size. > 2. Operand encoding can only be placed at fixed bit positions. However, > the size of an operand in a variable-length instruction might vary. > 3. In the situation where a single logical operand is consisting of > multiple MachineOperand-s / MCOperand-s, the current syntax cannot > reference a sub-operand. Which means we can only reference the entire > logical operand at places where we actually should put sub-operands. Making > the TG code less readable and bring more burden to the operand encoding > functions (because they don't know which sub-operand to encode). > > In short, we need a more flexible CodeEmitterGen infrastructure for > variable-length instructions: describe the instruction encoding in a > "position independent" fashion and be able to reference sub-operands with > ease. > > ## Proposal > We propose a new infrastructure, called VarLenCodeEmitterGen, to solve the > aforementioned shortcomings. It is consisting of new TableGen syntax and > some modifications to the existing CodeEmitterGen TableGen backend. > > Suppose we are dealing with an instruction `MyVarInst`: > ``` > class MyMemOperand<dag sub_ops> : Operand<iPTR> { > let MIOperandInfo = sub_ops; > } > > class MyVarInst<MyMemOperand memory_op> : Instruction { > let OutOperandList = (outs GR64:$dst); > let InOperandList = (ins memory_operand:$src); > } > ``` > It has the following encoding format: > ``` > 15 8 0 > ---------------------------------------------------- > | Fixed bits | Sub-operand 0 in source operand | > ---------------------------------------------------- > X 16 > ---------------------------------------------------- > | Sub-operand 1 in source operand | > ---------------------------------------------------- > X + 4 X + 1 > ------------------------------------ > | Destination register | > ------------------------------------ > ``` > We have two different kinds of memory operands: > ``` > def MemOp16 : MyMemOperand<(ops GR64:$reg, i16imm:$offset)>; > def MemOp16 : MyMemOperand<(ops GR64:$reg, i32imm:$offset)>; > > def FOO16 : MyVarInst<MemOp16>; > def FOO32 : MyVarInst<MemOp32>; > ``` > So the size of `FOO16` and `FOO32` will be 36 and 52 bits, respectively. > > To express the encoding, first, we modify `MyVarInst` and `MyMemOperand`: > ``` > class MyMemOperand<dag sub_ops> : Operand<iPTR> { > let MIOperandInfo = sub_ops; > dag Base; > dag Extension; > } > > class MyVarInst<MyMemOperand memory_op> : Instruction { > dag Inst; > > let OutOperandList = (outs GR64:$dst); > let InOperandList = (ins memory_op:$src); > > let Inst = (seq > (seq:$dec /*Fixed bits*/0b10110111, memory_op.Base), > memory_op.Extension, > // Destination register > (operand "$dst", 4) > ); > } > ``` > Then, we use a slightly different representation > for `MemOp16` and `MemOp32`: > ``` > class MemOp16<string op_name> : MyMemOperand<(ops GR64:$reg, > i16imm:$offset)> { > let Base = (operand "$"#op_name#".reg", 8); > let Extension = (operand "$"#op_name#".offset", 16); > } > > class MemOp32<string op_name> : MyMemOperand<(ops GR64:$reg, > i32imm:$offset)> { > let Base = (operand "$"#op_name#".reg", 8); > let Extension = (operand "$"#op_name#".offset", 32); > } > > def FOO16 : MyVarInst<MemOp16<"src">>; > def FOO32 : MyVarInst<MemOp32<"src">>; > ``` > > This new TableGen syntax uses `dag` rather than `bits<N>` for > the `Inst` field. Allowing instructions to place their operand (and > sub-operand) encodings without worrying about the actual bit positions. The > new syntax is underpinned by two new DAG operators: `seq` and `operand`. > > The `seq` operator sequentially places its arguments -- fragments of > encoding -- from LSB to MSB. If the operator is "tagged" by `$dec`, it goes > from MSB to LSB instead. The `operand` operator references the encoding of > an operand. Its first DAG argument is a string referencing the name of an > operand in either `InOperandList` or `OutOperandList` of an instruction. We > can also reference an sub-operand using syntax like `$<operand > name>.<sub-operand name>`. The second DAG argument for `operand` is the bit > width of the encoded operand. The other variant of `operand` is having two > arguments instead of one that follow the operand referencing string. More > specifically: > ``` > (operand "$src.reg", 8, 4) > ``` > In this case, 8 and 4 represents a bit range -- high bit and low bit, > respectively -- to the encoded `$src.reg` operand. > > Finally, a new sub-component added to the existing CodeEmitterGen TableGen > backend, VarLenCodeEmitterGen, will turn the above syntax into a C++ > encoder function -- `MCCodeEmitter::getBinaryCodeForInstr` -- that uses the > same mechanism as the fixed-length instruction version (except few details, > like it always uses APInt to store the result). > > We think the proposed solution has the following advantages: > - Flexible and versatile in terms of expressing instruction encodings. > - The TableGen syntax is easy to read, write and understand. > - Only adds a few new TableGen syntax. > - Tightly integrated with the existing CodeEmitterGen. > > ### Previous approaches > Both X86 and M68k -- the only two LLVM targets with variable-length > instructions -- are using custom instruction encoders. X86 leverages > TSFlags in `MCInst` to carry encoding info. Simply speaking, X86 enumerates > and numbers every possible combinations of operands and stores the > corresponding index into a segment of TSFlags for an instruction. This > approach, of course, requires none trivial amount of workforce to maintain. > > M68k, on the other hand, uses an obscured infrastructure called code > beads. It is conceptually similar to the VarLenCodeEmitterGen we're > proposing here -- concatenating encoding fragments. Except that the syntax > is bulky and it uses too many specialized TableGen infrastructures, > including a separate TableGen backend, that make the maintainence really > really hard. > > ## Patches > TableGen modifications: https://reviews.llvm.org/D115128 > > ## FAQ > - Do I need to toggle some flags -- either a command line flag or a > TableGen bit field -- to use the new code emitter scheme? > - No, having a `dag` type `Inst` field will automatically opt-in this > new code emitter scheme. > - Can I adopt this for fixed-length instructions? > - Absolutely yes. But it's not recommended because CodeEmitterGen can > generate more optimal encoder functions for fixed-length instructions. The > TableGen syntax of CodeEmitterGen makes more sense for fixed-length > instructions, too. > - Can X86 adopt this infrastructure? > - Theoritically, yes (In practice? I dunno). > - What about the disassembler? Can we TableGen-enerate the corresponding > disassembling functions? > - Since we have a structural description of the encoded instruction, > it's probably easier to create a disassembler from the new TableGen syntax. > But I haven't worked on that yet. > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20211207/5ee00468/attachment.html>
Min-Yih Hsu via llvm-dev
2021-Dec-07 11:54 UTC
[llvm-dev] [RFC] A new CodeEmitterGen infrastructure for variable-length instructions
Thanks for the feedbacks!> On Dec 7, 2021, at 6:07 PM, Ricky Taylor <rickytaylor26 at gmail.com> wrote: > > From an M68k point-of-view, I like this. I'm a little uneasy about the `operand` node though. Is it possible that there will ever be an operand type which needs multiple encodings?Do you mean an operand or sub-operand might have different encodings depend on its context / bit positions? I don’t know about other (future) architectures, M68k doesn’t have this issue: Each memory operand is consisting of (multiple) primitive construction(s) like registers and immediate, that is, sub-operands. While a sub-operand might be placed in arbitrary positions, each sub-operand has a fixed encoding. I can think of a solution for operands / sub-operands whose encodings depend on its context: passing custom encoder function name into `operand`. So if you write: ``` (operand “$foo”, 4, “encodeFoo”) ``` The foo operand at this particular bit position will be encoded by `encodeFoo`. (The current CodeEmitterGen has a similar stuff — via the `EncoderMethod` field in each operand. But that one is bit-position-invariant. So an operand can only use a single custom encoder regardless of its context / bit-position) BTW, this again shows the advantage of using DAG for carrying encoding info: It’s easy to augment with new parameters / features.> > I also wonder whether $dec mode and slicing inputs might be better handled by separate nodes. Would something like `(seq (flip 0b0101010) (slice my_operand.Base, 4, 8))`, be possible?IIUC, you’re proposing to replace`(seq (seq:$dec 0b0101010), (operand “…”, 4, 8))` with the new syntax. I think it’s a good idea because then we’re not overloading a single DAG operator. It’s trivial to change, too. Nevertheless, I haven’t implemented `seq:$dec` for fixed bits value so actually you can’t write `(seq:$dec 0b0101010)` (This is not hard to implement though).> > As you say, I suspect that implementing an interpreter-style disassembler generator like the fixed length one would be fairly straight-forward (and much better than the M68k disassembler implementation I provided).Good to hear! Cheers, -Min> > Ricky, > > On Mon, 6 Dec 2021 at 03:34, Min-Yih Hsu <minyihh at uci.edu <mailto:minyihh at uci.edu>> wrote: > (This is a long proposal. If you prefer, here is the web version: https://gist.github.com/mshockwave/66e98d099256deefc062633909bb7b5b <https://gist.github.com/mshockwave/66e98d099256deefc062633909bb7b5b>) > > ## Background > CodeEmitterGen is a TableGen backend that generates instruction encoder functions for `MCCodeEmitter` from a concise TableGen syntax. It is, however, almost exclusively designed for targets that use fixed-length instructions. It's nearly impossible to use this infrastructure to describe instruction encoding scheme for ISAs with variable-length instructions, like X86 and M68k. > > To have a better understanding on this problem, let's look at an example. For a fixed-length instruction ISA, developers write the following TableGen syntax to describe an instruction encoding: > ``` > class MyInst<(outs GR64:$dst), (ins GR64, i16imm:$imm)> : Instruction { > bits<32> Inst; > > bits<4> dst; > bits<16> imm; > let Inst{31-28} = 0b1011; > ... > let Inst{19-16} = dst; > let Inst{15-0} = imm; > } > ``` > The `Inst` field tells us the length of an instruction -- 32 bits in this case. Each bit in this field describes the encoded value, which is either a concrete value or symbolic one like `dst` and `imm` in the example above. The `dst` and `imm` variables correspond to the output operand (`$dst`) and the second input operand (`$imm`), respectively. Meaning, the encoder function (generated by CodeEmitterGen) will eventually insert the encoding for these two operands into the right bit ranges (bit 19\~16 for `dst` and 15\~0 for `imm`). > > Though this TableGen syntax fits well for fixed-length instructions, it imposes some difficulties to instructions with variable length and memory poerands with complex addressing modes: > 1. The bit width of the `Inst` field is fixed. Though we can declare the field with maximum instruction size in the ISA, it requires extra code to adjust the final instruction size. > 2. Operand encoding can only be placed at fixed bit positions. However, the size of an operand in a variable-length instruction might vary. > 3. In the situation where a single logical operand is consisting of multiple MachineOperand-s / MCOperand-s, the current syntax cannot reference a sub-operand. Which means we can only reference the entire logical operand at places where we actually should put sub-operands. Making the TG code less readable and bring more burden to the operand encoding functions (because they don't know which sub-operand to encode). > > In short, we need a more flexible CodeEmitterGen infrastructure for variable-length instructions: describe the instruction encoding in a "position independent" fashion and be able to reference sub-operands with ease. > > ## Proposal > We propose a new infrastructure, called VarLenCodeEmitterGen, to solve the aforementioned shortcomings. It is consisting of new TableGen syntax and some modifications to the existing CodeEmitterGen TableGen backend. > > Suppose we are dealing with an instruction `MyVarInst`: > ``` > class MyMemOperand<dag sub_ops> : Operand<iPTR> { > let MIOperandInfo = sub_ops; > } > > class MyVarInst<MyMemOperand memory_op> : Instruction { > let OutOperandList = (outs GR64:$dst); > let InOperandList = (ins memory_operand:$src); > } > ``` > It has the following encoding format: > ``` > 15 8 0 > ---------------------------------------------------- > | Fixed bits | Sub-operand 0 in source operand | > ---------------------------------------------------- > X 16 > ---------------------------------------------------- > | Sub-operand 1 in source operand | > ---------------------------------------------------- > X + 4 X + 1 > ------------------------------------ > | Destination register | > ------------------------------------ > ``` > We have two different kinds of memory operands: > ``` > def MemOp16 : MyMemOperand<(ops GR64:$reg, i16imm:$offset)>; > def MemOp16 : MyMemOperand<(ops GR64:$reg, i32imm:$offset)>; > > def FOO16 : MyVarInst<MemOp16>; > def FOO32 : MyVarInst<MemOp32>; > ``` > So the size of `FOO16` and `FOO32` will be 36 and 52 bits, respectively. > > To express the encoding, first, we modify `MyVarInst` and `MyMemOperand`: > ``` > class MyMemOperand<dag sub_ops> : Operand<iPTR> { > let MIOperandInfo = sub_ops; > dag Base; > dag Extension; > } > > class MyVarInst<MyMemOperand memory_op> : Instruction { > dag Inst; > > let OutOperandList = (outs GR64:$dst); > let InOperandList = (ins memory_op:$src); > > let Inst = (seq > (seq:$dec /*Fixed bits*/0b10110111, memory_op.Base), > memory_op.Extension, > // Destination register > (operand "$dst", 4) > ); > } > ``` > Then, we use a slightly different representation for `MemOp16` and `MemOp32`: > ``` > class MemOp16<string op_name> : MyMemOperand<(ops GR64:$reg, i16imm:$offset)> { > let Base = (operand "$"#op_name#".reg", 8); > let Extension = (operand "$"#op_name#".offset", 16); > } > > class MemOp32<string op_name> : MyMemOperand<(ops GR64:$reg, i32imm:$offset)> { > let Base = (operand "$"#op_name#".reg", 8); > let Extension = (operand "$"#op_name#".offset", 32); > } > > def FOO16 : MyVarInst<MemOp16<"src">>; > def FOO32 : MyVarInst<MemOp32<"src">>; > ``` > > This new TableGen syntax uses `dag` rather than `bits<N>` for the `Inst` field. Allowing instructions to place their operand (and sub-operand) encodings without worrying about the actual bit positions. The new syntax is underpinned by two new DAG operators: `seq` and `operand`. > > The `seq` operator sequentially places its arguments -- fragments of encoding -- from LSB to MSB. If the operator is "tagged" by `$dec`, it goes from MSB to LSB instead. The `operand` operator references the encoding of an operand. Its first DAG argument is a string referencing the name of an operand in either `InOperandList` or `OutOperandList` of an instruction. We can also reference an sub-operand using syntax like `$<operand name>.<sub-operand name>`. The second DAG argument for `operand` is the bit width of the encoded operand. The other variant of `operand` is having two arguments instead of one that follow the operand referencing string. More specifically: > ``` > (operand "$src.reg", 8, 4) > ``` > In this case, 8 and 4 represents a bit range -- high bit and low bit, respectively -- to the encoded `$src.reg` operand. > > Finally, a new sub-component added to the existing CodeEmitterGen TableGen backend, VarLenCodeEmitterGen, will turn the above syntax into a C++ encoder function -- `MCCodeEmitter::getBinaryCodeForInstr` -- that uses the same mechanism as the fixed-length instruction version (except few details, like it always uses APInt to store the result). > > We think the proposed solution has the following advantages: > - Flexible and versatile in terms of expressing instruction encodings. > - The TableGen syntax is easy to read, write and understand. > - Only adds a few new TableGen syntax. > - Tightly integrated with the existing CodeEmitterGen. > > ### Previous approaches > Both X86 and M68k -- the only two LLVM targets with variable-length instructions -- are using custom instruction encoders. X86 leverages TSFlags in `MCInst` to carry encoding info. Simply speaking, X86 enumerates and numbers every possible combinations of operands and stores the corresponding index into a segment of TSFlags for an instruction. This approach, of course, requires none trivial amount of workforce to maintain. > > M68k, on the other hand, uses an obscured infrastructure called code beads. It is conceptually similar to the VarLenCodeEmitterGen we're proposing here -- concatenating encoding fragments. Except that the syntax is bulky and it uses too many specialized TableGen infrastructures, including a separate TableGen backend, that make the maintainence really really hard. > > ## Patches > TableGen modifications: https://reviews.llvm.org/D115128 <https://reviews.llvm.org/D115128> > > ## FAQ > - Do I need to toggle some flags -- either a command line flag or a TableGen bit field -- to use the new code emitter scheme? > - No, having a `dag` type `Inst` field will automatically opt-in this new code emitter scheme. > - Can I adopt this for fixed-length instructions? > - Absolutely yes. But it's not recommended because CodeEmitterGen can generate more optimal encoder functions for fixed-length instructions. The TableGen syntax of CodeEmitterGen makes more sense for fixed-length instructions, too. > - Can X86 adopt this infrastructure? > - Theoritically, yes (In practice? I dunno). > - What about the disassembler? Can we TableGen-enerate the corresponding disassembling functions? > - Since we have a structural description of the encoded instruction, it's probably easier to create a disassembler from the new TableGen syntax. But I haven't worked on that yet. >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20211207/19272946/attachment.html>