thr3ads.net - llvm dev - [llvm-dev] [RFC] A new CodeEmitterGen infrastructure for variable-length instructions [Dec 2021]


Min-Yih Hsu via llvm-dev

2021-Dec-06 03:34 UTC

[llvm-dev] [RFC] A new CodeEmitterGen infrastructure for variable-length instructions

(This is a long proposal. If you prefer, here is the web version:
https://gist.github.com/mshockwave/66e98d099256deefc062633909bb7b5b)

## Background
CodeEmitterGen is a TableGen backend that generates instruction encoder
functions for `MCCodeEmitter` from a concise TableGen syntax. It is, however,
almost exclusively designed for targets that use fixed-length instructions.
It's nearly impossible to use this infrastructure to describe the instruction
encoding schemes of ISAs with variable-length instructions, like X86 and M68k.

To better understand this problem, let's look at an example.
For a fixed-length instruction ISA, developers write the following TableGen
syntax to describe an instruction encoding:
```
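class MyInst : Instruction {
    // `$src` is a placeholder name for the first (unnamed) input operand.
    let OutOperandList = (outs GR64:$dst);
    let InOperandList  = (ins GR64:$src, i16imm:$imm);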
    bits<32> Inst;

    bits<4> dst;
    bits<16> imm;
    let Inst{31-28} = 0b1011;
    ...
    let Inst{19-16} = dst;
    let Inst{15-0} = imm;
}
```
The `Inst` field tells us the length of an instruction -- 32 bits in this case.
Each bit in this field describes the encoded value, which is either a concrete
value or a symbolic one like `dst` and `imm` in the example above. The `dst` and
`imm` variables correspond to the output operand (`$dst`) and the second input
operand (`$imm`), respectively. This means the encoder function (generated by
CodeEmitterGen) will eventually insert the encoding for these two operands into
the right bit ranges (bits 19-16 for `dst` and 15-0 for `imm`).
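
As a rough illustration of what such a generated encoder ends up doing, here is
a minimal Python sketch of the bit insertion for `MyInst`. The function name
`encode_my_inst` is hypothetical, and bits 27-20 (elided as `...` in the
example) are simply left as zero; the real encoder is generated C++ code.

```python
def encode_my_inst(dst: int, imm: int) -> int:
    """Pack a 4-bit register number and a 16-bit immediate into a 32-bit word."""
    if not 0 <= dst < (1 << 4):
        raise ValueError("dst must fit in 4 bits")
    if not 0 <= imm < (1 << 16):
        raise ValueError("imm must fit in 16 bits")
    word = 0b1011 << 28   # Inst{31-28} = 0b1011
    # Bits 27-20 are elided in the example, so this sketch leaves them as zero.
    word |= dst << 16     # Inst{19-16} = dst
    word |= imm           # Inst{15-0}  = imm
    return word


encoded = encode_my_inst(dst=0x3, imm=0xBEEF)
print(f"{encoded:#010x}")  # 0xb003beef
```

Every operand lands at a position that is known when the TableGen file is
written, which is exactly the assumption that breaks down for variable-length
instructions.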

Though this TableGen syntax works well for fixed-length instructions, it poses
some difficulties for variable-length instructions and for memory operands with
complex addressing modes:
  1. The bit width of the `Inst` field is fixed. Though we can declare the field
with maximum instruction size in the ISA, it requires extra code to adjust the
final instruction size.
  2. Operand encoding can only be placed at fixed bit positions. However, the
size of an operand in a variable-length instruction might vary.
  3. When a single logical operand consists of multiple MachineOperand-s /
MCOperand-s, the current syntax cannot reference a sub-operand. This means we
can only reference the entire logical operand in places where we should actually
put sub-operands, which makes the TG code less readable and puts more burden on
the operand encoding functions (because they don't know which sub-operand to
encode).

In short, we need a more flexible CodeEmitterGen infrastructure for
variable-length instructions: describe the instruction encoding in a
"position independent" fashion and be able to reference sub-operands
with ease.

## Proposal
We propose a new infrastructure, called VarLenCodeEmitterGen, to address the
aforementioned shortcomings. It consists of new TableGen syntax and some
modifications to the existing CodeEmitterGen TableGen backend.

Suppose we are dealing with an instruction `MyVarInst`:
```
class MyMemOperand<dag sub_ops> : Operand<iPTR> {
    let MIOperandInfo = sub_ops;
}

class MyVarInst<MyMemOperand memory_op> : Instruction {
    let OutOperandList = (outs GR64:$dst);
    let InOperandList  = (ins memory_op:$src);
}
```
It has the following encoding format:
```
15             8                                   0
----------------------------------------------------
|  Fixed bits  |  Sub-operand 0 in source operand  |
----------------------------------------------------
X                                                 16
----------------------------------------------------
|         Sub-operand 1 in source operand          |
----------------------------------------------------
                X + 4                          X + 1
                ------------------------------------
                |       Destination register       |
                ------------------------------------
```
We have two different kinds of memory operands:
```
def MemOp16 : MyMemOperand<(ops GR64:$reg, i16imm:$offset)>;
def MemOp32 : MyMemOperand<(ops GR64:$reg, i32imm:$offset)>;

def FOO16 : MyVarInst<MemOp16>;
def FOO32 : MyVarInst<MemOp32>;
```
So the size of `FOO16` and `FOO32` will be 36 and 52 bits, respectively.
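
The arithmetic behind those two sizes can be spelled out as follows. This is
only a sketch; the constant and function names are made up for illustration,
and the widths are read off the encoding diagram above.

```python
FIXED_BITS = 8  # fixed bits, bits 15-8 of the first fragment
BASE_BITS = 8   # sub-operand 0 of the source operand ($reg), bits 7-0
DST_BITS = 4    # destination register, bits X+1 .. X+4


def inst_size_bits(offset_bits: int) -> int:
    """Total encoded size for a memory operand whose offset is `offset_bits` wide."""
    return FIXED_BITS + BASE_BITS + offset_bits + DST_BITS


print(inst_size_bits(16))  # FOO16: 8 + 8 + 16 + 4 = 36
print(inst_size_bits(32))  # FOO32: 8 + 8 + 32 + 4 = 52
```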

To express the encoding, first, we modify `MyVarInst` and `MyMemOperand`:
```
class MyMemOperand<dag sub_ops> : Operand<iPTR> {
    let MIOperandInfo = sub_ops;
    dag Base;
    dag Extension;
}

class MyVarInst<MyMemOperand memory_op> : Instruction {
    dag Inst;

    let OutOperandList = (outs GR64:$dst);
    let InOperandList  = (ins memory_op:$src);

    let Inst = (seq
        (seq:$dec /*Fixed bits*/0b10110111, memory_op.Base),
        memory_op.Extension,
        // Destination register
        (operand "$dst", 4)
    );
}
```
Then, we use a slightly different representation for `MemOp16` and `MemOp32`:
```
class MemOp16<string op_name> : MyMemOperand<(ops GR64:$reg, i16imm:$offset)> {
    let Base = (operand "$"#op_name#".reg", 8);
    let Extension = (operand "$"#op_name#".offset", 16);
}

class MemOp32<string op_name> : MyMemOperand<(ops GR64:$reg, i32imm:$offset)> {
    let Base = (operand "$"#op_name#".reg", 8);
    let Extension = (operand "$"#op_name#".offset", 32);
}

def FOO16 : MyVarInst<MemOp16<"src">>;
def FOO32 : MyVarInst<MemOp32<"src">>;
```

This new TableGen syntax uses `dag` rather than `bits<N>` for the `Inst`
field, allowing instructions to place their operand (and sub-operand) encodings
without worrying about the actual bit positions. The new syntax is underpinned
by two new DAG operators: `seq` and `operand`.

The `seq` operator sequentially places its arguments -- fragments of encoding --
from LSB to MSB. If the operator is "tagged" by `$dec`, it goes from
MSB to LSB instead. The `operand` operator references the encoding of an
operand. Its first DAG argument is a string referencing the name of an operand
in either `InOperandList` or `OutOperandList` of an instruction. We can also
reference a sub-operand using syntax like `$<operand name>.<sub-operand name>`.
The second DAG argument for `operand` is the bit width of the encoded operand.
A second variant of `operand` takes two arguments, instead of one, after the
operand-referencing string. More specifically:
```
(operand "$src.reg", 8, 4)
```
In this case, 8 and 4 represent a bit range -- high bit and low bit,
respectively -- of the encoded `$src.reg` operand.
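
To make the placement rules concrete, here is a small Python model of `seq` and
`operand` applied to the `FOO16` / `FOO32` layout. It is a sketch under stated
assumptions, not the generated encoder: the helper names (`operand`,
`operand_slice`, `seq`, `encode_foo`), the dictionary of already-encoded operand
values, and the reading of the high/low bit range as inclusive are all
illustrative choices.

```python
from typing import Dict, Tuple

Fragment = Tuple[int, int]  # (value, width in bits)


def operand(values: Dict[str, int], name: str, width: int) -> Fragment:
    """Model of (operand "<name>", width): the low `width` bits of the operand."""
    return (values[name] & ((1 << width) - 1), width)


def operand_slice(values: Dict[str, int], name: str, hi: int, lo: int) -> Fragment:
    """Model of (operand "<name>", hi, lo), assuming an inclusive bit range."""
    width = hi - lo + 1
    return ((values[name] >> lo) & ((1 << width) - 1), width)


def seq(*fragments: Fragment, descending: bool = False) -> Fragment:
    """Concatenate fragments LSB to MSB, or MSB to LSB when descending ($dec)."""
    ordered = reversed(fragments) if descending else fragments
    value, width = 0, 0
    for frag_value, frag_width in ordered:
        value |= frag_value << width  # place the fragment above what is already there
        width += frag_width
    return (value, width)


def encode_foo(values: Dict[str, int], offset_bits: int) -> Fragment:
    """Mirror the `Inst` dag of MyVarInst for a given offset width."""
    base = operand(values, "$src.reg", 8)
    extension = operand(values, "$src.offset", offset_bits)
    head = seq((0b10110111, 8), base, descending=True)  # fixed bits end up in bits 15-8
    return seq(head, extension, operand(values, "$dst", 4))


values = {"$src.reg": 0x05, "$src.offset": 0x1234, "$dst": 0x9}
value, width = encode_foo(values, 16)
print(f"{value:#x} ({width} bits)")  # 0x91234b705 (36 bits)
```

Note that the total width falls out of the concatenation itself, so nothing
about the instruction length has to be declared up front.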

Finally, a new sub-component added to the existing CodeEmitterGen TableGen
backend, VarLenCodeEmitterGen, will turn the above syntax into a C++ encoder
function -- `MCCodeEmitter::getBinaryCodeForInstr` -- that uses the same
mechanism as the fixed-length instruction version (except for a few details,
such as always using APInt to store the result).

We think the proposed solution has the following advantages:
  - Flexible and versatile in terms of expressing instruction encodings.
  - The TableGen syntax is easy to read, write and understand.
  - Only adds a small amount of new TableGen syntax.
  - Tightly integrated with the existing CodeEmitterGen.

### Previous approaches
Both X86 and M68k -- the only two LLVM targets with variable-length instructions
-- are using custom instruction encoders. X86 leverages TSFlags in `MCInstrDesc`
to carry encoding info. Simply speaking, X86 enumerates and numbers every
possible combination of operands and stores the corresponding index into a
segment of TSFlags for an instruction. This approach, of course, requires a
non-trivial amount of work to maintain.

M68k, on the other hand, uses an obscure infrastructure called code beads. It
is conceptually similar to the VarLenCodeEmitterGen we're proposing here --
concatenating encoding fragments -- except that the syntax is bulky and it uses
too much specialized TableGen infrastructure, including a separate TableGen
backend, which makes maintenance really hard.

## Patches
TableGen modifications: https://reviews.llvm.org/D115128

## FAQ
  - Do I need to toggle some flags -- either a command line flag or a TableGen
bit field -- to use the new code emitter scheme?
    - No, having a `dag`-typed `Inst` field automatically opts in to the new
code emitter scheme.
  - Can I adopt this for fixed-length instructions?
    - Absolutely yes. But it's not recommended, because CodeEmitterGen can
generate more efficient encoder functions for fixed-length instructions. The
TableGen syntax of CodeEmitterGen makes more sense for fixed-length
instructions, too.
  - Can X86 adopt this infrastructure?
    - Theoretically, yes (In practice? I dunno).
  - What about the disassembler? Can we TableGen-erate the corresponding
disassembling functions?
    - Since we have a structural description of the encoded instruction,
it's probably easier to create a disassembler from the new TableGen syntax.
But I haven't worked on that yet.


Ricky Taylor via llvm-dev

2021-Dec-07 10:07 UTC


From an M68k point-of-view, I like this. I'm a little uneasy about the
`operand` node though. Is it possible that there will ever be an operand
type which needs multiple encodings?

I also wonder whether $dec mode and slicing inputs might be better handled
by separate nodes. Would something like `(seq (flip 0b0101010) (slice
my_operand.Base, 4, 8))`, be possible?

As you say, I suspect that implementing an interpreter-style disassembler
generator like the fixed-length one would be fairly straightforward (and
much better than the M68k disassembler implementation I provided).

Ricky,


Min-Yih Hsu via llvm-dev

2021-Dec-09 14:04 UTC


FYI: As a preview for this new CodeEmitterGen component, I’ve refactored some of
the M68k instructions using VarLenCodeEmitterGen:
https://reviews.llvm.org/D115234

-Min
