A question for LLVM code generator developers:

After having read through "The LLVM Target-Independent Code Generator" [1] I'm unclear about what precisely the objects MCInst and MCOperand represent. They sit in the space between assembly syntax and binary encodings, but which are they modeling? For example, a Thumb 2 branch instruction 'b' takes an immediate. That syntax "b #1234" can map to a couple different encodings. If it is an even number between -2048 and 2046, it can be encoded with a 16-bit instruction, otherwise a 32-bit instruction. If the MC objects are to model the syntax, then one would expect both encodings to have identical values in the MCOperand, a 32-bit signed integer. On the other hand, if MC objects are to model the encoding, one would expect the MCOperand for the 16-bit encoding to contain a number between -1024 and 1023. Which one is it?

My intuition says the MCOperand should model the assembly syntax and contain the 32-bit signed integer, and that the EncoderMethod and DecoderMethod are responsible for mapping that high-level number to the low-level binary representation. If, however, the MCOperand models the encoding, then EncoderMethod and DecoderMethod glue need not exist, and that bit-twiddling logic would be pushed to whoever creates the MCOperand.

Looking at the Thumb backend, I believe it has been written assuming the MC objects model the syntax, not the encoding, which matches my intuition. There has been some discussion on the llvm-commits list encouraging us to store the encoded value in the MCOperand. The justification, as I understand it, is that the MCOperand should not contain values that cannot be encoded. This effectively means that the MCOperands would be modeling the binary encoding, not the syntax. Are folks making this transition in other backends as well?

[1] http://llvm.org/docs/CodeGenerator.html

Thanks,
Greg
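To make the two candidate representations concrete, here is a minimal C++ sketch of the arithmetic for the 16-bit branch case described above. The function names `encodeBranch16` and `decodeBranch16` are hypothetical and are not LLVM APIs. The sketch assumes the encoded field is simply the byte offset divided by two, and it ignores PC-relative adjustment and label resolution.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper (not an LLVM API): maps the syntax-level byte offset
// to the value stored in the 16-bit encoding's immediate field.
// Returns std::nullopt when the offset does not fit the 16-bit form.
std::optional<int32_t> encodeBranch16(int32_t byteOffset) {
  if (byteOffset % 2 != 0)
    return std::nullopt; // odd offsets have no encoding
  if (byteOffset < -2048 || byteOffset > 2046)
    return std::nullopt; // would need the 32-bit instruction instead
  return byteOffset / 2; // field value in [-1024, 1023]
}

// Hypothetical inverse: maps the encoded field back to the byte offset
// that the assembly syntax would show.
int32_t decodeBranch16(int32_t field) {
  assert(field >= -1024 && field <= 1023 && "field must fit in 11 signed bits");
  return field * 2;
}
```

Under the "model the syntax" view, the MCOperand would hold the argument of `encodeBranch16`; under the "model the encoding" view, it would hold the result.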
Owen is correct in his descriptions. The MCOperand values are intended to model the instruction encoding. Where that doesn't match the assembly syntax, the asm parser (and codegen) and the instruction printer are responsible for encoding/decoding the values.

For targets that predate the MC layer, this isn't always the case, leading to things being a bit confusing when just reading the code. Any new targets should absolutely consider the instruction encoding to be the canonical representation and map assembly syntax onto that, not the other way around.

Regards,
-Jim

On Sep 26, 2012, at 11:26 AM, Greg Fitzgerald <garious at gmail.com> wrote:
> [...]
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> the MCOperand should not contain values that cannot be encoded

In the case of pre-encoding a shifted immediate, do we acknowledge that we've only moved the invalid encodings from ones that set the bottom bit to ones that set the top? Is there a backend that is implemented in this style I can use as a reference?

> The MCOperand values are intended to model the instruction
> encoding. Where that doesn't match the assembly syntax,
> the asm parser (and codegen) and the instruction printer are
> responsible for encoding/decoding the values.

As my colleague and I try to implement instructions in the recommended style, we are finding it harder under the constraint that the MCOperand be pre-encoded. I've attached a diagram of my understanding of the code flow when using the recommended style versus using the MCOperand to model the syntax. How far off am I?

[See attached]

In the diagram with pre-encoding, a shared function EncodeImm() has to be referenced from 3 locations. As a newcomer to LLVM going after a simple encoding bug, I wasn't expecting to have to grok every client of MCOperand just to fix how it is encoded.

To pre-encode, it seems the .td file needs to use a custom operand that inherits from a generic one for the sole purpose of routing to the shared encoding function. Is there a better alternative for getting from the LLVM target-independent IR to the pre-encoded MCOperand?

Thanks,
Greg

On Wed, Sep 26, 2012 at 2:02 PM, Jim Grosbach <grosbach at apple.com> wrote:
> [...]
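The "bottom bit versus top" remark can be illustrated with a small C++ sketch. It assumes the operand lives in a 64-bit signed integer and reuses the ranges from the 16-bit branch example in the first message. The names `fitsSyntaxLevel` and `fitsPreEncoded` are hypothetical and not part of LLVM.

```cpp
#include <cstdint>

// Syntax-level view (assumed): the operand is a byte offset. It is
// encodable in the 16-bit form only if it is even and in [-2048, 2046].
constexpr bool fitsSyntaxLevel(int64_t v) {
  return v % 2 == 0 && v >= -2048 && v <= 2046;
}

// Pre-encoded view (assumed): the operand is the already-shifted field.
// Every low bit pattern is legal, but the value must stay in [-1024, 1023].
constexpr bool fitsPreEncoded(int64_t v) {
  return v >= -1024 && v <= 1023;
}

// In both views the wide container can hold values that cannot be encoded:
// 1235 is rejected because of its bottom bit, 1024 because of its magnitude.
static_assert(!fitsSyntaxLevel(1235), "odd byte offset is not encodable");
static_assert(!fitsPreEncoded(1024), "field overflows 11 signed bits");
```

Either way a range check is still needed somewhere; pre-encoding changes which values are rejected, not whether rejection can happen.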