The following is a brief proposal for annotated assembly (and disassembly)
output. Kevin Enderby and I have been discussing this a bit and are interested
in getting broader feedback from interested folks.
    LLVM Rich Assembly Output
LLVM's (dis)assembly output is currently very raw. Consumers have limited
ability to introspect the instructions' textual representation or to
reformat for a more user-friendly display. A lot of the actual instruction
semantics are contained in the MCInstrDesc for the opcode, but that's not
sufficient to reference into individual portions of the instruction text. For
clients like disassemblers, list file generators, and pretty-printers, more is
necessary than the raw instructions and the ability to print them.
The intent is for the vast majority of the new functionality to not require new
APIs, but to be in the assembly text itself via markup annotations. The markup
is simple enough in syntax to be robust even in the case of version mismatches
between consumers and producers. That is, the syntax generally does not carry
semantics beyond "this text has an annotation," so consumers can
simply ignore annotations they do not understand or do not care about.
** Instruction Annotations
Annotated assembly display will supply contextual markup to help clients more
efficiently implement things like pretty printers. Most markup will be target
independent, so clients can effectively provide good display without any
target-specific knowledge.
Annotated assembly goes through the normal instruction printer, but optionally
includes contextual tags on portions of the instruction string. An annotation is
any '<' '>' delimited section of text(1).
    annotation: '<' tag-name tag-modifier-list ':' annotated-text '>'
    tag-name: identifier
    tag-modifier-list: comma delimited identifier list
The tag name is an identifier which gives the type of the annotation. For the
first pass, this will be very simple, with memory references, registers, and
immediates having the tag names "mem", "reg", and
"imm", respectively.
The tag modifier list is typically additional target-specific context, such as
register class.
Clients should accept and ignore any tag names or tag modifiers they do not
understand, allowing the annotations to grow in richness without breaking older
clients.
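As a rough illustration of the "accept and ignore" rule, the sketch below
collects the text of every annotation carrying one wanted tag and silently
skips all other tags and all modifiers. The helper name find_tagged and the
"future" tag in the usage note are hypothetical, and the sketch assumes the
input contains no escaped '<' or '>' characters.

```python
def find_tagged(s, wanted):
    """Return the text of every annotation whose tag name equals `wanted`.

    Unknown tag names and all tag modifiers are skipped, never rejected.
    Assumes the input contains no escaped '<' or '>' characters.
    """
    results = []
    open_spans = []  # one (tag, text pieces) pair per currently open annotation
    i = 0
    while i < len(s):
        ch = s[i]
        if ch == "<":
            colon = s.index(":", i)  # the header runs up to the first ':'
            tag = s[i + 1:colon].split(None, 1)[0]
            open_spans.append((tag, []))
            i = colon + 1
        elif ch == ">":
            tag, pieces = open_spans.pop()
            text = "".join(pieces)
            if tag == wanted:
                results.append(text)
            if open_spans:
                # The enclosing annotation also covers this text.
                open_spans[-1][1].append(text)
            i += 1
        else:
            if open_spans:
                open_spans[-1][1].append(ch)
            i += 1
    return results
```

A client that only cares about registers could call
find_tagged("<future x,y:[<reg gpr:r1>]>", "reg") and would get the register
name back even though it has never heard of the enclosing tag.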
For example, an ARM load of a stack-relative location might be annotated as:
    ldr <reg gpr:r0>, <mem regoffset:[<reg gpr:sp>, <imm:#4>]>
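To make the nesting in this example concrete, here is a minimal sketch of a
parser that turns an annotated string into a small tree. The Annotation class
and the parse function are hypothetical names, not part of any LLVM API, and
the sketch deliberately ignores the escaping rule from footnote 1.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Annotation:
    tag: str                 # e.g. "reg", "mem", "imm"
    modifiers: List[str]     # e.g. ["gpr"]; empty when none are given
    children: List[Union[str, "Annotation"]] = field(default_factory=list)

    def text(self) -> str:
        """Return the annotated text with all nested markup removed."""
        return "".join(
            c if isinstance(c, str) else c.text() for c in self.children
        )


def parse(s: str) -> List[Union[str, Annotation]]:
    """Parse an annotated string into text runs and Annotation nodes."""
    root: List[Union[str, Annotation]] = []
    stack = [root]           # stack[-1] is the list currently being filled
    i = 0
    while i < len(s):
        ch = s[i]
        if ch == "<":
            colon = s.index(":", i)  # the header runs up to the first ':'
            header = s[i + 1:colon].split(None, 1)
            mods = header[1].split(",") if len(header) > 1 else []
            node = Annotation(header[0], [m.strip() for m in mods])
            stack[-1].append(node)
            stack.append(node.children)
            i = colon + 1
        elif ch == ">":
            if len(stack) == 1:
                raise ValueError(f"unmatched '>' at offset {i}")
            stack.pop()
            i += 1
        else:
            # Collect a run of plain text up to the next delimiter.
            j = i
            while j < len(s) and s[j] not in "<>":
                j += 1
            stack[-1].append(s[i:j])
            i = j
    if len(stack) != 1:
        raise ValueError("unterminated annotation")
    return root
```

Parsing the line above yields four top-level items: the text "ldr ", a "reg"
node, the text ", ", and a "mem" node whose children include a nested "reg"
node and a nested "imm" node.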
1: For assembly dialects in which '<' and/or '>' are legal
tokens, a literal token is escaped by following immediately with a repeat of the
character.  For example, a literal '<' character is output as
'<<' in an annotated assembly string.
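A sketch of how a client might strip all markup while honouring this escaping
rule follows. The function name strip_markup is hypothetical. Note one
assumption: the proposal does not say how a doubled '>' is told apart from two
adjacent closing delimiters, so this sketch always reads a doubled character
as a literal.

```python
def strip_markup(s: str) -> str:
    """Remove all annotations from `s`, keeping only the assembly text.

    Assumption (not settled by the proposal): a doubled '<' or '>' is always
    an escaped literal, never two adjacent delimiters.
    """
    out = []
    i = 0
    n = len(s)
    while i < n:
        ch = s[i]
        if ch in "<>" and i + 1 < n and s[i + 1] == ch:
            out.append(ch)   # escaped literal: emit one copy, skip both
            i += 2
        elif ch == "<":
            colon = s.find(":", i)
            if colon == -1:
                raise ValueError(f"annotation at offset {i} has no ':'")
            i = colon + 1    # drop the tag name and modifier list
        elif ch == ">":
            i += 1           # drop the closing delimiter
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Under that assumption, strip_markup("<imm:#(1 <<<< 2)>") gives back
"#(1 << 2)", since each literal '<' was doubled by the producer.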
** C API Details
Some intended consumers of this information use the C API, so a new C API
function for the disassembler will be added to disassemble an instruction with
annotations: "LLVMDisasmInstructionAnnotated".
How is the client supposed to make use of this markup information?

At first glance it seems like client code will just devolve into a pile
of regex insanity. Why not use an existing standardized markup, like
XML (not that I'm that fond of XML)?

At a higher level, why not expose an API for iterating over
(potentially annotated) tokens which can be programmatically
inspected. So what you expose to clients is an AnnotatedAsmTok. Given
an AnnotatedAsmTok, they can call "getAnnotation()", or
"getRawText()". A textual representation which can be read into this
form might be useful, but we should provide the parser.

I guess what I think needs a bit more explanation is why you chose to
go the "markup" route, instead of a normal programmatic API. Maybe you
could also include a couple use cases that capture your "vision" for
this functionality, and maybe a tiny bit of sample code doing
something interesting with a very rough initial interface (if it seems
more natural, since you're talking about a C API, you can just assume
bindings and write the example in your scripting language of choice).

-- Sean Silva

On Fri, Oct 12, 2012 at 12:51 PM, Jim Grosbach <grosbach at apple.com> wrote:
> The following is a brief proposal for annotated assembly (and disassembly)
> output. [...]
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Hi Sean,

Thanks for the feedback! Exactly the sort of discussion I was hoping to get
started.

On Oct 12, 2012, at 10:12 AM, Sean Silva <silvas at purdue.edu> wrote:

> How is the client supposed to make use of this markup information?

Target-independent introspection of the assembly. A simple example is
color-coded output in a GUI disassembly display. All registers show up one
color, all memory references another, and immediates yet another, and other
such simple things. More interestingly, the client could use the markup to
simplify implementation of mouse-over introspection of register values without
having to know anything about the assembly syntax. The only target hook
required would be "get the value of the register named 'foo'" since
identifying the register names in the asm string is handled by the markup. Or,
getting a bit fancier, visualizing assembly data flow with def-use chains for
a register being marked with arrows, again likely triggered via mouseover of a
register name.

The key bit here is that this is doable without the client having any
knowledge of the target assembly syntax itself.

> At first glance it seems like client code will just devolve into a pile
> of regex insanity. Why not use an existing standardized markup, like
> XML (not that I'm that fond of XML)?

Plain regex would be a very bad way to handle this. Client code should be very
simple, just looking for the '<' characters to find annotations. A parser to
recognize the markup and ignore it all should be almost trivial.

XML is basically just massive overkill for this. The idea is a lightweight
annotation system that a client can easily strip off while paying attention to
the bits and pieces it cares about.

> At a higher level, why not expose an API for iterating over
> (potentially annotated) tokens which can be programmatically
> inspected. So what you expose to clients is an AnnotatedAsmTok. Given
> an AnnotatedAsmTok, they can call "getAnnotation()", or
> "getRawText()". A textual representation which can be read into this
> form might be useful, but we should provide the parser.

We could. It's just outside the scope of what we're looking to do on the
initial implementation. Note that it does get a bit more complicated since
we're not just annotating tokens, but regions of text, and the annotations can
be (and often will be) nested.

> I guess what I think needs a bit more explanation is why you chose to
> go the "markup" route, instead of a normal programmatic API.

To keep the surface area of the C API as minimal as possible and robust
against changes in what's marked up and how. Consider the interface in
EnhancedDisassembly.h, for an example of what we specifically want to avoid
(and obsolete).

> Maybe you
> could also include a couple use cases that capture your "vision" for
> this functionality, and maybe a tiny bit of sample code doing
> something interesting with a very rough initial interface (if it seems
> more natural, since you're talking about a C API, you can just assume
> bindings and write the example in your scripting language of choice).

Does the description up above sufficiently answer this?

FWIW, one of the bits of example "how do I use this?" code I want as part of
the project is a pretty-printed disassembly. Specifically, llvm-objdump will
produce annotated disassembly and there will be a standalone tool that will
take that text as input and use the markup to produce a pretty-printed output
(as HTML, ANSI color codes or whatever).

A quick real-world example of where this can get used is colorized disassembly
in LLDB without LLDB having to re-implement an assembly parser to do it.

-Jim
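
For readers who want to see the color-coding use case from the reply above in
code, here is a minimal sketch that maps tag names to ANSI color codes. The
palette, the COLORS table, and the colorize function are all assumptions made
for illustration; this is not the standalone tool mentioned in the thread, and
it ignores the escaping rule from footnote 1 of the proposal.

```python
RESET = "\033[0m"

# Hypothetical palette: green registers, yellow immediates, cyan memory refs.
COLORS = {"reg": "\033[32m", "imm": "\033[33m", "mem": "\033[36m"}


def colorize(s, colors=COLORS):
    """Replace annotations in `s` with ANSI color codes chosen by tag name.

    Tags missing from `colors` are stripped without changing the color, so
    unknown annotations are ignored rather than rejected.
    """
    out = []
    stack = []  # color of each open annotation ("" for unknown tags)
    i = 0
    while i < len(s):
        ch = s[i]
        if ch == "<":
            colon = s.index(":", i)  # the header runs up to the first ':'
            tag = s[i + 1:colon].split(None, 1)[0]
            color = colors.get(tag, "")
            stack.append(color)
            out.append(color)
            i = colon + 1
        elif ch == ">":
            if stack.pop():
                # Leave this color and restore the nearest enclosing one.
                out.append(RESET)
                out.append(next((c for c in reversed(stack) if c), ""))
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Printing colorize("ldr <reg gpr:r0>, <mem regoffset:[<reg gpr:sp>, <imm:#4>]>")
on an ANSI-capable terminal shows the registers, the immediate, and the
brackets of the memory reference in different colors, with no knowledge of ARM
syntax in the client.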