Hi --
I'm wondering if some of you old hands might offer me some guidance.
I've been working for some time with a virtual machine platform that is
based loosely on the instruction set of a Forth hardware processor that Charles
Moore designed sometime back. He used what he called a "MISC" or
"Minimal Instruction Set Computer" and the original instruction set I
was working off of (and have modified slightly in the process) had a total of 32
byte code primitives. He chose this limit because in his hardware design he was
encoding each instruction in 5 bits and storing four of them in a 20-bit word.
I've developed my platform for the intel processor family and thus have
32-bit words and store four 8-bit instructions in one 32-bit word. I'm still
only using a fraction of the available 256 codes that can be represented by a
byte (last time I checked I was up to fifty-some of them), but, like Moore, I
rely heavily on the original 32.
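Concretely, unpacking one 32-bit code cell amounts to something like the
following (conceptual only -- the real decode lives in the assembler dispatch
loop, and the byte order within the cell is an internal detail that doesn't
matter for anything below):

    #include <stdint.h>

    /* Conceptual decode of one 32-bit code cell into its four 8-bit
       opcodes (shown low-byte-first here just for illustration). */
    static void decode_cell(uint32_t cell, uint8_t op[4])
    {
        op[0] = (uint8_t)( cell        & 0xFF);
        op[1] = (uint8_t)((cell >>  8) & 0xFF);
        op[2] = (uint8_t)((cell >> 16) & 0xFF);
        op[3] = (uint8_t)((cell >> 24) & 0xFF);
    }
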
I've constructed a typical Forth dictionary mechanism on top of this VM and
the code stream consists of a combination of four types of execution. Like most
Forth environments, there is an interpreter mode that involves immediate lookup
of words in a dictionary and execution of them as they are encountered during
parsing of the input stream. The remaining three execution types involve code
that has been compiled into these dictionary words (which happens when the
execution engine is switched from interpreter to compile mode). As in any
classic Forth system, once these compiled words are created they are stored in
the dictionary and are then available to the interpreter and to any other
compiled words that wish to use them. When executing a compiled word (either by
the interpreter or in the process of executing some other compiled word that
uses it), the execution engine always technically executes byte codes. But the
reason there are three distinct code execution types is that one of the byte
codes is a CALL primitive and is used to invoke higher-level Forth dictionary
words; another byte code is an ASM! primitive that drops into intel assembler
execution mode and executes intel code directly. So a typical compiled word
might be some combination of Forth-level calls to other dictionary words, some
sequence of VM byte codes that ultimately execute VM primitives, and intel
assembler bits that are executed directly by the processor. Just to complete
the picture, each VM byte code is itself implemented by a small hand-coded intel
assembly routine that among other things limits its register use to preserve
those registers reserved by the underlying execution engine. There are only a
few free registers available, as I reserve EAX for the top of the VM data stack,
EDI for the top of the VM return stack, EDX for the Forth code instruction
pointer, ESI for the VM address register, and ESP for the Intel stack pointer.
This leaves EBX, ECX, and EBP for general use, although some of my routines do
push some of the reserved registers for use while they are executing and then
pop them before returning to the dispatch loop.
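In case it helps to picture it, here's a rough C-level model of that register
map (not my actual implementation, which is the registers themselves; the field
names and types are made up for this note, and I've left out where the rest of
each stack lives):

    #include <stdint.h>

    /* Model of the reserved-register conventions, for illustration only. */
    struct VMState {
        uint32_t  tos;   /* EAX: top of the VM data stack         */
        uint32_t  rtos;  /* EDI: top of the VM return stack       */
        uint32_t *ip;    /* EDX: Forth code instruction pointer   */
        uint32_t *a;     /* ESI: VM address register              */
        /* ESP remains the intel stack pointer; EBX, ECX, and EBP are
           free, though primitives may push/pop reserved registers
           around their own bodies.                               */
    };
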
Ok, so I am looking to build a compilation mechanism using LLVM and imagine
leveraging optimizations at two distinct levels. I'm sure there must be
opportunities to optimize my VM byte code sequences (for example, eliminating
OP_PUSH/OP_POP or OP_RPUSH/OP_RPOP sequences, or reordering stack use in
general to make it more efficient). This is obviously something
I will need to do myself and probably won't leverage too much of LLVM to
help with (although if I'm wrong about this and there is support at this
level, I would love to understand what and where it is). Where I see making
most use of LLVM is in optimizing the assembly language sequences that end up
getting executed, whether it's through sequences of VM byte codes or in my direct
assembler language coding (I forgot to mention that I have built an in-line
assembler to be able to include intel assembler instructions in my code
streams). So what I see is, for example, taking a sequence of 20 byte code
instructions that might make up a word's definition and optimizing the
entire sequence of intel assembly instructions that these translate to. This
obviously has the potential to increase code size dramatically, since I could be
turning a sequence of 20 bytes into a much longer sequence of intel code. On
the other hand, the VM-level stack manipulation that makes up a substantial
amount of VM-level code can probably be optimized away through judicious use of
the full intel register set. And other VM instructions can probably benefit
from similar savings. The +, 2*, COM, and address register store and fetch
byte codes can each become a single assembler instruction instead of a call to a
byte code routine. Other savings would likely come from moving what is currently
memory access into registers, eliminating the calls and returns between
successive byte code routines, converting tail calls to jumps, etc. So in many
cases the increase in code size won't be so dramatic and will be more than worth
it from an execution cost standpoint.
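To make that concrete, here is the shape of the translation I'm imagining,
written against the LLVM C++ API (purely a sketch: the opcode names, the
int32-in/int32-out signature, and the idea of keeping a compile-time stack of
llvm::Value* are placeholders I made up for this mail, and the header locations
tend to move around between LLVM releases):

    #include <vector>
    #include "llvm/LLVMContext.h"
    #include "llvm/Module.h"
    #include "llvm/DerivedTypes.h"
    #include "llvm/Support/IRBuilder.h"

    using namespace llvm;

    enum ByteCode { OP_PLUS, OP_2STAR, OP_COM };  /* tiny made-up subset */

    /* Lower one compiled word to an LLVM function.  For the sketch the word
       takes its two data-stack inputs as arguments and returns its single
       result; deciding how the real VM stacks get passed around is exactly
       the part I still have to work out. */
    Function *lowerWord(Module &mod, const std::vector<ByteCode> &code)
    {
        LLVMContext &ctx = mod.getContext();
        IRBuilder<> bld(ctx);

        std::vector<Type *> params(2, bld.getInt32Ty());
        FunctionType *fty = FunctionType::get(bld.getInt32Ty(), params, false);
        Function *fn = Function::Create(fty, Function::ExternalLinkage,
                                        "word_example", &mod);
        bld.SetInsertPoint(BasicBlock::Create(ctx, "entry", fn));

        /* Compile-time model of the data stack: a stack of llvm::Value*.
           Pushes and pops happen here, at translation time, so matched
           push/pop traffic never appears in the generated code at all. */
        std::vector<Value *> stk;
        for (Function::arg_iterator a = fn->arg_begin(); a != fn->arg_end(); ++a)
            stk.push_back(&*a);

        for (size_t i = 0; i < code.size(); ++i) {
            switch (code[i]) {
            case OP_PLUS: {                      /* + -> a single add    */
                Value *rhs = stk.back(); stk.pop_back();
                Value *lhs = stk.back(); stk.pop_back();
                stk.push_back(bld.CreateAdd(lhs, rhs));
                break;
            }
            case OP_2STAR:                       /* 2* -> a single shift */
                stk.back() = bld.CreateShl(stk.back(), bld.getInt32(1));
                break;
            case OP_COM:                         /* COM -> a single not  */
                stk.back() = bld.CreateNot(stk.back());
                break;
            }
        }
        bld.CreateRet(stk.back());
        return fn;
    }

My hope is that once a word is in this form, instcombine and friends can clean
up whatever is left over without me having to be especially clever about it.
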
So here are my questions (sorry to take so long to get to them). If given the
choice, am I generally better off taking my byte code level and translating it
to LLVM intermediate code and then driving new LLVM-optimized intel code
generation from there? Or given that I already have each of these byte codes
translated to intel assembler, am I better off just gathering all those
assembler sequences together and optimizing this at the assembly language level?
I'm going to guess the former, since working at the byte code level, I may
be able to substitute entire byte codes with more judicious use of registers and
end up with a much shorter piece of aggregated assembler to work with. But it
could also be the case that there are no good rules of thumb and that I simply
need to evaluate on a case by case basis. But if anyone does have any wisdom to
offer or sources they can direct me to, I would be most appreciative.
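If the former approach does turn out to be the right road, I'm assuming the
follow-on step is simply to run a handful of standard function passes over each
lowered word before asking LLVM for intel code -- something like this (the pass
selection here is just a guess on my part, not a recommendation):

    #include "llvm/PassManager.h"
    #include "llvm/Transforms/Scalar.h"

    /* Run a few standard cleanup passes over one lowered word.
       (FunctionPassManager and these header locations are as of the
       current 2.9/3.0 trees.) */
    void optimizeWord(llvm::Module &mod, llvm::Function &fn)
    {
        llvm::FunctionPassManager fpm(&mod);
        fpm.add(llvm::createPromoteMemoryToRegisterPass()); /* mem2reg     */
        fpm.add(llvm::createInstructionCombiningPass());    /* instcombine */
        fpm.add(llvm::createGVNPass());                     /* redundancy  */
        fpm.add(llvm::createCFGSimplificationPass());       /* simplifycfg */
        fpm.doInitialization();
        fpm.run(fn);
        fpm.doFinalization();
    }
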
The other question is that a lot of my code isn't even written in byte code,
but rather directly in assembly language using my built-in assembler. For this
code, does LLVM offer opportunities to optimize, even if the source assembler
isn't something that has been generated from LLVM intermediate code? Also
I'm imagining that if I succeed in compiling a particular word into an optimized
sequence of assembly language, then any call to that word from another word
(through my CALL op) can be translated to a jmp instruction, thereby eliminating
the substantial call overhead of my high-level Forth code. Once again, any
guidance would be greatly appreciated, even if it's just directing me to any
good source material that might be helpful.
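And so the tail-call idea isn't completely hand-wavy, here is roughly what I
picture on the LLVM side once both words have been lowered (again just a
sketch; the single-argument calling convention is invented for the example):

    #include "llvm/Function.h"
    #include "llvm/CallingConv.h"
    #include "llvm/Support/IRBuilder.h"

    /* When a word's final action is a CALL of another word, emit the call
       marked as a tail call and return its result; with fastcc on both
       caller and callee (and -tailcallopt), the x86 backend can then turn
       it into a plain jmp rather than a call/ret pair. */
    void emitTailCall(llvm::IRBuilder<> &bld, llvm::Function *callee,
                      llvm::Value *arg)
    {
        llvm::CallInst *c = bld.CreateCall(callee, arg);
        c->setCallingConv(llvm::CallingConv::Fast);
        c->setTailCall(true);
        bld.CreateRet(c);
    }
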
Thanks very much for slogging through this and for any assistance anyone might
be able to provide.
Regards,
Mike