On Jun 28, 2009, at 11:00 PM, David Terei wrote:
> Hi all,
>
> I'm working on using LLVM as a back-end for an existing compiler (GHC
> Haskell compiler)
Very cool!
> and one of the problems I'm having is pinning a
> global variable to an actual machine register. I've seen mixed
> terminology for this feature/idea, so what I mean by this is that I
> want to be able to put a global variable into a specified hardware
> register.
Let's separate two things here: 1) GCC's implementation of this feature,
and 2) the semantic/perf effect of doing it.
For 1) GCC implements this feature (with the example code you gave) by
globally changing the allocatable register set for the backend and
pinning the value to the specified physical register. This is really
easy for GCC to do (yay, global variables for everyone, even the
backend) and has the "right effect". However, this implementation is
inappropriate in LLVM: if we wanted to take this approach, we'd have
to encode the set of pinned physregs in the top-level module structure
somewhere: this is not impossible, but it is kinda ugly.
#2 is the more interesting part of this. Ignoring GCC's
implementation, the semantic effect is that the calling convention of
the functions in the translation unit is changed (so that the global is
guaranteed to be in the specific physreg on entry to and exit from the
function) and the global is guaranteed
to be in the register in inline asms. Interestingly (to me at
least :), there is no guarantee that this value be in the physreg at a
random point in the function. There is no "defined" way to notice
this, so the compiler can cheat and reuse the register if it wants to
(e.g. spilling the temp value to the stack etc). While you could
notice this with a debugger, performance tool, etc, normal code should
be fine.
> This declaration should thus reserve that machine register
> for exclusive use by this global variable. This is used in GHC since
> it defines an abstract machine as part of its execution model, with
> this abstract machine consisting of several virtual registers. Due to
> the frequency the virtual registers are accessed it is best for
> performance that they be permanently assigned to a physical machine
> register.
Right. Coming back to "why do this", you want it because it is good
for performance: these values are accessed frequently enough that
going to globals (particularly for PIC code) is too expensive.
> A very simple example C program using this feature:
>
> --------------------------
> #include <stdio.h>
>
> register int R1 __asm__ ("esi");
>
> int main(void)
> {
>     R1 = 3;
>     printf("register: %d\n", R1);
>     R1 *= 2;
>     printf("register: %d\n", R1);
>     return 0;
> }
> --------------------------
>
> llvm-gcc doesn't compile this program correctly, although according to
> the llvm-gcc release notes this extension was first supported by llvm-
> gcc in 1.9.
This program actually works for me if you build with -O, but it looks
like it is an accident that it works :). The implementation in
llvm-gcc could definitely be fixed in this case. However, the more
interesting example wouldn't work: if printf were some other function
that read ESI, it would not be guaranteed to see R1's value.
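To make the "more interesting" case concrete, here is a minimal sketch in which a function other than printf reads the pinned variable. The names read_r1 and demo are made up for illustration. The preprocessor guard is also my assumption: the GCC global register variable is only used when building with GCC for 32-bit x86, and everywhere else the sketch falls back to an ordinary global so that it still builds.

```c
#include <stdio.h>

#if defined(__GNUC__) && !defined(__clang__) && defined(__i386__)
/* GCC extension from the original example: keep R1 in ESI. */
register int R1 __asm__("esi");
#else
/* Assumption: plain global as a portable fallback, no register pinning. */
int R1;
#endif

/* Hypothetical callee that reads the pinned variable. With the GCC
   extension this is a read of ESI, so it only sees the right value if
   every caller keeps R1 in ESI across the call. */
static int read_r1(void)
{
    return R1;
}

/* Hypothetical driver: returns the sum of the two values the callee saw. */
int demo(void)
{
    int first, second;

    R1 = 3;
    first = read_r1();
    printf("register: %d\n", first);
    R1 *= 2;
    second = read_r1();
    printf("register: %d\n", second);
    return first + second;
}
```

A correct implementation has the callee observe 3 and then 6, so demo() returns 9. A compiler that only keeps R1 in ESI by accident inside one function gives no such guarantee across the call.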
If it were important to me to implement this, I'd implement this
extension by adding a new custom calling convention to the X86 backend
that always passed the first i32 value in ESI and always returned the
first i32 value in ESI. Given that, you could lower the above code to
something like this pseudo code:
{i32,i32} @main(i32 %in_esi) {
  %esi = alloca i32
  store in_esi -> esi
  store 3 -> esi
  esi1 = load esi
  {esi2, dead} = call @printf(esi1, "register: %d\n", esi1);
  store esi2 -> esi
  esi3 = load esi
  esi4 = esi3*2
  store esi4 -> esi
  esi5 = load esi
  {esi6, dead} = call @printf(esi5, "register: %d\n", esi5);
  store esi6 -> esi
  esi7 = load esi
  ret {esi7, 0}
}
Each of printf and main would be marked with the custom CC. After
running mem2reg on this, you'd get:
{i32,i32} @main(i32 %in_esi) {
  {esi2, dead} = call @printf(3, "register: %d\n", 3);
  esi4 = esi2*2
  {esi6, dead} = call @printf(esi4, "register: %d\n", esi4);
  ret {esi6, 0}
}
When lowered at codegen time, the regalloc would trivially eliminate
the copies into/out-of ESI and you'd get the code you desired.
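The same lowering can be modelled in plain C, which may make the data flow easier to follow. This is only a sketch of the idea, not how a backend would do it: the PinnedResult struct stands in for the {i32,i32} return value, and print_register and lowered_main are hypothetical names. The pinned value is passed as the first argument and handed back as the first field of the result, which is what the custom calling convention would do with ESI.

```c
#include <stdio.h>

/* Hypothetical stand-in for the {i32,i32} return in the pseudo code. */
typedef struct {
    int pinned;  /* value that would be in ESI on return */
    int result;  /* the function's ordinary return value */
} PinnedResult;

/* Stand-in for printf under the custom convention: it receives the
   pinned value as its first argument and must hand it back. */
PinnedResult print_register(int pinned, int value)
{
    PinnedResult r;

    r.pinned = pinned;
    r.result = printf("register: %d\n", value);
    return r;
}

/* main from the original program with R1 threaded through explicitly. */
PinnedResult lowered_main(int in_pinned)
{
    int r1 = in_pinned;             /* store in_esi -> esi */
    PinnedResult call, out;

    r1 = 3;                         /* R1 = 3 */
    call = print_register(r1, r1);
    r1 = call.pinned;               /* callee hands the value back */
    r1 = r1 * 2;                    /* R1 *= 2 */
    call = print_register(r1, r1);
    r1 = call.pinned;

    out.pinned = r1;                /* ret {esi7, 0} */
    out.result = 0;
    return out;
}
```

Whatever value is passed in, lowered_main returns with pinned equal to 6 and result equal to 0, and the local r1 is exactly the kind of variable that mem2reg would promote out of memory.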
No, I don't know of anyone planning to implement this, but it is
conceptually quite simple :)
-Chris