<moving this to llvmdev instead of commits>
On Jan 22, 2008, at 11:23 PM, Duncan Sands wrote:
>> Okay, well we already get many other x86-64 issues wrong, but
>> Evan is chipping away at it. How do you pass an array by value in C?
>> Example please,
>
> I find the x86-64 ABI hard to interpret, but it seems to say that
> aggregates are classified recursively, so it looks like a struct
> containing a small integer array should be passed in integer
> registers.
Right. For x86-64 in particular, this happens, but only if the struct
is <= 128 bits.
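For instance, here is a minimal sketch of what the two cases could
look like at the IR level (struct names invented, not llvm-gcc's
exact output):

%struct.S4 = type { i32, i32, i32, i32 }        ; 128 bits
%struct.S5 = type { i32, i32, i32, i32, i32 }   ; 160 bits

; <= 128 bits: classified into two 64-bit integer hunks, passed in GPRs.
declare void @takes_s4(i64, i64)

; > 128 bits: passed in memory via byval.
declare void @takes_s5(%struct.S5* byval)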
> Also, it is easy in Ada: the compiler passes small arrays by value,
Ok, we should make sure this works when we think x86-64 is "done" :)
> Can you please clarify the roles of llvm-gcc and the code generators
> in getting ABI compatibility.
Sure! This also ties into the discussion in PR1937. The basic
problem we have is that ABI decisions can be completely arbitrary, and
are defined in terms of the source-level type system. In our desire
to preserve the source-language independence of LLVM, we can't just
give the code generators an AST and tell them to figure out what to
do. While it would theoretically be useful to hand the target an LLVM
type and tell it to figure things out, that doesn't work either. The
most trivial example of this is that some ABIs require different
handling for _Complex double and "struct {double,double}", both of
which lower to the same LLVM type. This means that the LLVM type
system isn't rich enough by itself to fully express how the target is
supposed to handle something.
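Here is a minimal sketch of that ambiguity (type names invented):

; In C:  void f(_Complex double c);
; In C:  void g(struct { double x, y; } s);
; Both parameter types lower to the single structural LLVM type
; { double, double }, so nothing in the IR type system distinguishes
; them; the two names below denote exactly the same type:
%complex.double = type { double, double }
%struct.pair = type { double, double }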
Right now, LLVM IR has two ways to express argument passing and
return values (both forms are sketched below):
1) pass/return a first-class value type, like i32, float, <4 x i32>, etc.
2) pass/return a pointer to the memory for the value, marked with the
byval/sret attributes.
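For a small struct like "struct S { int a, b; };", the two
formulations look roughly like this (a hedged sketch, not llvm-gcc's
exact output):

%struct.S = type { i32, i32 }

; 1) expanded into first-class scalar values:
define i32 @f_scalar(i32 %s.0, i32 %s.1) nounwind {
entry:
  ret i32 %s.0
}

; 2) passed as a pointer to a caller-made copy, marked byval:
define i32 @f_byval(%struct.S* byval %s) nounwind {
entry:
  %p = getelementptr %struct.S* %s, i32 0, i32 0
  %v = load i32* %p
  ret i32 %v
}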
It's useful to notice that the formulation of something in the IR
doesn't force the code generator to do anything (e.g. x86-32 passes
almost everything on the stack regardless of whether you use scalars
or byval), but it does have an impact on the optimizers and on
compile time.
For example, consider a target where everything is passed on the
stack. In this case, from a functionality perspective, it doesn't
matter whether you use byval to pass the argument or pass it as scalar
values. However, picking the right one *can* have code quality and
QOI impact. For example, when passing a 100K struct by value, it is
much better (in terms of compile time and generated code) to use byval
than to scalarize it and pass all the elements.
OTOH, passing an argument 'byval' on this target prevents it from
being SROA'd on both the callee and caller side. If the argument is
small (say a 32-bit struct), this can cause significant performance
degradation, so as a QOI issue it is better to pass such a small
aggregate as a scalar.
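To make the SROA point concrete, here is a hedged sketch of the
caller side (reusing %struct.S, @f_scalar, and @f_byval from the
sketch above): with byval, even constant arguments must be spilled to
a stack temporary, while the scalar form needs no memory traffic at
all.

define i32 @caller_byval() nounwind {
entry:
  %tmp = alloca %struct.S           ; copy forced into memory
  %p0 = getelementptr %struct.S* %tmp, i32 0, i32 0
  store i32 1, i32* %p0
  %p1 = getelementptr %struct.S* %tmp, i32 0, i32 1
  store i32 2, i32* %p1
  %r = call i32 @f_byval(%struct.S* byval %tmp)
  ret i32 %r
}

define i32 @caller_scalar() nounwind {
entry:
  %r = call i32 @f_scalar(i32 1, i32 2)   ; values stay in registers
  ret i32 %r
}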
In practice, most targets have more complex ABIs than the theoretical
one above. For example, x86-32 passes vector arguments in registers
up to a point. On that target, the code generator contract is that
'byval' arguments are always passed in memory, SSE-compatible vectors
are passed in XMM registers (up to a point), and everything else is
passed in memory.
This has somewhat interesting implications: it means that it is okay
to pass a {i32} struct as i32, and that passing a _Complex float as
two floats is also fine (yay for SROA). However, it means that
lowering a struct with two vectors in it into two vector arguments
would actually break the ABI, because the codegen would pass them in
XMM regs instead of memory (see the sketch below). This is a funny
dance, which means that the front-end needs to be fully parameterized
by the back-end to do the lowering.
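Concretely, a hedged sketch of the x86-32 situation (names invented):

%struct.VV = type { <4 x float>, <4 x float> }

; ABI-correct: byval arguments always go in memory on x86-32.
declare void @pass_correct(%struct.VV* byval)

; ABI-breaking: scalarized vector arguments would be placed in XMM
; registers, so the callee would look in the wrong place for them.
declare void @pass_broken(<4 x float>, <4 x float>)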
> When generating IR for x86-64, llvm-gcc sometimes chops by-value
> structs into pieces, and sometimes passes the struct as a byval
> parameter. Since it chops up all-integer structs, and this
> corresponds more or less to what the ABI says, I assumed this was an
> attempt to get ABI correctness. Especially as the code generators
> don't seem to bother themselves with following the details of the
> ABI (yet), and just push byval parameters onto the stack.
X86-64 is a much more complex ABI than x86-32. The basic form of
correctness is that the code generator:
1. Lowers byval arguments to memory.
2. Passes integer, FP, and vector arguments in GPRs and XMM regs
where available.
This has an interesting impact on the C front-end. In particular, #1
is great for by-value aggregates > 128 bits. However, aggregates <=
128 bits have a variety of possible cases, including:
1. Some aggregates are passed in memory.
2. Others are treated as two 64-bit hunks, where each 64-bit hunk can be:
2a. Passed in a GPR.
2b. Passed in an XMM register.
If you consider a struct like {float,float,float,float}, the
interesting thing about this ABI is that it says this struct is passed
in two XMM regs, with each pair of floats packed into the low two
elements of an XMM register. To lower this struct optimally, llvm-gcc
should codegen it as two vector inserts yielding two XMM register
arguments. Codegen'ing it as a byval struct would be incorrect,
because that would pass it on the stack.
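Here is a hedged sketch of what that optimal lowering could look like
on the caller side, assuming the code generator passes a <2 x float>
argument in the low 64 bits of an XMM register (function names
invented):

declare float @foo_vec(<2 x float>, <2 x float>)

define float @call_foo_vec(float %w, float %x, float %y, float %z) nounwind {
entry:
  %lo.0 = insertelement <2 x float> undef, float %w, i32 0
  %lo = insertelement <2 x float> %lo.0, float %x, i32 1
  %hi.0 = insertelement <2 x float> undef, float %y, i32 0
  %hi = insertelement <2 x float> %hi.0, float %z, i32 1
  %r = call float @foo_vec(<2 x float> %lo, <2 x float> %hi)
  ret float %r
}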
I moved a big digression to the end of the mail.
I'll be the first to admit that this solution is suboptimal, but it is
much better than what we had before. Unresolved issues include what
alignment to use when passing things on the stack. Evan recently
fought with some crazy cases on x86-64 which currently require looking
at the LLVM type. I'm not thrilled with this, but it seems like an
okay thing to do for now. If we find out it isn't, we'll have to
extend the model somehow.
>>> This is an optimization, not a correctness issue
>
> I guess this means that the plan is to teach the code generators how
> to pass any aggregate byval in an ABI-conformant way (not the case
> right now), but still do some chopping up in the front-end to help
> the optimizers.
Right. Currently, x86-32 attempts to pass 32-bit and 64-bit structs
"better" than just using byval, as an optimization for some common
small cases. However, the problem is that it doesn't generate "nice"
accesses into the struct: it just bitcasts the pointer and does a
32/64-bit load, which can often prevent SROA itself. This needs to be
fixed to get really good code, but it is an optimization, not a
correctness issue: disabling it and passing these structs byval on
x86-32 would generate equally correct code (a sketch of both forms
follows).
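Here is a hedged reconstruction of both forms for a 64-bit struct
(names invented):

%struct.P = type { i32, i32 }
declare void @use(i64)

; What we generate today: one casted 64-bit load. If %p points at an
; alloca, the bitcast + wide load can block SROA.
define void @pass_current(%struct.P* %p) nounwind {
entry:
  %c = bitcast %struct.P* %p to i64*
  %v = load i64* %c
  call void @use(i64 %v)
  ret void
}

; "Nice" accesses: per-field loads reassembled with shifts (assuming
; little-endian x86, the first field lands in the low 32 bits), which
; SROA can see through.
define void @pass_nice(%struct.P* %p) nounwind {
entry:
  %a.p = getelementptr %struct.P* %p, i32 0, i32 0
  %b.p = getelementptr %struct.P* %p, i32 0, i32 1
  %a = load i32* %a.p
  %b = load i32* %b.p
  %a.64 = zext i32 %a to i64
  %b.64 = zext i32 %b to i64
  %b.sh = shl i64 %b.64, 32
  %v = or i64 %a.64, %b.sh
  call void @use(i64 %v)
  ret void
}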
> Of course this chopping up needs to be done carefully so the final
> result squirted out by the code generators (once they are ABI
> conformant) is the same as if the chopping had not been done...
Right, and all this is target-specific, yuck. :)
> Is this chopping really a big win? Is it not possible to get an
> equivalent level of optimization by enhancing alias analysis?
Nope, AA isn't involved here, because you can't know who called you in
general. For example, consider this contrived case:

struct s { int x; };
int foo(struct s S) { return S.x; }

With byval, this turns into a load + return at the IR level. Without
byval, it is just a return. There is no amount of alias analysis that
can fix this up, because we don't know who calls the function: without
changing the prototype of the IR function to not be byval, you can't
eliminate the explicit load.
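At the IR level, the two lowerings look like this (a minimal sketch):

%struct.s = type { i32 }

; byval form: the load is pinned in place; no analysis of this
; function alone can remove it without changing the prototype.
define i32 @foo_byval(%struct.s* byval %S) nounwind {
entry:
  %p = getelementptr %struct.s* %S, i32 0, i32 0
  %x = load i32* %p
  ret i32 %x
}

; scalar form: no memory access at all.
define i32 @foo_scalar(i32 %S.x) nounwind {
entry:
  ret i32 %S.x
}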
The Digression:
Incidentally, on x86-64 we're currently lowering this code to
suboptimal (but correct) code that passes the struct as two doubles
and goes through memory to turn them back into floats, instead of
using vector extracts:

struct a { float w, x, y, z; };
float foo(struct a b) { return b.w+b.x+b.y+b.z; }
%struct.a = type { float, float, float, float }

define float @foo(double %b.0, double %b.1) nounwind {
entry:
  %b_addr = alloca { double, double }                              ; <{ double, double }*> [#uses=4]
  %tmpcast = bitcast { double, double }* %b_addr to %struct.a*     ; <%struct.a*> [#uses=3]
  %tmp1 = getelementptr { double, double }* %b_addr, i32 0, i32 0  ; <double*> [#uses=1]
  store double %b.0, double* %tmp1, align 8
  %tmp3 = getelementptr { double, double }* %b_addr, i32 0, i32 1  ; <double*> [#uses=1]
  store double %b.1, double* %tmp3, align 8
  %tmp5 = bitcast { double, double }* %b_addr to float*            ; <float*> [#uses=1]
  %tmp6 = load float* %tmp5, align 8                               ; <float> [#uses=1]
  %tmp7 = getelementptr %struct.a* %tmpcast, i32 0, i32 1          ; <float*> [#uses=1]
  %tmp8 = load float* %tmp7, align 4                               ; <float> [#uses=1]
  %tmp9 = add float %tmp6, %tmp8                                   ; <float> [#uses=1]
  %tmp10 = getelementptr %struct.a* %tmpcast, i32 0, i32 2         ; <float*> [#uses=1]
  %tmp11 = load float* %tmp10, align 4                             ; <float> [#uses=1]
  %tmp12 = add float %tmp9, %tmp11                                 ; <float> [#uses=1]
  %tmp13 = getelementptr %struct.a* %tmpcast, i32 0, i32 3         ; <float*> [#uses=1]
  %tmp14 = load float* %tmp13, align 4                             ; <float> [#uses=1]
  %tmp15 = add float %tmp12, %tmp14                                ; <float> [#uses=1]
  ret float %tmp15
}
This yields correct but suboptimal code:
_foo:
  subq  $16, %rsp
  movsd %xmm0, (%rsp)
  movsd %xmm1, 8(%rsp)
  movss (%rsp), %xmm0
  addss 4(%rsp), %xmm0
  addss 8(%rsp), %xmm0
  addss 12(%rsp), %xmm0
  addq  $16, %rsp
  ret
We really want:
_foo:
  movaps %xmm0, %xmm2
  shufps $1, %xmm2, %xmm2
  addss  %xmm2, %xmm0
  addss  %xmm1, %xmm0
  shufps $1, %xmm1, %xmm1
  addss  %xmm1, %xmm0
  ret
-Chris