The X86-64 calling convention (annoyingly) specifies that "struct x { float
a,b,c,d; }" is passed or returned in the low 2 elements of two separate XMM
registers. For example, returning that would return "a,b" in the low
elements of XMM0 and "c,d" in the low elements of XMM1. Both llvm-gcc
and clang currently generate atrocious IR for these structs, which you can see
if you compile this:
struct x { float a,b,c,d; };
struct x foo(struct x *P) { return *P; }
The machine code generated by llvm-gcc[*] for this is:
_foo:
        movl    (%rdi), %eax
        movl    4(%rdi), %ecx
        shlq    $32, %rcx
        addq    %rax, %rcx
        movd    %rcx, %xmm0
        movl    8(%rdi), %eax
        movl    12(%rdi), %ecx
        shlq    $32, %rcx
        addq    %rax, %rcx
        movd    %rcx, %xmm1
        ret
when we really just want:
_foo:
        movq    (%rdi), %xmm0
        movq    8(%rdi), %xmm1
        ret
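The short version works because the struct's 16 bytes split cleanly into two
8-byte halves. Here is a minimal C sketch of that split. The names
"struct halves" and "split_halves" are hypothetical, and each uint64_t merely
stands in for the low 64 bits of an XMM register:

```c
#include <stdint.h>
#include <string.h>

struct x { float a, b, c, d; };

/* Hypothetical stand-in for the low 64 bits of XMM0 and XMM1. */
struct halves { uint64_t lo, hi; };

/* Copy each 8-byte half of the struct in one piece, mirroring the two
   movq loads, rather than assembling it from four 32-bit loads. */
static struct halves split_halves(const struct x *p) {
    struct halves h;
    memcpy(&h.lo, (const char *)p, 8);      /* bytes 0..7:  a, b */
    memcpy(&h.hi, (const char *)p + 8, 8);  /* bytes 8..15: c, d */
    return h;
}
```

This only models the data movement. It says nothing about which registers a
compiler would actually pick for "struct halves".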
I'm looking at having clang generate IR for this by passing and returning
the two halves as v2f32 values, which they are, and doing insert/extracts in the
caller/callee. However, at the moment, the x86 backend is passing each element
of the v2f32 as an f32, instead of promoting the type and passing the v2f32 as
the low two elements of a v4f32. In the example above, this means the four
elements are returned in XMM0, XMM1, XMM2 and XMM3 instead of just XMM0 and XMM1.
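As a rough source-level picture of the insert/extract scheme, here is a sketch
using the GCC/Clang vector_size extension (not ISO C). The names "v2f32",
"struct x_parts", "x_to_parts" and "x_from_parts" are made up for
illustration; this is not the IR clang emits, only the shape of the idea:

```c
struct x { float a, b, c, d; };

/* 8-byte vector of two floats: GCC/Clang vector extension, not ISO C. */
typedef float v2f32 __attribute__((vector_size(8)));

/* Hypothetical pair standing in for the two values that would be returned. */
struct x_parts { v2f32 lo, hi; };

/* Callee side: insert the four fields into two v2f32 values. */
static struct x_parts x_to_parts(const struct x *p) {
    struct x_parts r;
    r.lo = (v2f32){ p->a, p->b };
    r.hi = (v2f32){ p->c, p->d };
    return r;
}

/* Caller side: extract the elements back out into the struct. */
static struct x x_from_parts(struct x_parts r) {
    struct x s = { r.lo[0], r.lo[1], r.hi[0], r.hi[1] };
    return s;
}
```

With the promotion described above, each v2f32 would travel in the low two
elements of a single XMM register rather than as two separate f32 values.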
We already do this sort of vector promotion for operators in type legalization.
Is there any reason not to do it for the calling convention case? Is there
anyone interested in working on this? :)
-Chris
[*] Clang happens to generate good machine code for this case, but the IR is
still awful and it falls down hard on other similar cases.