thr3ads.net - llvm dev - [LLVMdev] Macro-op fusion experiment [Apr 2011]

Jakob Stoklund Olesen

2011-Apr-08 16:25 UTC

[LLVMdev] Macro-op fusion experiment

On Apr 8, 2011, at 3:29 AM, Nicolas Capens wrote:
> x86 processors use macro-op fusion to merge together two instructions and
> execute them as one. So it's beneficial for the compiler to emit them as a
> pair.
>
> Currently only compare and jump instructions get fused though. And I was
> wondering whether it also makes sense to fuse move and arithmetic instructions
> together, to form non-destructive instructions (which x86 lacks for regular
> instructions). For instance:
>                 8B C3 mov eax, ebx 
>                 03 C1 add eax, ecx
> becomes
>                 8B C3 03 C1 add eax, ebx, ecx
>  
> There's no difference in the binary encoding; it's just considered
> one instruction at a logical level and inside the hardware (I'm assuming
> x86's RISC internals actually use non-destructive micro-operations).

Most x86 implementations use register renaming these days, so micro-operations
are non-destructive, but they don't refer to architectural registers. They
refer to a larger set of physical registers.

Register copies are mostly free to execute except they increase code size and
consume decoder resources. To my knowledge, they are not fused in the way you
describe.
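
To make the renaming argument concrete, here is a toy model of a rename stage. It is only a sketch under simplifying assumptions, not a description of any particular processor: the `Renamer` class, the `p0`, `p1`, ... register names and the idea that a copy is resolved entirely in the rename table are all illustrative, and whether real hardware handles a 'mov' this way varies by microarchitecture.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Renamer:
    """Toy rename stage: maps architectural registers to physical ones."""
    table: dict = field(default_factory=dict)     # architectural -> physical
    fresh: count = field(default_factory=count)   # supplies p0, p1, ...
    uops: list = field(default_factory=list)      # micro-ops actually issued

    def phys(self, reg):
        # Allocate a physical register the first time a name is read.
        if reg not in self.table:
            self.table[reg] = f"p{next(self.fresh)}"
        return self.table[reg]

    def mov(self, dst, src):
        # Assumption: a copy only re-points the table entry; no micro-op.
        self.table[dst] = self.phys(src)

    def add(self, dst, src):
        # dst = dst + src becomes a three-operand micro-op that writes a
        # fresh physical register, so neither source is overwritten.
        a, b = self.phys(dst), self.phys(src)
        out = f"p{next(self.fresh)}"
        self.table[dst] = out
        self.uops.append(("add", out, a, b))


r = Renamer()
r.mov("eax", "ebx")
r.add("eax", "ecx")
print(r.uops)           # [('add', 'p2', 'p0', 'p1')]
print(r.table["ebx"])   # p0
```

In this model the mov/add pair already turns into a single non-destructive micro-op without any fusion: ebx still maps to its old physical register after the add.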

Intel's optimization reference manual describes which instructions can be
fused. The Sandy Bridge processors fuse more pairs than previous generations,
but the second instruction is always a conditional branch.

There is no need to define pseudo-instructions to support this. If you want to
experiment, you could add a late pass that tries to form fusable pairs by
pushing instructions down to the conditional branch. This should happen after
register allocation where code is often inserted before a branch.
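
A rough sketch of what such a pass would do, written in Python rather than against the real LLVM data structures. Everything here is hypothetical: the `Instr` record, the `eflags` pseudo-register, and the `FUSABLE_FIRST` set (which assumes only cmp and test fuse; the actual list depends on the processor and is given in Intel's manual).

```python
from dataclasses import dataclass

FLAGS = "eflags"
FUSABLE_FIRST = {"cmp", "test"}   # assumption, see the optimization manual


@dataclass(frozen=True)
class Instr:
    text: str
    defs: frozenset   # registers written
    uses: frozenset   # registers read


def can_swap(a, b):
    """True if adjacent instructions a, b may be reordered as b, a."""
    return not (a.defs & b.uses or a.uses & b.defs or a.defs & b.defs)


def sink_compare(block):
    """Move the flag-setting compare down next to the final branch."""
    block = list(block)
    if len(block) < 2 or FLAGS not in block[-1].uses:
        return block                      # no conditional branch at the end
    # Find the last instruction before the branch that writes the flags.
    for i in range(len(block) - 2, -1, -1):
        if FLAGS in block[i].defs:
            break
    else:
        return block
    if block[i].text.split()[0] not in FUSABLE_FIRST:
        return block
    # Only move if the compare can pass every instruction in between.
    if all(can_swap(block[i], other) for other in block[i + 1:-1]):
        cmp = block.pop(i)
        block.insert(len(block) - 1, cmp)
    return block


def ins(text, defs=(), uses=()):
    return Instr(text, frozenset(defs), frozenset(uses))


# A reload inserted by the register allocator separates cmp and jl.
before = [
    ins("cmp eax, ebx", defs=[FLAGS], uses=["eax", "ebx"]),
    ins("mov ecx, [esp+4]", defs=["ecx"], uses=["esp"]),
    ins("jl .L1", uses=[FLAGS]),
]
after = sink_compare(before)
print([i.text for i in after])
```

Here the reload neither touches the flags nor the compare's operands, so the compare ends up directly in front of the branch. If the intervening instruction wrote eax or ebx, the block would be left unchanged.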

I would be interested to see the performance impact of such a pass.

/jakob

NAKAMURA Takumi

2011-Apr-08 16:56 UTC

[LLVMdev] Macro-op fusion experiment

>>                 8B C3 mov eax, ebx
>>                 03 C1 add eax, ecx
>> becomes
>>                 8B C3 03 C1 add eax, ebx, ecx

As I understand it, the twoaddr pass tends to emit such a sequence.

I don't have a Sandy Bridge machine, so I have not measured this.
Prior processors (Intel and AMD) might spend one ALU to execute the "mov",
and the add then has to wait for the mov because it depends on it.

In contrast, the sequence below (AT&T syntax) might be executed in parallel:
mov %ebx, %eax
add %ecx, %ebx
(I understand it might not be applicable in all cases)
Thoughts?
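
The difference between the two sequences can be checked mechanically. The sketch below is illustrative only: the `hazards` helper and the hand-written def/use sets are assumptions made for this example, with `eflags` standing in for the flags that add writes.

```python
def hazards(first, second):
    """Classify register dependencies of `second` on `first`.

    Each instruction is a (defs, uses) pair of register-name sets.
    """
    d1, u1 = first
    d2, u2 = second
    found = []
    if d1 & u2:
        found.append("RAW")   # true dependency: second reads first's result
    if u1 & d2:
        found.append("WAR")   # anti-dependency: second overwrites an input
    if d1 & d2:
        found.append("WAW")   # output dependency: both write the same name
    return sorted(found)


# mov eax, ebx ; add eax, ecx   (the original pair)
pair_a = hazards(({"eax"}, {"ebx"}),
                 ({"eax", "eflags"}, {"eax", "ecx"}))

# mov %ebx, %eax ; add %ecx, %ebx   (the alternative pair)
pair_b = hazards(({"eax"}, {"ebx"}),
                 ({"ebx", "eflags"}, {"ebx", "ecx"}))

print(pair_a)   # ['RAW', 'WAW']
print(pair_b)   # ['WAR']
```

The original pair has a true (read-after-write) dependency on eax, while the alternative has only a write-after-read hazard on ebx, which is the kind register renaming removes. The price is that the sum ends up in ebx instead of eax.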

...Takumi

Jakob Stoklund Olesen

2011-Apr-08 17:27 UTC

[LLVMdev] Macro-op fusion experiment

On Apr 8, 2011, at 9:56 AM, NAKAMURA Takumi wrote:
>>>                 8B C3 mov eax, ebx
>>>                 03 C1 add eax, ecx
>>> becomes
>>>                 8B C3 03 C1 add eax, ebx, ecx
>
> In my understanding, twoaddr pass tends to emit such a sequence.

Yes, it always does, and the coalescer tries very hard to eliminate the copy.

> Though I don't have sandybridge, I have not measured.
> Prior processors(intel and amd) might spend 1 ALU to execute "mov",
> then mov - add must have dependency.

I think you will find it is more complicated than that. A 'mov' usually
doesn't need an ALU resource.

You should read about the 'reservation station' style register renaming.

http://en.wikipedia.org/wiki/Register_renaming
http://www.intel.com/Assets/PDF/manual/248966.pdf

/jakob
