Dear Jakob-san,
> Welcome! (ようこそ!)
:D
Thank you for your reply. First of all, I must apologize:
I misunderstood the aim of the SSE execution domain fixup pass.
Now I see what the pass does.
Still, the point I would like to raise is "throughput",
rather than (inter-domain) latency.
In fact, by my measurement, FP ops are 3x slower than SI ops on Nehalem.
So I think SI ops should be preferred on Nehalem (and for generic SSE2).
(The shorter float instructions could still be preferred with -Os.)
The attachment contains a simple (but admittedly bogus) asm-in-C source
file and a Win32 executable.
$ mingw32-gcc -msse2 -O4 -Wall -funroll-all-loops foo.c
It should also compile on other x86 hosts.
But you will need to pin the process affinity to a single core. ;)
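(If you prefer, the program can pin itself; a minimal Win32 sketch,
assuming the attached source does not already do this:)

  #ifdef _WIN32
  #include <windows.h>
  /* Restrict the process to CPU 0 so every rdtsc reading
     comes from the same core. */
  static void pin_to_one_core(void)
  {
      SetProcessAffinityMask(GetCurrentProcess(), 1);
  }
  #endif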
The counts below are cycles per million iterations on a Core i7:
982270 xorps
982231 movaps
371671 pxor
342628 movdqa
SI ops can issue on three ports, but FP ops on only one.
(As we know, they are nearly the same speed on Conroe and Penryn.)
Sorry, loads via movdqa and movaps were not measured. :(
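For reference, my measurement loop looks roughly like the sketch below.
It is illustrative only, not the exact attached source; it assumes GCC
extended inline asm and the __rdtsc() intrinsic from <x86intrin.h>.
(With three ops per iteration, the absolute counts will differ from the
table above; only the xorps/pxor ratio matters.)

  #include <stdint.h>
  #include <stdio.h>
  #include <x86intrin.h>  /* __rdtsc() */

  /* Three independent ops per iteration, so the limit is issue
     throughput, not latency: three ports sustain ~3 ops/cycle,
     a single port only ~1. */
  #define CHAIN3(insn)                               \
      __asm__ volatile (insn " %%xmm4, %%xmm0\n\t"   \
                        insn " %%xmm5, %%xmm1\n\t"   \
                        insn " %%xmm6, %%xmm2"       \
                        ::: "xmm0", "xmm1", "xmm2")

  static uint64_t cycles_xorps(void)
  {
      int i;
      uint64_t t0 = __rdtsc();
      for (i = 0; i < 1000000; i++) CHAIN3("xorps");
      return __rdtsc() - t0;
  }

  static uint64_t cycles_pxor(void)
  {
      int i;
      uint64_t t0 = __rdtsc();
      for (i = 0; i < 1000000; i++) CHAIN3("pxor");
      return __rdtsc() - t0;
  }

  int main(void)
  {
      printf("%10llu xorps\n", (unsigned long long)cycles_xorps());
      printf("%10llu pxor\n", (unsigned long long)cycles_pxor());
      return 0;
  }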
See also:
- Intel optimization manual
http://www.intel.com/assets/pdf/manual/248966.pdf
- Agner Fog's optimization resources
http://agner.org/optimize/
Thank you,
Takumi
2010/5/12 Jakob Stoklund Olesen <stoklund at 2pi.dk>:
> On May 10, 2010, at 9:07 PM, NAKAMURA Takumi wrote:
>
>> Hello. This is my 1st post.
>
> Welcome! (ようこそ!)
>
>> I have tried the SSE execution domain fixup pass,
>> but I could not see any improvement.
>
> Did you actually measure runtime, or did you look at assembly?
>
>> I expect the example below to use MOVDQA, PAND, etc.
>> (On Nehalem, ANDPS is much slower than PAND.)
>
> Are you sure? The andps and pand instructions are actually the same speed,
> but on Nehalem there is a latency penalty for moving data between the int
> and float domains.
>
> The SSE execution domain pass tries to minimize the extra latency by
> switching instructions.
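(For other readers, my understanding of the unavoidable case, as an
intrinsics sketch; the bypass comment is my reading of the manuals:)

  #include <emmintrin.h>

  /* paddd exists only in the int domain and addps only in the
     float domain, so forwarding the integer sum into addps pays
     Nehalem's bypass latency.  The cast emits no instruction. */
  __m128 crossing(__m128i x, __m128i y, __m128 z)
  {
      __m128i sum = _mm_add_epi32(x, y);            /* int domain   */
      return _mm_add_ps(_mm_castsi128_ps(sum), z);  /* float domain */
  }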
>
> In your examples, all the operations are available as either int or float
> instructions. The instruction selector chooses the float instructions
> because they are smaller. The SSE execution domain pass does not change
> them because there are zero domain crossings, zero extra latency.
> Everything takes place in the float domain, which is just as fast.
>
> If you use operations that are only available in one domain, the SSE
> execution domain pass kicks in:
>
> define <4 x i32> @intfoo(<4 x i32> %x, <4 x i32> %y, <4 x i32> %z) nounwind readnone {
> entry:
> %0 = add <4 x i32> %x, %z
> %not = xor <4 x i32> %z, <i32 -1, i32 -1, i32 -1, i32 -1>
> %1 = and <4 x i32> %not, %y
> %2 = xor <4 x i32> %0, %1
> ret <4 x i32> %2
> }
>
> _intfoo:
> movdqa %xmm0, %xmm3
> paddd %xmm2, %xmm3
> pandn %xmm1, %xmm2
> movdqa %xmm2, %xmm0
> pxor %xmm3, %xmm0
> ret
>
> All the instructions moved to the int domain because the add forced them.
>
>> Please tell me if I am doing something wrong.
>
> You should measure whether LLVM's code is actually slower than the code
> you want. If it is, I would like to hear about it.
>
> Our weakness is the shufflevector instruction. It is selected into
> shufps/pshufd/palignr/... only by looking at patterns. The instruction
> selector does not consider execution domains. This can be a problem because
> these instructions cannot be freely interchanged by the SSE execution
> domain pass.
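(That matches what I observed. For illustration, a hypothetical sketch
using the GCC/Clang vector extension and Clang's __builtin_shufflevector,
which maps directly to the IR shufflevector:)

  typedef int v4si __attribute__((vector_size(16)));

  /* Everything around the shuffle is integer code, but its
     instruction (shufps/pshufd/palignr) is picked by pattern
     matching alone; a float-domain choice here would introduce a
     crossing the domain pass cannot repair, since it does not
     switch shuffles. */
  v4si reverse_then_add(v4si x, v4si y)
  {
      v4si r = __builtin_shufflevector(x, x, 3, 2, 1, 0);
      return r + y;  /* paddd */
  }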
>
>
>> foo.ll:
>> define <4 x i32> @foo(<4 x i32> %x, <4 x i32> %y, <4 x i32> %z) nounwind readnone {
>> entry:
>> %0 = and <4 x i32> %x, %z
>> %not = xor <4 x i32> %z, <i32 -1, i32 -1, i32 -1, i32 -1>
>> %1 = and <4 x i32> %not, %y
>> %2 = xor <4 x i32> %0, %1
>> ret <4 x i32> %2
>> }
>> $ llc -mcpu=nehalem -debug-pass=Structure foo.bc -o foo.s
>> (snip)
>> Code Placement Optimizer
>> SSE execution domain fixup
>> Machine Natural Loop Construction
>> X86 AT&T-Style Assembly Printer
>> Delete Garbage Collector Information
>>
>> foo.s: (edited)
>> _foo:
>> movaps %xmm0, %xmm3
>> andps %xmm2, %xmm3
>> andnps %xmm1, %xmm2
>> movaps %xmm2, %xmm0
>> xorps %xmm3, %xmm0
>> ret
>
>
[Attachment: xmm.zip (application/zip, 2911 bytes)
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20100512/d9715376/attachment.zip>]