thr3ads.net - llvm dev - [LLVMdev] Lowering to MMX [Oct 2011]

Nicolas Capens

2011-Oct-20 15:42 UTC

[LLVMdev] Lowering to MMX

Hi all,

I'm working on a graphics project which uses LLVM for dynamic code 
generation, and I noticed a major performance regression when upgrading 
from LLVM 2.8 to 3.0-rc1 (LLVM 2.9 didn't support Win64 so I skipped it 
entirely).

I found out that the performance regression is due to removing support 
for lowering 64-bit vector operations to MMX, and using SSE2 instead. My 
code uses a mix of MMX intrinsics and v4i16 operations, so it ping-pongs 
back and forth between MMX and SSE2 instructions in the generated code.

To get more optimal code, I see three options, and I was wondering if 
someone could share some advice on which approach you think will work best:
1) I could use v8i16 or v4i32 instead of v4i16, but then the SSE 
register pressure would be significantly increased. I already use v4f32 
operations intensively so having the MMX registers available for 64-bit 
integer vector operations helps performance quite considerably on the 
register deprived x86 architecture. There's little to no opportunity for 
using v8i16 to perform two v4i16 operations simultaneously so that won't 
make up for the added register pressure. So I'm not keen to implement 
this option, unless anyone sees some advantages that I missed?
2) Since I use MMX intrinsics, I take care of inserting the appropriate 
EMMS instructions myself as well. So it's absolutely fine to have LLVM 
lower 64-bit operations into MMX instructions (the way it used to be in 
LLVM 2.8). Would it be straightforward to re-enable this? I noticed that 
revision 115243 removes the MMX lowering rules, but I don't know if the 
rest of LLVM 3.0 would still support them if I simply reverted them. 
Please note that I'm not an LLVM expert and I'd prefer not having to 
maintain local changes. Would there be any objection to having an 
'EnableMMX' flag (false by default)?
3) I believe all MMX instructions are available as intrinsics now? That 
would allow me to replace all straight LLVM operations with intrinsics. 
I'm just wondering what the downsides of that would be? I assume I won't
get any benefits from instruction combining, but things like dead code 
elimination still work?

Thank you for your time.

Cheers,
Nicolas

Bruno Cardoso Lopes

2011-Oct-25 01:30 UTC

[LLVMdev] Lowering to MMX

Hi Nicolas,
> I found out that the performance regression is due to removing support
> for lowering 64-bit vector operations to MMX, and using SSE2 instead. My
> code uses a mix of MMX intrinsics and v4i16 operations, so it ping-pongs
> back and forth between MMX and SSE2 instructions in the generated code.
>
> To get more optimal code, I see three options, and I was wondering if
> someone could share some advice on which approach you think will work best:
> 1) I could use v8i16 or v4i32 instead of v4i16, but then the SSE
> register pressure would be significantly increased. I already use v4f32
> operations intensively so having the MMX registers available for 64-bit
> integer vector operations helps performance quite considerably on the
> register deprived x86 architecture. There's little to no opportunity for
> using v8i16 to perform two v4i16 operations simultaneously so that won't
> make up for the added register pressure. So I'm not keen to implement
> this option, unless anyone sees some advantages that I missed?
> 2) Since I use MMX intrinsics, I take care of inserting the appropriate
> EMMS instructions myself as well. So it's absolutely fine to have LLVM
> lower 64-bit operations into MMX instructions (the way it used to be in
> LLVM 2.8). Would it be straightforward to re-enable this? I noticed that
> revision 115243 removes the MMX lowering rules, but I don't know if the
> rest of LLVM 3.0 would still support them if I simply reverted them.
> Please note that I'm not an LLVM expert and I'd prefer not having to
> maintain local changes. Would there be any objection to having an
> 'EnableMMX' flag (false by default)?
> 3) I believe all MMX instructions are available as intrinsics now? That
> would allow me to replace all straight LLVM operations with intrinsics.
> I'm just wondering what the downsides of that would be? I assume I won't
> get any benefits from instruction combining, but things like dead code
> elimination still work?
AFAIK, the only way to get MMX instructions now is to use the MMX
intrinsics with the newly defined MMX-specific vector type. So, if
you're really getting register pressure on 1), I would go for 3).


-- 
Bruno Cardoso Lopes
http://www.brunocardoso.cc

Bill Wendling

2011-Oct-25 01:50 UTC

[LLVMdev] Lowering to MMX

On Oct 20, 2011, at 8:42 AM, Nicolas Capens wrote:
> Hi all,
> 
> I'm working on a graphics project which uses LLVM for dynamic code 
> generation, and I noticed a major performance regression when upgrading 
> from LLVM 2.8 to 3.0-rc1 (LLVM 2.9 didn't support Win64 so I skipped it
> entirely).
> 
> I found out that the performance regression is due to removing support 
> for lowering 64-bit vector operations to MMX, and using SSE2 instead. My 
> code uses a mix of MMX intrinsics and v4i16 operations, so it ping-pongs 
> back and forth between MMX and SSE2 instructions in the generated code.
> 
> To get more optimal code, I see three options, and I was wondering if 
> someone could share some advice on which approach you think will work best:
> 1) I could use v8i16 or v4i32 instead of v4i16, but then the SSE 
> register pressure would be significantly increased. I already use v4f32 
> operations intensively so having the MMX registers available for 64-bit 
> integer vector operations helps performance quite considerably on the 
> register deprived x86 architecture. There's little to no opportunity for
> using v8i16 to perform two v4i16 operations simultaneously so that won't
> make up for the added register pressure. So I'm not keen to implement
> this option, unless anyone sees some advantages that I missed?
It's my understanding that SSE is by far superior to MMX for a number of
reasons, not the least of which is the need to use the expensive EMMS
instruction. Instead of guessing about the performance impact, I would encourage
you to test this out.
> 2) Since I use MMX intrinsics, I take care of inserting the appropriate 
> EMMS instructions myself as well. So it's absolutely fine to have LLVM 
> lower 64-bit operations into MMX instructions (the way it used to be in 
> LLVM 2.8). Would it be straightforward to re-enable this? I noticed that 
> revision 115243 removes the MMX lowering rules, but I don't know if the
> rest of LLVM 3.0 would still support them if I simply reverted them. 
> Please note that I'm not an LLVM expert and I'd prefer not having to
> maintain local changes. Would there be any objection to having an
> 'EnableMMX' flag (false by default)?
Having the EnableMMX flag is not an option. And the changes are significant, so
you wouldn't be able to re-enable the MMX stuff without a major overhaul of
the system.
> 3) I believe all MMX instructions are available as intrinsics now? That 
> would allow me to replace all straight LLVM operations with intrinsics. 
> I'm just wondering what the downsides of that would be? I assume I won't
> get any benefits from instruction combining, but things like dead code
> elimination still work?
Intrinsics are the only way to go if you want MMX code. We do as much as we can,
but to be honest optimizing for MMX is not a high priority for us.

-bw

Nicolas Capens

2011-Oct-25 16:24 UTC

[LLVMdev] Lowering to MMX

Thanks Bruno. I started replacing 64-bit vector operations with explicit 
MMX intrinsics, and the results look fairly promising so far.

On 24/10/2011 9:30 PM, Bruno Cardoso Lopes wrote:
> Hi Nicolas,
>
>> I found out that the performance regression is due to removing support
>> for lowering 64-bit vector operations to MMX, and using SSE2 instead. My
>> code uses a mix of MMX intrinsics and v4i16 operations, so it ping-pongs
>> back and forth between MMX and SSE2 instructions in the generated code.
>>
>> To get more optimal code, I see three options, and I was wondering if
>> someone could share some advice on which approach you think will work best:
>> 1) I could use v8i16 or v4i32 instead of v4i16, but then the SSE
>> register pressure would be significantly increased. I already use v4f32
>> operations intensively so having the MMX registers available for 64-bit
>> integer vector operations helps performance quite considerably on the
>> register deprived x86 architecture. There's little to no opportunity for
>> using v8i16 to perform two v4i16 operations simultaneously so that won't
>> make up for the added register pressure. So I'm not keen to implement
>> this option, unless anyone sees some advantages that I missed?
>> 2) Since I use MMX intrinsics, I take care of inserting the appropriate
>> EMMS instructions myself as well. So it's absolutely fine to have LLVM
>> lower 64-bit operations into MMX instructions (the way it used to be in
>> LLVM 2.8). Would it be straightforward to re-enable this? I noticed that
>> revision 115243 removes the MMX lowering rules, but I don't know if the
>> rest of LLVM 3.0 would still support them if I simply reverted them.
>> Please note that I'm not an LLVM expert and I'd prefer not having to
>> maintain local changes. Would there be any objection to having an
>> 'EnableMMX' flag (false by default)?
>> 3) I believe all MMX instructions are available as intrinsics now? That
>> would allow me to replace all straight LLVM operations with intrinsics.
>> I'm just wondering what the downsides of that would be? I assume I won't
>> get any benefits from instruction combining, but things like dead code
>> elimination still work?
> AFAIK, the only way to get MMX instructions now is using the MMX
> intrinsics with the new defined MMX specific vector types! So, if
> you're really getting register pressure on 1), I would go for 3).
>
>

Nicolas Capens

2011-Oct-26 20:18 UTC

[LLVMdev] Lowering to MMX

Hi Bill,

Comments inline:

On 24/10/2011 9:50 PM, Bill Wendling wrote:
> On Oct 20, 2011, at 8:42 AM, Nicolas Capens wrote:
>
>> Hi all,
>>
>> I'm working on a graphics project which uses LLVM for dynamic code
>> generation, and I noticed a major performance regression when upgrading
>> from LLVM 2.8 to 3.0-rc1 (LLVM 2.9 didn't support Win64 so I skipped it
>> entirely).
>>
>> I found out that the performance regression is due to removing support
>> for lowering 64-bit vector operations to MMX, and using SSE2 instead. My
>> code uses a mix of MMX intrinsics and v4i16 operations, so it ping-pongs
>> back and forth between MMX and SSE2 instructions in the generated code.
>>
>> To get more optimal code, I see three options, and I was wondering if
>> someone could share some advice on which approach you think will work best:
>> 1) I could use v8i16 or v4i32 instead of v4i16, but then the SSE
>> register pressure would be significantly increased. I already use v4f32
>> operations intensively so having the MMX registers available for 64-bit
>> integer vector operations helps performance quite considerably on the
>> register deprived x86 architecture. There's little to no opportunity for
>> using v8i16 to perform two v4i16 operations simultaneously so that won't
>> make up for the added register pressure. So I'm not keen to implement
>> this option, unless anyone sees some advantages that I missed?
> It's my understanding that SSE is by far superior to MMX for a number of
> reasons, not the least of which is the need to use the expensive EMMS
> instruction. Instead of guessing about the performance impact, I would
> encourage you to test this out.

I'm already explicitly using EMMS, where necessary. Basically when
avoiding x87 (and avoiding library calls which may use x87), it's not
needed. So there's no performance drawback for using MMX in my case.

I've verified that combining MMX and SSE2 is significantly faster than 
using SSE2 alone. It basically gives me access to 8 more registers for 
64-bit integer vector operations, leaving the SSE registers available 
for floating-point and wider integer operations. Upgrading from LLVM 2.8 
to 3.0 degraded performance by 30%, and a quick look at the assembly 
made it clear that register pressure is a major issue.

>> 2) Since I use MMX intrinsics, I take care of inserting the appropriate
>> EMMS instructions myself as well. So it's absolutely fine to have LLVM
>> lower 64-bit operations into MMX instructions (the way it used to be in
>> LLVM 2.8). Would it be straightforward to re-enable this? I noticed that
>> revision 115243 removes the MMX lowering rules, but I don't know if the
>> rest of LLVM 3.0 would still support them if I simply reverted them.
>> Please note that I'm not an LLVM expert and I'd prefer not having to
>> maintain local changes. Would there be any objection to having an
>> 'EnableMMX' flag (false by default)?
> Having the EnableMMX flag is not an option. And the changes are
> significant, so you wouldn't be able to re-enable the MMX stuff without a
> major overhaul of the system.

Ok, thanks for confirming that this would be too complex.

>> 3) I believe all MMX instructions are available as intrinsics now? That
>> would allow me to replace all straight LLVM operations with intrinsics.
>> I'm just wondering what the downsides of that would be? I assume I won't
>> get any benefits from instruction combining, but things like dead code
>> elimination still work?
> Intrinsics are the only way to go if you want MMX code. We do as much as we
> can, but to be honest optimizing for MMX is not a high priority for us.

I fully understand that having LLVM insert EMMS instructions and trying
to prevent it from degrading performance just wasn't worthwhile.
Fortunately explicit use of MMX intrinsics is fine for my use.

I'm having one remaining issue though; I can't seem to generate the movd
instruction(s) (moving 32 bits of data in and out of the lower half of
an MMX register). Take for example the following LLVM IR:

define internal void @unpack(i8*, i8*) {
   %3 = bitcast i8* %1 to i32*
   %4 = load i32* %3, align 1
   %5 = insertelement <2 x i32> undef, i32 %4, i32 0
   %6 = bitcast <2 x i32> %5 to x86_mmx
   %7 = call x86_mmx @llvm.x86.mmx.punpcklbw(x86_mmx %6, x86_mmx %6)
   %8 = bitcast i8* %0 to x86_mmx*
   store x86_mmx %7, x86_mmx* %8, align 1
   ret void
}
declare x86_mmx @llvm.x86.mmx.punpcklbw(x86_mmx, x86_mmx) nounwind readnone

Which gives me the following assembly code:

  push        ebp
  mov         ebp,esp
  and         esp,0FFFFFFF0h
  sub         esp,20h
  mov         eax,dword ptr [ebp+0Ch]
  movd        xmm0,dword ptr [eax]
  movapd      xmmword ptr [esp],xmm0
  movq        mm0,mmword ptr [esp]
  punpcklbw   mm0,mm0
  mov         eax,dword ptr [ebp+8]
  movq        mmword ptr [eax],mm0
  emms
  mov         esp,ebp
  pop         ebp
  ret

The inner portion could look like this instead:

  movd        mm0,dword ptr [eax]
  punpcklbw   mm0,mm0

Should I be using other IR operations to get this result, or are the 
matching patterns missing? Or would it perhaps be best to make movd 
available as an intrinsic as well (note that it has four varieties for MMX)?
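
For reference, the computation that @unpack performs can be sketched in C.
This is only an illustration and not code from the project: the function
names unpack_scalar and unpack_mmx are made up here, and the intrinsic path
assumes a compiler that ships <mmintrin.h> and defines __MMX__ (otherwise it
falls back to the scalar loop). The compiler remains free to pick the actual
instructions, so this shows the intended semantics (32-bit load, punpcklbw
of the register with itself, 64-bit store), not a guaranteed movd.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Portable reference: interleave each of the 4 source bytes with itself,
 * which is the effect of punpcklbw mm0,mm0 on the low 32 bits of mm0. */
void unpack_scalar(uint8_t dst[8], const uint8_t src[4])
{
    assert(dst != NULL && src != NULL);
    for (int i = 0; i < 4; ++i) {
        dst[2 * i]     = src[i];
        dst[2 * i + 1] = src[i];
    }
}

/* Intrinsic version (hypothetical helper). Falls back to the scalar
 * reference when the compiler does not target MMX. */
void unpack_mmx(uint8_t dst[8], const uint8_t src[4])
{
#if defined(__MMX__)
    int32_t word;
    __m64 v;

    assert(dst != NULL && src != NULL);
    memcpy(&word, src, sizeof word);  /* alignment-safe 32-bit load */
    v = _mm_cvtsi32_si64(word);       /* movd semantics: upper half zeroed */
    v = _mm_unpacklo_pi8(v, v);       /* punpcklbw of v with itself */
    memcpy(dst, &v, sizeof v);        /* 64-bit store */
    _mm_empty();                      /* emms, before any x87 code runs */
#else
    unpack_scalar(dst, src);
#endif
}
```

Both functions should produce the same eight bytes for any input, so the
scalar version can serve as a check on whatever the intrinsic path compiles
to.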

Thanks again,
Nicolas
