thr3ads.net - llvm dev - [LLVMdev] Simple NEON optimization [Nov 2010]

Renato Golin

2010-Nov-12 15:23 UTC

[LLVMdev] Simple NEON optimization

Hi folks, me again,

So, I want to implement a simple optimization for a NEON case I came
across recently, mostly as an exercise, but it also simplifies (just
a bit) the generated code.

The case is simple:

        uint32x2_t x, res;
        res = vceq_u32(x, vcreate_u32(0));

This will generate the following code:

        ; zero d16
        vmov.i32        d16, #0x0
        ; load a into d17
        movw    r0, :lower16:a
        movt    r0, :upper16:a
        vld1.32 {d17}, [r0]
        ; compare two registers
        vceq.i32        d17, d17, d16

But, because the vector is zero, and there is a NEON instruction to
compare against an immediate zero (VCEQZ), we could combine the two
instructions:

        ; load a into d17
        movw    r0, :lower16:a
        movt    r0, :upper16:a
        vld1.32 {d17}, [r0]
        ; compare against immediate zero
        vceq.i32        d17, d17, #0

thus, saving the VMOV.

I know, it's not much, but it's a good start for me to get the hang of
writing such passes.

So, should I put this as a special case in NEON lowering or make it as
part of an optimization pass? Which classes should I look first?

-- 
cheers,
--renato

Bob Wilson

2010-Nov-12 17:52 UTC

On Nov 12, 2010, at 7:23 AM, Renato Golin wrote:
> Hi folks, me again,
> 
> So, I want to implement a simple optimization for a NEON case I came
> across recently, mostly as an exercise, but it also simplifies (just
> a bit) the generated code.
> 
> The case is simple:
> 
>        uint32x2_t x, res;
>        res = vceq_u32(x, vcreate_u32(0));
> 
> This will generate the following code:
> 
>        ; zero d16
>        vmov.i32        d16, #0x0
>        ; load a into d17
>        movw    r0, :lower16:a
>        movt    r0, :upper16:a
>        vld1.32 {d17}, [r0]
>        ; compare two registers
>        vceq.i32        d17, d17, d16
> 
> But, because the vector is zero, and there is a NEON instruction to
> compare against an immediate zero (VCEQZ), we could combine the two
> instructions:
> 
>        ; load a into d17
>        movw    r0, :lower16:a
>        movt    r0, :upper16:a
>        vld1.32 {d17}, [r0]
>        ; compare against immediate zero
>        vceq.i32        d17, d17, #0
> 
> thus, saving the VMOV.
> 
> I know, it's not much, but it's a good start for me to get the hang of
> writing such passes.

This would be a nice optimization, and it's not a bad place to get started
down in the depths of LLVM codegen.

> 
> So, should I put this as a special case in NEON lowering or make it as
> part of an optimization pass? Which classes should I look first?

I recommend implementing this as a target-specific DAG combine optimization.  We
already have target-specific DAG nodes for the relevant NEON comparison
operations (ARMISD::VCEQ, etc. -- see ARMISelLowering.h) as well as the vmov
(ARMISD::VMOVIMM).  You just need to teach the DAG combiner how to fold them
together.  Here's what you need to do (all of this code is in
ARMISelLowering.cpp):

0. (You don't actually need to do anything, but I'm just mentioning it
FYI.) For selection DAG nodes that are not target-specific, you need to inform
the DAG combiner that you want to do some target-specific combining.  Look for
calls to setTargetDAGCombine() for examples of this.  For this case, the
relevant nodes are all target-specific, so the DAG combiner will call the
target-specific combining hook anyway.

1. Add the ARMISD::VCEQ etc. nodes to the switch in
ARMTargetLowering::PerformDAGCombine.

2. Add a function to be called for those comparisons that checks if one operand
is an ARMISD::VMOVIMM node with an immediate of zero.  Note for future reference
that the actual operand of VMOVIMM is an encoded value that represents one of
the possible vector immediates for the "one register plus a modified
immediate" format.  In this case it doesn't matter because the
canonical encoding of a zero vector is just zero.  When you find that case, use
DAG.getNode() to return a new node for the compare against zero operation.  The
PerformShiftCombine function is a fairly simple example of what needs to be done
(although it's doing a completely different combination).

3. Write a testcase and make sure it works.
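
As a rough illustration of steps 1 and 2, here is a self-contained toy model in C++. It deliberately does not use the real LLVM classes: ToyDAG, Node, Opcode, isZeroVector, performVCEQCombine and the VCEQZ opcode name are all assumptions made up for this sketch, and the actual code in ARMISelLowering.cpp would work on SDNode/SDValue and the ARMISD opcodes instead. It only shows the shape of the fold: dispatch on the compare node, check whether either operand is a VMOVIMM whose encoded immediate is zero, and build a compare-against-zero node if so.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-ins for DAG opcodes; NOT the real ARMISD enum.
enum class Opcode { Load, VMOVIMM, VCEQ, VCEQZ };

struct Node {
  Opcode Op;
  std::vector<const Node *> Operands;
  uint64_t Imm = 0; // encoded immediate, only meaningful for VMOVIMM
};

// Hypothetical miniature DAG that owns its nodes.
struct ToyDAG {
  std::vector<std::unique_ptr<Node>> Nodes;

  const Node *getNode(Opcode Op, std::vector<const Node *> Ops = {},
                      uint64_t Imm = 0) {
    Nodes.push_back(std::make_unique<Node>(Node{Op, std::move(Ops), Imm}));
    return Nodes.back().get();
  }
};

// Step 2: is this operand a VMOVIMM of zero? The canonical encoding of
// a zero vector is just zero, so a plain comparison is enough here.
static bool isZeroVector(const Node *N) {
  return N->Op == Opcode::VMOVIMM && N->Imm == 0;
}

// Returns the replacement node, or nullptr if the fold does not apply.
const Node *performVCEQCombine(const Node *N, ToyDAG &DAG) {
  assert(N->Op == Opcode::VCEQ && N->Operands.size() == 2);
  const Node *LHS = N->Operands[0];
  const Node *RHS = N->Operands[1];
  // Equality is commutative, so the zero vector may be on either side.
  if (isZeroVector(RHS))
    return DAG.getNode(Opcode::VCEQZ, {LHS});
  if (isZeroVector(LHS))
    return DAG.getNode(Opcode::VCEQZ, {RHS});
  return nullptr;
}

// Step 1: the switch that routes compare nodes to the new function.
const Node *performDAGCombine(const Node *N, ToyDAG &DAG) {
  switch (N->Op) {
  case Opcode::VCEQ:
    return performVCEQCombine(N, DAG);
  default:
    return nullptr; // no target-specific combine for this node
  }
}
```

In this model, building a VCEQ node whose second operand is a VMOVIMM of zero and passing it to performDAGCombine gives back a one-operand VCEQZ node, while a compare against a non-zero VMOVIMM is left alone (nullptr).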

Thanks for offering to work on this!

Renato Golin

2010-Nov-12 18:42 UTC

On 12 November 2010 17:52, Bob Wilson <bob.wilson at apple.com> wrote:
> I recommend implementing this as a target-specific DAG combine
> optimization.  We already have target-specific DAG nodes for the relevant NEON
> comparison operations (ARMISD::VCEQ, etc. -- see ARMISelLowering.h) as well as
> the vmov (ARMISD::VMOVIMM).  You just need to teach the DAG combiner how to fold
> them together.  Here's what you need to do (all of this code is in
> ARMISelLowering.cpp):

Hi Bob,

I thought so... I'll get cracking and see if I can generate some simple
tests.

Thank you very much for the detailed explanation!

cheers,
--renato
