thr3ads.net - llvm dev - [LLVMdev] Missing InstCombine optimization. [Jun 2013]

If this information is useful, please help other people find it:
Share via:

Bader, Aleksey A

2013-Jun-04 13:27 UTC

[LLVMdev] Missing InstCombine optimization.

Hi Jakob,

I've a problem related to the commit #155362.

Consider the following snippet:
void bar(float* f) {
...
}
void foo(float* f, int idx) {
int hi = idx>>3;
int lo = idx&7;
bar(&f[hi*8+lo]); // hi*8 + lo == idx
bar(&f[hi*10+lo]);
}

Before 155362 revision InstCombine was able to optimize hi*8+lo to idx by
applying following patterns:

1.       hi*8 -> hi << 3

2.       ((idx >> 3) << 3) -> idx & -8

3.       hi*8+lo -> hi*8 | lo

4.       (idx & -8) | (idx & 7) -> idx & (-8 | 7) -> idx

After 155362 pattern #2 is deferred to DAGCombine stage, so InstCombine is
unable to apply pattern #4:
        4*. ((idx >> 3) << 3) | (idx & 7) -> idx //
SimplifyOr can't handle it.

So now LLVM IR contains a couple of redundant operations:
  %mul312 = shl nsw i32 %shr, 3 ; hi*8
  %add313 = or i32 %mul312, %and ; hi*8 + lo == idx

These few additional operations over index prevent our analysis pass from
recognizing memory access patterns and I believe could harm not only us.
I think 4* optimization can be safely done at LLVM IR level.
Can you suggest the best way to fix this issue? I suppose adding new pattern
just for particular 4* case is not the best way.

Thanks,
Aleksey

--------------------------------------------------------------------
Closed Joint Stock Company Intel A/O
Registered legal address: Krylatsky Hills Business Park, 
17 Krylatskaya Str., Bldg 4, Moscow 121614, 
Russian Federation

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130604/e9b8f677/attachment.html>

Rafael Espíndola

2013-Jun-04 19:57 UTC

head link

[LLVMdev] Missing InstCombine optimization.

> 1.       hi*8 -> hi << 3
>
> 2.       ((idx >> 3) << 3) -> idx & -8
>
> 3.       hi*8+lo -> hi*8 | lo
>
> 4.       (idx & -8) | (idx & 7) -> idx & (-8 | 7) -> idx
>
>
>
> After 155362 pattern #2 is deferred to DAGCombine stage, so InstCombine is
> unable to apply pattern #4:
>
>         4*. ((idx >> 3) << 3) | (idx & 7) -> idx //
SimplifyOr can’t handle
> it.
>
>
>
> So now LLVM IR contains a couple of redundant operations:
>
>   %mul312 = shl nsw i32 %shr, 3 ; hi*8
>
>   %add313 = or i32 %mul312, %and ; hi*8 + lo == idx
>
>
>
> These few additional operations over index prevent our analysis pass from
> recognizing memory access patterns and I believe could harm not only us.
>
> I think 4* optimization can be safely done at LLVM IR level.
>
> Can you suggest the best way to fix this issue? I suppose adding new
pattern
> just for particular 4* case is not the best way.

We used handle arbitrary  ((idx >> A) << B), right? Maybe it is
enough
to handle   ((idx >> A) << A) in instcombine?

Cheers,
Rafael

Bader, Aleksey A

2013-Jun-04 20:39 UTC

head link

[LLVMdev] Missing InstCombine optimization.

Hi Rafael,
> We used handle arbitrary  ((idx >> A) << B), right? We used to handle three patterns, which were deferred:
(from Jakob's log message)
  These transformations are deferred:
  
    (X >>? C) << C   --> X & (-1 << C)  (When X
>> C has multiple uses)   <--- my case
    (X >>? C1) << C2 --> X << (C2-C1) & (-1 <<
C2)   (When C2 > C1)
    (X >>? C1) << C2 --> X >>? (C1-C2) & (-1 <<
C2)  (When C1 > C2)

So ((idx >> A) << A) (when idx >> A has multiple uses) was
intentionally deferred.
> Maybe it is enough to handle   ((idx >> A) << A) in
instcombine?It would be enough for my particular case, but apparently is not be enough for
some other cases.
I don't quite understand the motivation behind the original commit, that is
why I need your assistance on fixing this issue.
It seems like applying "((X >> C) << C) -> X & -C"
rule in instcombine is not efficient in some other cases.
Does anyone know such cases?
Otherwise the fix is very simple - just add back a few deleted lines of code.

Thanks,
Aleksey

-----Original Message-----
From: Rafael Espíndola [mailto:rafael.espindola at gmail.com] 
Sent: Tuesday, June 04, 2013 11:58 PM
To: Bader, Aleksey A
Cc: Jakob Stoklund Olesen; LLVM Developers Mailing List
Subject: Re: [LLVMdev] Missing InstCombine optimization.
> 1.       hi*8 -> hi << 3
>
> 2.       ((idx >> 3) << 3) -> idx & -8
>
> 3.       hi*8+lo -> hi*8 | lo
>
> 4.       (idx & -8) | (idx & 7) -> idx & (-8 | 7) -> idx
>
>
>
> After 155362 pattern #2 is deferred to DAGCombine stage, so 
> InstCombine is unable to apply pattern #4:
>
>         4*. ((idx >> 3) << 3) | (idx & 7) -> idx //
SimplifyOr can’t
> handle it.
>
>
>
> So now LLVM IR contains a couple of redundant operations:
>
>   %mul312 = shl nsw i32 %shr, 3 ; hi*8
>
>   %add313 = or i32 %mul312, %and ; hi*8 + lo == idx
>
>
>
> These few additional operations over index prevent our analysis pass 
> from recognizing memory access patterns and I believe could harm not only
us.
>
> I think 4* optimization can be safely done at LLVM IR level.
>
> Can you suggest the best way to fix this issue? I suppose adding new 
> pattern just for particular 4* case is not the best way.

We used handle arbitrary  ((idx >> A) << B), right? Maybe it is
enough
to handle   ((idx >> A) << A) in instcombine?

Cheers,
Rafael

--------------------------------------------------------------------
Closed Joint Stock Company Intel A/O
Registered legal address: Krylatsky Hills Business Park, 
17 Krylatskaya Str., Bldg 4, Moscow 121614, 
Russian Federation

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Jakob Stoklund Olesen

2013-Jun-04 22:38 UTC

head link

[LLVMdev] Missing InstCombine optimization.

On Jun 4, 2013, at 6:27 AM, "Bader, Aleksey A" <aleksey.a.bader at
intel.com> wrote:
> Hi Jakob,
>  
> I’ve a problem related to the commit #155362.
>  
> Consider the following snippet:
> void bar(float* f) {
> …
> }
> void foo(float* f, int idx) {
> int hi = idx>>3;
> int lo = idx&7;
> bar(&f[hi*8+lo]); // hi*8 + lo == idx
> bar(&f[hi*10+lo]);
> }
>  
> Before 155362 revision InstCombine was able to optimize hi*8+lo to idx by
applying following patterns:
> 1.       hi*8 -> hi << 3
> 2.       ((idx >> 3) << 3) -> idx & -8
> 3.       hi*8+lo -> hi*8 | lo
> 4.       (idx & -8) | (idx & 7) -> idx & (-8 | 7) -> idx
>  
> After 155362 pattern #2 is deferred to DAGCombine stage, so InstCombine is
unable to apply pattern #4:
>         4*. ((idx >> 3) << 3) | (idx & 7) -> idx //
SimplifyOr can’t handle it.
Actually, your own code illustrates the problem with this transformation.
Suppose you were using SCEV to analyze the behavior of the to f[] memory
references:

  f[hi*8 + lo]
  f[hi*10 + lo]

If you rewrite 'hi*8 + lo' to ‘idx’, the affine relationship between the
two memory references is no longer visible, and SCEV will probably tell you that
the offsets are unrelated.

The fundamental problem is that there is no such thing as a canonical form of an
expression DAG. Some expression graphs, like this one, can take different forms
that each enable different analyses. Which one is correct?

I think that the right approach in this case is to preserve the relationships
that were already exposed in the original code. That is why pattern 4* is
disabled when the inner expression has multiple uses. It preserves the
relationships that are expressed through the ‘hi’ variable.

Thanks,
/jakob

Possibly Parallel Threads

Search for more seemingly similar threads

llvm dev - Jun 2013 - [LLVMdev] Missing InstCombine optimization.

[LLVMdev] Missing InstCombine optimization.

[LLVMdev] Missing InstCombine optimization.

[LLVMdev] Missing InstCombine optimization.

[LLVMdev] Missing InstCombine optimization.

Possibly Parallel Threads