On 29 Sep 2016, at 01:25, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
>
> Hi David,
>
> David Chisnall via llvm-dev wrote:
> > On 28 Sep 2016, at 16:50, Philip Reames via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> >> At this point, my general view is that widening transformations of
> >> any kind should be done very late. Ideally, this is something the
> >> backend would do, but doing it as a CGP-like fixup pass over the IR
> >> is also reasonable.
> >
> > I’m really glad to see that this is gone in GVN - it will reduce our
> > diffs a lot when we do the next import. The GVN load widening is not
> > sound in the presence of hardware-enforced spatial memory safety, so
> > we ended up with the compiler inserting things that caused hardware
> > bounds checks to fail and had to disable it in a few places.
> >
> > If you’re reintroducing it, please can we have a backend option to
> > specify whether it’s valid to widen loads beyond the extents (for
> > example, for us it is not valid to widen an i16 load followed by an i8
> > load to an i32 load). Ideally, we’d also want the back end to supply
>
> Don't you have to mark the functions you generate as
> "Attribute::SanitizeAddress"? We should definitely make the
> speculative form of this transform (i.e. "load i32, load i16" -->
> "load i64") predicated on Attribute::SanitizeAddress.

Nope, we’re not using the address sanitiser. Our architecture supports
byte-granularity bounds checking in hardware.

Note that even without this, for pure MIPS code without our extensions,
load widening generates significantly worse code than when it doesn’t
happen. I’m actually finding it difficult to come up with a
microarchitecture where a 16-bit load followed by an 8-bit load from the
same cache line would give worse performance than a 32-bit load, a mask
and a shift. In an in-order design, it’s more instructions to do the
same work, and therefore slower. In an out-of-order design, the two
loads within the cache line will likely be dispatched simultaneously and
you’ll have less pressure on the register rename engine.

It seems like something that will only give wins if it exposes
optimisation opportunities to other transforms and, as such, should
probably be exposed as an analysis so that the later transforms can do
the combine if it actually makes sense to do so.

David
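For concreteness, here is a rough C rendering of the transform under
discussion (the pass itself works on LLVM IR; the function names and the
3-byte object are invented for illustration, and little-endian layout is
assumed):

    #include <stdint.h>
    #include <string.h>

    /* 'p' points at a 3-byte object: a 2-byte value followed by a
       1-byte value. */
    static uint32_t load_narrow(const uint8_t *p) {
        uint16_t lo;
        memcpy(&lo, p, 2);                      /* in-bounds i16 load */
        uint8_t hi = p[2];                      /* in-bounds i8 load  */
        return (uint32_t)lo | ((uint32_t)hi << 16);
    }

    static uint32_t load_widened(const uint8_t *p) {
        uint32_t v;
        memcpy(&v, p, 4);     /* i32 load: reads one byte past the object */
        return v & 0x00FFFFFFu;                 /* mask off the extra byte */
    }

The widened form reads a byte that was never part of the original
object, which is the kind of access that byte-granularity hardware
bounds checking rejects, and on plain MIPS it also trades two cheap
loads for a load plus a mask.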
Hi David,

David Chisnall wrote:
> Nope, we’re not using the address sanitiser. Our architecture supports
> byte-granularity bounds checking in hardware.

I mentioned address sanitizer since (then) I thought your architecture
would have to prohibit the same kinds of transforms that address
sanitizer has to prohibit. However, on second thought, I think I have a
counter-example to my statement above -- I suppose your architecture
only checks bounds and not that the location being loaded from is
initialized?

> Note that even without this, for pure MIPS code without our
> extensions, load widening generates significantly worse code than when
> it doesn’t happen. I’m actually finding it difficult to come up with
> a microarchitecture where a 16-bit load followed by an 8-bit load from
> the same cache line would give worse performance than a 32-bit load, a
> mask and a shift. In an in-order design, it’s more instructions to do
> the same work, and therefore slower. In an out-of-order design, the
> two loads within the cache line will likely be dispatched
> simultaneously and you’ll have less pressure on the register rename
> engine.

That makes sense, but what do you think of Artur's suggestion of
catching only the obvious patterns? That is, catching only cases like

  i16* ptr = ...
  i32 val = ptr[0] | (ptr[1] << 16);

  ==>  // subject to endianness

  i16* ptr = ...
  i32 val = *(i32*) ptr;

To me that seems like a win (or at least, not a loss) on any
architecture. However, I will admit that I've only ever worked on x86
so I have a lot of blind spots here.

Thanks!
-- Sanjoy
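For what it's worth, the pattern above can be written out as a small C
sketch (hypothetical names; memcpy is used only to dodge alignment and
strict-aliasing questions, and little-endian layout is assumed, per the
endianness caveat):

    #include <stdint.h>
    #include <string.h>

    /* Before: two adjacent, naturally aligned 2-byte loads. */
    static uint32_t combine_before(const uint16_t *ptr) {
        return (uint32_t)ptr[0] | ((uint32_t)ptr[1] << 16);
    }

    /* After: a single 4-byte load covering exactly the same bytes. */
    static uint32_t combine_after(const uint16_t *ptr) {
        uint32_t val;
        memcpy(&val, ptr, sizeof val);
        return val;
    }

Since the wide load covers exactly the bytes the two narrow loads
already touch, the earlier bounds objection presumably doesn't arise
here; the remaining question is alignment and codegen cost, which the
next reply picks up.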
On 29 Sep 2016, at 18:56, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
>
> That makes sense, but what do you think of Artur's suggestion of
> catching only the obvious patterns? That is, catching only cases like
>
>   i16* ptr = ...
>   i32 val = ptr[0] | (ptr[1] << 16);
>
>   ==>  // subject to endianness
>
>   i16* ptr = ...
>   i32 val = *(i32*) ptr;
>
> To me that seems like a win (or at least, not a loss) on any
> architecture. However, I will admit that I've only ever worked on x86
> so I have a lot of blind spots here.

This is one of the cases that actually is a loss on many MIPS platforms,
unless ptr[0] is 4-byte aligned. If it is 2-byte aligned (which is the
only place I’ve seen this crop up in hot loops of benchmark code, which
is the only place where I’ve looked seriously to see why we’re
generating bad code), we end up widening two 2-byte, 2-byte-aligned
loads into a single 4-byte, 2-byte-aligned load. The latter sequence is
more expensive.

Even on x86, this holds dynamically: you’ll hit a slower microcode path
on a lot of x86 implementations if you dispatch an unaligned 4-byte load
than if you dispatch a pair of aligned 2-byte loads. If the two 2-byte
loads are adjacent and 4-byte aligned (and therefore within the same
cache line), then you’ll often see micro-op fusion combine them back
into a 4-byte load, depending on the surrounding instructions.

Even then, it’s only a win on x86 if val is the only use of the result.
We were widening loads that had independent uses, so we ended up loading
into one register and then splitting out portions of the result, which
put a lot more pressure on the register rename unit[1]. This has a
decidedly non-linear performance outcome: often it’s fine, but if it’s
in a loop and it crosses a threshold then you see about a 50%
performance decrease (on recent x86 microarchitectures).

David

[1] A few years ago, I would have said cache usage was the most
important thing to optimise for as a first pass. With modern superscalar
chips, register rename pressure is now even more important. This was
really driven home to me a few months ago when I made a loop run twice
as fast on the two most recent Intel microarchitectures by inserting an
xor %rax, %rax at the top - explicitly killing the false dependency
between loop iterations made more of a difference than all of the other
optimisations that we’d done, combined.
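A small sketch of the alignment hazard described above (the names and
the loop are invented for illustration): the halfwords sit at offset 2
of a 4-byte-aligned buffer, so each 2-byte load is naturally aligned
while the combined 4-byte load would only be 2-byte aligned.

    #include <stdint.h>
    #include <string.h>

    static uint32_t sum_pairs(const uint8_t *buf /* 4-byte aligned */,
                              size_t n_pairs) {
        uint32_t sum = 0;
        const uint8_t *p = buf + 2;          /* only 2-byte aligned */
        for (size_t i = 0; i < n_pairs; i++, p += 4) {
            uint16_t lo, hi;
            memcpy(&lo, p, 2);               /* aligned 2-byte load */
            memcpy(&hi, p + 2, 2);           /* aligned 2-byte load */
            /* Widening these into one 4-byte load would produce a
               4-byte, 2-byte-aligned access - the expensive case on
               MIPS, and a slow path on some x86 parts as well. */
            sum += (uint32_t)lo + (uint32_t)hi;
        }
        return sum;
    }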