On 29 Sep 2016, at 01:25, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
>
> Hi David,
>
> David Chisnall via llvm-dev wrote:
> > On 28 Sep 2016, at 16:50, Philip Reames via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> >> At this point, my general view is that widening transformations of
> >> any kind should be done very late. Ideally, this is something the
> >> backend would do, but doing it as a CGP-like fixup pass over the IR
> >> is also reasonable.
> >
> > I’m really glad to see that this is gone in GVN - it will reduce our
> > diffs a lot when we do the next import. The GVN load widening is not
> > sound in the presence of hardware-enforced spatial memory safety, so
> > we ended up with the compiler inserting things that caused hardware
> > bounds checks to fail and had to disable it in a few places.
> >
> > If you’re reintroducing it, please can we have a backend option to
> > specify whether it’s valid to widen loads beyond the extents (for
> > example, for us it is not valid to widen an i16 load followed by an i8
> > load to an i32 load). Ideally, we’d also want the back end to supply
>
> Don't you have to mark the functions you generate as
> "Attribute::SanitizeAddress"? We should definitely make the
> speculative form of this transform (i.e. "load i32, load i16" -->
> "load i64") predicated on Attribute::SanitizeAddress.

Nope, we’re not using the address sanitiser. Our architecture supports
byte-granularity bounds checking in hardware.

Note that even without this, for pure MIPS code without our extensions,
load widening generates significantly worse code than when it doesn’t
happen. I’m actually finding it difficult to come up with a
microarchitecture where a 16-bit load followed by an 8-bit load from the
same cache line would give worse performance than a 32-bit load, a mask
and a shift. In an in-order design, it’s more instructions to do the
same work, and therefore slower. In an out-of-order design, the two
loads within the cache line will likely be dispatched simultaneously and
you’ll have less pressure on the register rename engine.

It seems like something that will only give wins if it exposes
optimisation opportunities to other transforms and, as such, should
probably be exposed as an analysis so that the later transforms can do
the combine if it actually makes sense to do so.

David
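For concreteness, here is a rough C rendering of the transform under
discussion (the pass itself works on LLVM IR; the function names and the
3-byte object are invented for illustration, and little-endian layout is
assumed):

    #include <stdint.h>
    #include <string.h>

    /* 'p' points at a 3-byte object: a 2-byte value followed by a
       1-byte value. */
    static uint32_t load_narrow(const uint8_t *p) {
        uint16_t lo;
        memcpy(&lo, p, 2);                      /* in-bounds i16 load */
        uint8_t hi = p[2];                      /* in-bounds i8 load  */
        return (uint32_t)lo | ((uint32_t)hi << 16);
    }

    static uint32_t load_widened(const uint8_t *p) {
        uint32_t v;
        memcpy(&v, p, 4);     /* i32 load: reads one byte past the object */
        return v & 0x00FFFFFFu;                 /* mask off the extra byte */
    }

The widened form reads a byte that was never part of the original
object, which is the kind of access that byte-granularity hardware
bounds checking rejects, and on plain MIPS it also trades two cheap
loads for a load plus a mask.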
Hi David,

David Chisnall wrote:
> Nope, we’re not using the address sanitiser. Our architecture supports
> byte-granularity bounds checking in hardware.

I mentioned address sanitizer since (then) I thought your architecture
would have to prohibit the same kinds of transforms that address
sanitizer has to prohibit. However, on second thought, I think I have a
counter-example to my statement above -- I suppose your architecture
only checks bounds and not that the location being loaded from is
initialized?

> Note that even without this, for pure MIPS code without our
> extensions, load widening generates significantly worse code than when
> it doesn’t happen. I’m actually finding it difficult to come up with
> a microarchitecture where a 16-bit load followed by an 8-bit load from
> the same cache line would give worse performance than a 32-bit load, a
> mask and a shift. In an in-order design, it’s more instructions to do
> the same work, and therefore slower. In an out-of-order design, the
> two loads within the cache line will likely be dispatched
> simultaneously and you’ll have less pressure on the register rename
> engine.

That makes sense, but what do you think of Artur's suggestion of
catching only the obvious patterns? That is, catching only cases like

  i16* ptr = ...
  i32 val = ptr[0] | (ptr[1] << 16);

  ==>  // subject to endianness

  i16* ptr = ...
  i32 val = *(i32*) ptr;

To me that seems like a win (or at least, not a loss) on any
architecture. However, I will admit that I've only ever worked on x86
so I have a lot of blind spots here.

Thanks!
-- Sanjoy
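For what it's worth, the pattern above can be written out as a small C
sketch (hypothetical names; memcpy is used only to dodge alignment and
strict-aliasing questions, and little-endian layout is assumed, per the
endianness caveat):

    #include <stdint.h>
    #include <string.h>

    /* Before: two adjacent, naturally aligned 2-byte loads. */
    static uint32_t combine_before(const uint16_t *ptr) {
        return (uint32_t)ptr[0] | ((uint32_t)ptr[1] << 16);
    }

    /* After: a single 4-byte load covering exactly the same bytes. */
    static uint32_t combine_after(const uint16_t *ptr) {
        uint32_t val;
        memcpy(&val, ptr, sizeof val);
        return val;
    }

Since the wide load covers exactly the bytes the two narrow loads
already touch, the earlier bounds objection presumably doesn't arise
here; the remaining question is alignment and codegen cost, which the
next reply picks up.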
On 29 Sep 2016, at 18:56, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
>
> That makes sense, but what do you think of Artur's suggestion of
> catching only the obvious patterns? That is, catching only cases like
>
>   i16* ptr = ...
>   i32 val = ptr[0] | (ptr[1] << 16);
>
>   ==>  // subject to endianness
>
>   i16* ptr = ...
>   i32 val = *(i32*) ptr;
>
> To me that seems like a win (or at least, not a loss) on any
> architecture. However, I will admit that I've only ever worked on x86
> so I have a lot of blind spots here.

This is one of the cases that actually is a loss on many MIPS platforms,
unless ptr[0] is 4-byte aligned. If it is 2-byte aligned (which is the
only place I’ve seen this crop up in hot loops of benchmark code, which
is the only place where I’ve looked seriously to see why we’re
generating bad code), we end up widening two 2-byte, 2-byte-aligned
loads into a single 4-byte, 2-byte-aligned load. The latter sequence is
more expensive.

Even on x86, this holds dynamically: you’ll hit a slower microcode path
on a lot of x86 implementations if you dispatch an unaligned 4-byte load
than if you dispatch a pair of aligned 2-byte loads. If the two 2-byte
loads are adjacent and 4-byte aligned (and therefore within the same
cache line), then you’ll often see micro-op fusion combine them back
into a 4-byte load, depending on the surrounding instructions.

Even then, it’s only a win on x86 if val is the only use of the result.
We were widening loads that had independent uses, so we ended up loading
into one register and then splitting out portions of the result, which
put a lot more pressure on the register rename unit[1]. This has a
decidedly non-linear performance outcome: often it’s fine, but if it’s
in a loop and it crosses a threshold then you see about a 50%
performance decrease (on recent x86 microarchitectures).

David

[1] A few years ago, I would have said cache usage was the most
important thing to optimise for as a first pass. With modern superscalar
chips, register rename pressure is now even more important. This was
really driven home to me a few months ago when I made a loop run twice
as fast on the two most recent Intel microarchitectures by inserting an
xor %rax, %rax at the top - explicitly killing the false dependency
between loop iterations made more of a difference than all of the other
optimisations that we’d done, combined.
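A small sketch of the alignment hazard described above (the names and
the loop are invented for illustration): the halfwords sit at offset 2
of a 4-byte-aligned buffer, so each 2-byte load is naturally aligned
while the combined 4-byte load would only be 2-byte aligned.

    #include <stdint.h>
    #include <string.h>

    static uint32_t sum_pairs(const uint8_t *buf /* 4-byte aligned */,
                              size_t n_pairs) {
        uint32_t sum = 0;
        const uint8_t *p = buf + 2;          /* only 2-byte aligned */
        for (size_t i = 0; i < n_pairs; i++, p += 4) {
            uint16_t lo, hi;
            memcpy(&lo, p, 2);               /* aligned 2-byte load */
            memcpy(&hi, p + 2, 2);           /* aligned 2-byte load */
            /* Widening these into one 4-byte load would produce a
               4-byte, 2-byte-aligned access - the expensive case on
               MIPS, and a slow path on some x86 parts as well. */
            sum += (uint32_t)lo + (uint32_t)hi;
        }
        return sum;
    }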