I'm looking at llvm-generated ARM code that has some unnecessary UXTB (zero extend) instructions, and it seems to me that doing type legalization as an entirely local transformation is not the best approach. I'm thinking in particular about legalizing integer types that need to be promoted to the target register size, i.e., i8 and i16 for ARM promoting to i32.

Currently we sign-extend or zero-extend values of these types at every place where they are defined and used. The definitions are no problem. Loads from memory can specify whether the value should be zero or sign extended, and arithmetic operations are going to produce 32-bit values regardless. Uses (in different basic blocks from the defs) are a different matter. We add explicit zero extend and sign extend operations for every use, despite the fact that the actual register values will have already been extended to 32 bits when they were defined.

It seems like we ought to have a pass to globally promote such types to the target register size.

Has anyone looked at this issue before? Is there already a solution in place that we just need to adopt for ARM? Thoughts on what to do otherwise?
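
For concreteness, here is a minimal sketch of the source shape being described: a byte-sized value defined in one basic block and used in others. The function name and body are hypothetical and are not taken from the code that prompted this thread. Whether a given LLVM version emits a redundant UXTB for it is an assumption, not something verified here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical example: `v` is an i8 value at the IR level, defined in the
// entry block and used in two successor blocks.
std::uint32_t scale_or_bump(const std::uint8_t *p, bool scale) {
  assert(p != nullptr);
  // Def: a byte load. On ARM this can be an LDRB, which already leaves the
  // value zero-extended in a 32-bit register.
  std::uint8_t v = *p;
  if (scale) {
    // Use in a different basic block: purely local legalization may
    // re-extend `v` here even though the register is already extended.
    return v * 3u;
  }
  // Second use, again in another block.
  return v + 1u;
}
```

Comparing the generated assembly for the two branches against the load in the entry block is one way to see whether an extend is repeated at the uses.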

On Aug 18, 2010, at 9:22 AM, Bob Wilson wrote:

> I'm looking at llvm-generated ARM code that has some unnecessary UXTB (zero extend) instructions, and it seems to me that doing type legalization as an entirely local transformation is not the best approach.

That's true, but doing isel as a purely local approach isn't the best either :-). We'd really like to get to whole-function selection dags at some point.

> I'm thinking in particular about legalizing integer types that need to be promoted to the target register size, i.e., i8 and i16 for ARM promoting to i32.
>
> Currently we sign-extend or zero-extend values of these types at every place where they are defined and used. The definitions are no problem. Loads from memory can specify whether the value should be zero or sign extended, and arithmetic operations are going to produce 32-bit values regardless. Uses (in different basic blocks from the defs) are a different matter. We add explicit zero extend and sign extend operations for every use, despite the fact that the actual register values will have already been extended to 32 bits when they were defined.
>
> It seems like we ought to have a pass to globally promote such types to the target register size.
>
> Has anyone looked at this issue before? Is there already a solution in place that we just need to adopt for ARM? Thoughts on what to do otherwise?

There are a couple of different tradeoffs you have to consider here. First, I'm going to assume that the defined value isn't already zero extended (so that the zexts on the uses aren't purely redundant). We have code that is supposed to eliminate the purely redundant ones, at least when the definition is emitted before the uses.

Some things to consider: When the input to the zext is spilled, the reload can be folded into the zext on almost all targets, making the zext free. When the zext *isn't* folded into a load, what you're really looking for is a code placement pass which tries to put the zexts in non-redundant (and non-partially redundant) places.

This sort of code placement pass could be done at the LLVM IR level (as a prelegalization like you mention), it could be done as a pre-regalloc machine pass, or as a post-regalloc machine pass.

The right answer depends on what and how much you care about this. If you're seeing fully redundant zexts, then I'd look into why machinecse isn't picking this up. If you're seeing partially redundant cases, then machine sink is missing something. If you're seeing reextends of already extended values, then it sounds like the heuristic to track that the live-out vreg is extended isn't working.

I tend to think that it isn't worth the compile time to try to microoptimize out every compare, but I could be convinced otherwise if there are important use cases we're failing to handle. I also do think that whole-function selection dags will solve a lot of grossness (e.g. much of codegen prepare) with a very clean model.

-Chris
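
The third case above (a re-extend of an already extended value) reduces to a simple check once the extension state of a live-out value is known. Below is a minimal sketch of that check, assuming 32-bit registers. The `ExtInfo` record and `isExtendRedundant` are hypothetical names for illustration; they are not LLVM's data structures and do not claim to reproduce the heuristic Chris refers to.

```cpp
#include <cassert>

// How the upper bits of a 32-bit register were filled when the value was
// defined (or how a use wants them filled).
enum class Ext { None, Zero, Sign };

struct ExtInfo {
  Ext kind;       // kind of extension
  unsigned bits;  // width of the narrow type, e.g. 8 or 16
};

// Returns true if applying `use` to a value whose def already satisfies
// `def` cannot change the register contents.
bool isExtendRedundant(ExtInfo def, ExtInfo use) {
  assert(use.kind != Ext::None && use.bits > 0 && use.bits < 32);
  if (def.kind == Ext::None)
    return false;  // nothing is known about the upper bits
  if (def.kind == use.kind)
    return def.bits <= use.bits;  // same extension from a width no larger
  // A value zero-extended from strictly fewer bits has a zero in the sign
  // position of the wider narrow type, so sign-extending it is a no-op.
  if (def.kind == Ext::Zero && use.kind == Ext::Sign)
    return def.bits < use.bits;
  return false;  // sign-extended def, zero-extending use: bits may differ
}
```

For example, a value defined by a zero-extending byte load would be described as `{Ext::Zero, 8}`, and a later zext from i8 on that value is redundant under this model.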

On Aug 18, 2010, at 9:56 AM, Chris Lattner wrote:

> On Aug 18, 2010, at 9:22 AM, Bob Wilson wrote:
>> I'm looking at llvm-generated ARM code that has some unnecessary UXTB (zero extend) instructions, and it seems to me that doing type legalization as an entirely local transformation is not the best approach.
>
> That's true, but doing isel as a purely local approach isn't the best either :-). We'd really like to get to whole-function selection dags at some point.
>
>> I'm thinking in particular about legalizing integer types that need to be promoted to the target register size, i.e., i8 and i16 for ARM promoting to i32.
>>
>> Currently we sign-extend or zero-extend values of these types at every place where they are defined and used. The definitions are no problem. Loads from memory can specify whether the value should be zero or sign extended, and arithmetic operations are going to produce 32-bit values regardless. Uses (in different basic blocks from the defs) are a different matter. We add explicit zero extend and sign extend operations for every use, despite the fact that the actual register values will have already been extended to 32 bits when they were defined.
>>
>> It seems like we ought to have a pass to globally promote such types to the target register size.
>>
>> Has anyone looked at this issue before? Is there already a solution in place that we just need to adopt for ARM? Thoughts on what to do otherwise?
>
> There are a couple of different tradeoffs you have to consider here. First, I'm going to assume that the defined value isn't already zero extended (so that the zexts on the uses aren't purely redundant). We have code that is supposed to eliminate the purely redundant ones, at least when the definition is emitted before the uses.
>
> Some things to consider: When the input to the zext is spilled, the reload can be folded into the zext on almost all targets, making the zext free. When the zext *isn't* folded into a load, what you're really looking for is a code placement pass which tries to put the zexts in non-redundant (and non-partially redundant) places.
>
> This sort of code placement pass could be done at the LLVM IR level (as a prelegalization like you mention), it could be done as a pre-regalloc machine pass, or as a post-regalloc machine pass.
>
> The right answer depends on what and how much you care about this. If you're seeing fully redundant zexts, then I'd look into why machinecse isn't picking this up. If you're seeing partially redundant cases, then machine sink is missing something. If you're seeing reextends of already extended values, then it sounds like the heuristic to track that the live-out vreg is extended isn't working.
>
> I tend to think that it isn't worth the compile time to try to microoptimize out every compare, but I could be convinced otherwise if there are important use cases we're failing to handle. I also do think that whole-function selection dags will solve a lot of grossness (e.g. much of codegen prepare) with a very clean model.

I'll take a look at Machine CSE and Machine Sink. Where is the heuristic for tracking live-out vregs that you mention? I'm definitely seeing a reextend of an already extended value. Worse, the value is spilled and the zext is not folded into the reload.

For ARM and possibly other RISC-like targets, you simply can't define an i8 or i16 value -- those aren't legal types. Since those values will always be extended at the point where they are defined, the code placement problem is straightforward: you always want to fold the extends into the def, as long as the value is always extended the same way (not mixed sign and zero extends). Whole function selection DAGs would make that easy.
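
The placement rule in the last paragraph (fold the extend into the def when every use extends the same way) can be sketched as a small decision function. This is an illustration under stated assumptions, not an existing LLVM pass: `ExtKind`, `ExtUse`, and `extendToFoldIntoDef` are hypothetical names, and the sketch assumes all uses of one narrow value have already been collected across the function.

```cpp
#include <cassert>
#include <optional>
#include <vector>

enum class ExtKind { Zero, Sign };

// One extending use of a narrow (i8 or i16) value.
struct ExtUse {
  ExtKind kind;
  unsigned fromBits;  // width being extended from
};

// Returns the extension to fold into the def, or std::nullopt when there
// are no extending uses or the uses mix sign and zero extension.
std::optional<ExtKind> extendToFoldIntoDef(unsigned narrowBits,
                                           const std::vector<ExtUse> &uses) {
  assert(narrowBits == 8 || narrowBits == 16);
  if (uses.empty())
    return std::nullopt;
  const ExtKind first = uses.front().kind;
  for (const ExtUse &u : uses) {
    assert(u.fromBits == narrowBits);  // all uses extend the same value
    if (u.kind != first)
      return std::nullopt;  // mixed extends: no single def-side choice
  }
  return first;
}
```

When the function returns a value, the def can produce the extended form once and the per-use extends become removable. When it returns `std::nullopt`, at least one explicit extend has to stay.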

On Aug 18, 2010, at 9:56 AM, Chris Lattner wrote:

> Some things to consider: When the input to the zext is spilled, the reload can be folded into the zext on almost all targets, making the zext free. When the zext *isn't* folded into a load, what you're really looking for is a code placement pass which tries to put the zexts in non-redundant (and non-partially redundant) places.

That makes sense to me, but note that this is not currently implemented. All our RISC-like targets only support folding of COPY to load/store through the target-independent mechanisms.

It should be fairly simple to add a foldMemoryOperandImpl override to ARM that folds load+zext into a zextload.

/jakob
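
To illustrate the mapping such a fold would rely on, here is a standalone toy model. It does not use or reproduce the real foldMemoryOperandImpl interface. The `Op` enum, `FoldedLoad`, and both functions are hypothetical, although the mnemonics are real ARM instructions. The sketch assumes the spilled value occupies a 4-byte stack slot, so the narrow load has to address the low bytes of that slot.

```cpp
#include <cassert>
#include <optional>

// Toy opcode set using real ARM mnemonics; this is not LLVM's opcode enum.
enum class Op { UXTB, UXTH, SXTB, SXTH, ADD, LDRB, LDRH, LDRSB, LDRSH };

struct FoldedLoad {
  Op load;              // extending load that replaces reload + extend
  unsigned byteOffset;  // offset of the narrow value inside the 4-byte slot
};

// Offset of the low `narrowBytes` bytes of a 32-bit word in memory.
unsigned lowBytesOffset(unsigned narrowBytes, bool bigEndian) {
  assert(narrowBytes == 1 || narrowBytes == 2);
  return bigEndian ? 4 - narrowBytes : 0;
}

// Given an extend whose source register was spilled, return the single
// load that reloads and extends at once, or std::nullopt if `ext` is not
// an extend instruction.
std::optional<FoldedLoad> foldExtendIntoReload(Op ext, bool bigEndian) {
  switch (ext) {
  case Op::UXTB: return FoldedLoad{Op::LDRB, lowBytesOffset(1, bigEndian)};
  case Op::UXTH: return FoldedLoad{Op::LDRH, lowBytesOffset(2, bigEndian)};
  case Op::SXTB: return FoldedLoad{Op::LDRSB, lowBytesOffset(1, bigEndian)};
  case Op::SXTH: return FoldedLoad{Op::LDRSH, lowBytesOffset(2, bigEndian)};
  default:       return std::nullopt;
  }
}
```

A real implementation would also have to check addressing-mode limits and that the extend's only input is the spilled register, which this sketch leaves out.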