Renato Golin via llvm-dev
2016-Feb-08  15:41 UTC
[llvm-dev] Vectorization with fast-math on irregular ISA sub-sets
Folks, I'm now looking at https://llvm.org/bugs/show_bug.cgi?id=16274, which seems to have some support in the vectorizer, but not as we need for this particular case. I may have missed something obvious, please let me know if there is a better way. As you already know, ARM has two FP instruction sets: VFP and NEON. VFP applies to single FP registers while NEON is a full SIMD. The problem is that NEON is not IEEE compliant on FP operations, while VFP is. Even if the target has NEON and the user has asked for it to be used, without -ffast-math and related arguments, we simply can't produce NEON instructions for FP operations. Different operations may have different non-compliance (inf, denormals, etc) and I haven't yet investigated the full support, but it's safe to start from blocking *all* FP operations on NEON when *any* FP restrictions are in place. We can expand for better support later, when the infrastructure is in place. As far as I could see, ffast-math is included in the vectorizer, but as an all-or-nothing, which is not what we want to do. So, I thought about two ways we could go about doing this: 1. The pragmatic way Add a cost "TCC_Impossible = AllOnes" to TCC and on ARM's cost model, check if fast-math is checked on FP ALU operations and return that if false. So, VFP costs would be less than NEON costs divided by their widths. This would make any vectorization beyond VFP instructions impossible is fast-math is not chosen, while still using VFP instructions in the loop, making it slightly faster. I'm sceptical to introducing the TCC_Impossible cost, as it seems a dirty trick. I'm open to other better solutions. 2. The thorough way Add a flag on TableGen on vector instructions meaning IEEE compliance for the different levels of support. Add a "fall-back" VFP instruction to each of them (either in TableGen or TTI). In the vectorizer, on FP ALU cost, add a check on fast-math && IEEE conformance. If failed, check the fall-back instruction's width and add the cost as that * Width/FallBackWidth. In the back-end, when emitting vector instructions, add the same check and emit (unroll) the NEON instructions into similar VFP ones, by checking it's fall-back instruction. This approach has the benefit of validating IEEE compliance at the instruction level, thus working for any other "vectorizer" out there, including out-of-tree ones (though this benefit is very limited). But it also can change code that it shouldn't, like inline asm or intrinsics. I have no solution to this particular problem. Any thoughts? cheers, --renato
James Molloy via llvm-dev
2016-Feb-08  16:33 UTC
[llvm-dev] Vectorization with fast-math on irregular ISA sub-sets
Hi Renato, I think it’s important to distinguish between the loop vectorizer and already-existing vector IR + the SLP vectorizer here. The loop vectorizer does indeed require -ffast-math, but the IEEE-nonconformant transforms it does are far greater than using an ISA which may FTZ. It needs -ffast-math because any FP reductions necessarily have their execution order shuffled, due to executing some of them in parallel and reducing to scalar at the end. Therefore the LV doesn’t need to be changed - it will only work when “fast” is given and will only emit “fast” vector instructions. The SLP vectoriser however should theoretically take non-fast scalars and produce non-fast vectors. Similarly people will hand-write vector IR, or generate it from other frontends. Because of this, I think it’s important that we shouldn’t change the semantics of the IR currently. Making vector IR targeting ARM produce scalar instructions unless a modifier is given will undoubtedly cause problems down the line with frontends being out of sync or not being updated. Even worse, the symptom of this would just be “LLVM produces poor code for ARM” / “LLVM’s vector codegen is terrible for ARM” - performance errata and not conformance. That’s why I think changing to a full-strict-by-default approach would be bad for the project. It would also violate the principle of least surprise - I wrote vector instructions and picked a vector ISA… but they’re being scalarized? My experience is that the number of people who care about pull IEEE compatibility on ARMv7 hardware is limited, and the set of people who care about exact ULP constraints even more limited. I think we absolutely should make a solution that solves PR16274, but I think it would have to be opt-in, not opt-out. James> On 8 Feb 2016, at 15:41, Renato Golin <renato.golin at linaro.org> wrote: > > Folks, > > I'm now looking at https://llvm.org/bugs/show_bug.cgi?id=16274, which > seems to have some support in the vectorizer, but not as we need for > this particular case. I may have missed something obvious, please let > me know if there is a better way. > > As you already know, ARM has two FP instruction sets: VFP and NEON. > VFP applies to single FP registers while NEON is a full SIMD. The > problem is that NEON is not IEEE compliant on FP operations, while VFP > is. > > Even if the target has NEON and the user has asked for it to be used, > without -ffast-math and related arguments, we simply can't produce > NEON instructions for FP operations. Different operations may have > different non-compliance (inf, denormals, etc) and I haven't yet > investigated the full support, but it's safe to start from blocking > *all* FP operations on NEON when *any* FP restrictions are in place. > We can expand for better support later, when the infrastructure is in > place. > > As far as I could see, ffast-math is included in the vectorizer, but > as an all-or-nothing, which is not what we want to do. So, I thought > about two ways we could go about doing this: > > > 1. The pragmatic way > > Add a cost "TCC_Impossible = AllOnes" to TCC and on ARM's cost model, > check if fast-math is checked on FP ALU operations and return that if > false. So, VFP costs would be less than NEON costs divided by their > widths. > > This would make any vectorization beyond VFP instructions impossible > is fast-math is not chosen, while still using VFP instructions in the > loop, making it slightly faster. > > I'm sceptical to introducing the TCC_Impossible cost, as it seems a > dirty trick. I'm open to other better solutions. > > > 2. The thorough way > > Add a flag on TableGen on vector instructions meaning IEEE compliance > for the different levels of support. Add a "fall-back" VFP instruction > to each of them (either in TableGen or TTI). > > In the vectorizer, on FP ALU cost, add a check on fast-math && IEEE > conformance. If failed, check the fall-back instruction's width and > add the cost as that * Width/FallBackWidth. > > In the back-end, when emitting vector instructions, add the same check > and emit (unroll) the NEON instructions into similar VFP ones, by > checking it's fall-back instruction. > > This approach has the benefit of validating IEEE compliance at the > instruction level, thus working for any other "vectorizer" out there, > including out-of-tree ones (though this benefit is very limited). > > But it also can change code that it shouldn't, like inline asm or > intrinsics. I have no solution to this particular problem. > > Any thoughts? > > cheers, > --renato >
Renato Golin via llvm-dev
2016-Feb-08  19:15 UTC
[llvm-dev] Vectorization with fast-math on irregular ISA sub-sets
On 8 February 2016 at 16:33, James Molloy <James.Molloy at arm.com> wrote:> The loop vectorizer does indeed require -ffast-math, but the IEEE-nonconformant transforms it does are far greater than using an ISA which may FTZ. It needs -ffast-math because any FP reductions necessarily have their execution order shuffled, due to executing some of them in parallel and reducing to scalar at the end. Therefore the LV doesn’t need to be changed - it will only work when “fast” is given and will only emit “fast” vector instructions.Good point. This seems to be a much more rigorous definition in the new 2008 standard. Right now, the loop vectorizer produces vector code without -ffast-math. Are you saying we should disable it altogether for all architectures that claim to follow the new standard? Inner loops can be "vectorized" by SLP using only VFP instructions. The implementation seem to have moved to Inst->hasUnsafeAlgebra(), so we may need to return false in the legalization phase if the flag is omitted and any instruction has unsafe algebra.> The SLP vectoriser however should theoretically take non-fast scalars and produce non-fast vectors. Similarly people will hand-write vector IR, or generate it from other frontends.We can't guarantee the semantics of the unsafe-math flag in any IR that was not generated by a front-end which knows about it. So, it follows that we'll stop vectorizing their basic blocks, and there could be some outcry. We need some general consensus if that's what people want. I don't think we do.> Because of this, I think it’s important that we shouldn’t change the semantics of the IR currently. Making vector IR targeting ARM produce scalar instructions unless a modifier is given will undoubtedly cause problems down the line with frontends being out of sync or not being updated. Even worse, the symptom of this would just be “LLVM produces poor code for ARM” / “LLVM’s vector codegen is terrible for ARM” - performance errata and not conformance. That’s why I think changing to a full-strict-by-default approach would be bad for the project. > It would also violate the principle of least surprise - I wrote vector instructions and picked a vector ISA… but they’re being scalarized?Right, this is opposing to marking an instruction with unsafe by default (ie my second option). If that's so, I agree with you that it's not trivial and may create more problems than it solves. Hand written IR, inline ASM and intrinsics should remain for what they are. So 16274 is probably a "won't fix"?> My experience is that the number of people who care about pull IEEE compatibility on ARMv7 hardware is limited, and the set of people who care about exact ULP constraints even more limited. I think we absolutely should make a solution that solves PR16274, but I think it would have to be opt-in, not opt-out.And I'm guessing this is related to SLP and others. If so, I agree. So, For 16275, the fix is to disable loop vect. for no-fast-math + hasUnsafeAlgebra. For 16274, disabling NEON emission in SLP would be one way, but we must avoid any fiddling with inline asm and intrinsics, so I don't think we should be doing that in any generic way. Certainly not related to the example, from IR to instruction. Makes sense? --renato