LLVM (via clang) currently translates target intrinsics to generic IR
whenever it can. For example, on x86 it translates _mm_loadu_pd to a
simple load instruction with an alignment of 1. The backend is then
responsible for translating the load back to the corresponding
machine instruction.

The advantage of this is that it opens up such code to LLVM's
optimizers, which can theoretically speed it up.

The disadvantage is that it's pretty surprising when intrinsics
designed for the sole purpose of giving programmers access to specific
machine instructions are translated to something other than those
instructions. LLVM's optimizers aren't perfect, and there are many
aspects of performance which they don't understand, so they can also
pessimize code.

If the user has gone through the trouble of using target-specific
intrinsics to ask for a specific sequence of machine instructions,
is it really appropriate for the compiler to emit different
instructions, using its own heuristics?

Dan
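For concreteness, a minimal sketch of the lowering Dan is describing,
assuming clang's usual implementation of the SSE2 headers (the exact IR
differs between versions):

    #include <emmintrin.h>

    /* The intrinsic asks for an unaligned 128-bit load (movupd). */
    __m128d load2(const double *p) {
      return _mm_loadu_pd(p);
    }

    /* Instead of an x86-specific intrinsic call, clang emits a generic
       IR load with alignment 1, roughly:
           %v = load <2 x double>* %p, align 1
       and the x86 backend is then expected to select movupd (or an
       equivalent unaligned load) for it again. */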
On Nov 14, 2011, at 3:01 PM, Dan Gohman wrote:

> LLVM (via clang) currently translates target intrinsics to generic IR
> whenever it can. For example, on x86 it translates _mm_loadu_pd to a
> simple load instruction with an alignment of 1. The backend is then
> responsible for translating the load back to the corresponding
> machine instruction.
>
> The advantage of this is that it opens up such code to LLVM's
> optimizers, which can theoretically speed it up.
>
> The disadvantage is that it's pretty surprising when intrinsics
> designed for the sole purpose of giving programmers access to specific
> machine instructions are translated to something other than those
> instructions. LLVM's optimizers aren't perfect, and there are many
> aspects of performance which they don't understand, so they can also
> pessimize code.
>
> If the user has gone through the trouble of using target-specific
> intrinsics to ask for a specific sequence of machine instructions,
> is it really appropriate for the compiler to emit different
> instructions, using its own heuristics?

There are several benefits to doing it this way:

1. Fewer intrinsics in the compiler, fewer patterns in the targets,
   less redundancy.

2. The compiler should know better than the user, because code is often
   written and forgotten about. The compiler can add value when building
   hand-tuned and highly optimized SSE2 code for an SSE4 chip, for
   example.

3. If the compiler is pessimizing (e.g.) unaligned loads, then that is a
   serious bug that should be fixed, not something that should be worked
   around by adding intrinsics. Adding intrinsics just makes it much
   less likely that we'd find out about it and then be able to fix it.

4. In practice, if we had intrinsics for everything, I strongly suspect
   that a lot of generic patterns wouldn't get written. This would
   pessimize "portable" code using standard IR constructs.

-Chris
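A hedged illustration of point 2 (this example is mine, not from the
thread): because the unaligned load becomes ordinary IR, hand-written
SSE2 code can pick up better encodings or folding when retargeted.

    #include <emmintrin.h>

    /* Hand-tuned "SSE2" code: unaligned load followed by an add. */
    __m128d add2(const double *p, __m128d x) {
      return _mm_add_pd(_mm_loadu_pd(p), x);
    }

    /* Built for plain SSE2 this is movupd + addpd, since SSE memory
       operands must be 16-byte aligned. Built with -mavx, the generic
       "load ... align 1" can instead be folded into the add as an
       unaligned memory operand (e.g. vaddpd (%rdi), %xmm0, %xmm0),
       something that would be harder to express if the load stayed
       behind an opaque target intrinsic. */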
Hi Dan,

On 15/11/11 00:01, Dan Gohman wrote:
> LLVM (via clang) currently translates target intrinsics to generic IR
> whenever it can. For example, on x86 it translates _mm_loadu_pd to a
> simple load instruction with an alignment of 1. The backend is then
> responsible for translating the load back to the corresponding
> machine instruction.
>
> The advantage of this is that it opens up such code to LLVM's
> optimizers, which can theoretically speed it up.
>
> The disadvantage is that it's pretty surprising when intrinsics
> designed for the sole purpose of giving programmers access to specific
> machine instructions are translated to something other than those
> instructions.

gcc only supports a limited set of vector expressions. If you want to
shuffle a vector, how do you do that? The only way (AFAIK) is to use a
target intrinsic. Thus people can end up using target intrinsics because
it's the only way they have to express vector operations, not because
they absolutely want that particular instruction.

> LLVM's optimizers aren't perfect, and there are many
> aspects of performance which they don't understand, so they can also
> pessimize code.

Such cases should be improved. They would never be noticed if everyone
were using target intrinsics rather than generic IR.

> If the user has gone through the trouble of using target-specific
> intrinsics to ask for a specific sequence of machine instructions,
> is it really appropriate for the compiler to emit different
> instructions, using its own heuristics?

This same question might come up in the future with inline asm. Thanks
to the MC project, I guess it may become feasible to parse people's
inline asm and do optimizations on it. Personally I'm in favour of that,
but indeed there are dangers.

Ciao,

Duncan.
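A small sketch of the situation Duncan describes (hypothetical code,
assuming clang's usual <emmintrin.h> implementation):

    #include <emmintrin.h>

    /* The programmer only wants to swap the two lanes, but with gcc's
       limited vector support the only way to spell it is a
       target-specific SSE2 shuffle intrinsic. */
    __m128d swap_lanes(__m128d v) {
      return _mm_shuffle_pd(v, v, 1);
    }

    /* clang's header implements _mm_shuffle_pd in terms of
       __builtin_shufflevector, so the IR ends up as a generic
       shufflevector that swaps the lanes rather than an x86-specific
       intrinsic call. */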
On Mon, 2011-11-14 at 15:41 -0800, Chris Lattner wrote:
> On Nov 14, 2011, at 3:01 PM, Dan Gohman wrote:
> > LLVM (via clang) currently translates target intrinsics to generic IR
> > whenever it can. For example, on x86 it translates _mm_loadu_pd to a
> > simple load instruction with an alignment of 1. The backend is then
> > responsible for translating the load back to the corresponding
> > machine instruction.
> >
> > The advantage of this is that it opens up such code to LLVM's
> > optimizers, which can theoretically speed it up.
> >
> > The disadvantage is that it's pretty surprising when intrinsics
> > designed for the sole purpose of giving programmers access to specific
> > machine instructions are translated to something other than those
> > instructions. LLVM's optimizers aren't perfect, and there are many
> > aspects of performance which they don't understand, so they can also
> > pessimize code.
> >
> > If the user has gone through the trouble of using target-specific
> > intrinsics to ask for a specific sequence of machine instructions,
> > is it really appropriate for the compiler to emit different
> > instructions, using its own heuristics?

In my personal opinion, this should be controlled via a compiler option.
The default should be to emit the instructions as specified. This should
at least be true at low optimization levels.

> There are several benefits to doing it this way:
>
> 1. Fewer intrinsics in the compiler, fewer patterns in the targets,
>    less redundancy.

I don't view limiting the number of intrinsics in LLVM as a worthwhile
goal unto itself. The fact that specifying intrinsics is currently a
fairly verbose procedure (requiring updates in several different files)
is something that we should fix via a more intelligent tablegen setup.

> 2. The compiler should know better than the user, because code is often
>    written and forgotten about. The compiler can add value when building
>    hand-tuned and highly optimized SSE2 code for an SSE4 chip, for
>    example.

This is a good use case for '-O4': it means that I've asked for
something specific, but the compiler may do something else instead. I
think that '-O3' (and below) should make what I've specified as fast as
possible. Since specifying '-O3' is a fairly standard default choice, I
think it should provide the safer behavior.

> 3. If the compiler is pessimizing (e.g.) unaligned loads, then that is a
>    serious bug that should be fixed, not something that should be worked
>    around by adding intrinsics. Adding intrinsics just makes it much
>    less likely that we'd find out about it and then be able to fix it.
>
> 4. In practice, if we had intrinsics for everything, I strongly suspect
>    that a lot of generic patterns wouldn't get written. This would
>    pessimize "portable" code using standard IR constructs.

We should work on providing a comprehensive set of generic vector
builtins (similar to __builtin_shuffle) to cover other cases that can be
represented in the IR directly (either as-is, or suitably extended).
This has the benefit of working across many different architectures, and
we could make sure that patterns are written to support these generic
builtins.

 -Hal

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
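For reference, a sketch of the kind of generic vector builtin Hal has in
mind, written here with clang's __builtin_shufflevector (gcc's
__builtin_shuffle is similar but takes the mask as a vector); the
example is illustrative, not a proposal from the thread:

    /* A target-independent lane swap using vector extensions. It maps
       directly onto an IR shufflevector, so any backend can pattern-
       match it without needing an x86-specific intrinsic. */
    typedef double v2df __attribute__((vector_size(16)));

    static inline v2df swap_lanes(v2df v) {
      return __builtin_shufflevector(v, v, 1, 0);
    }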