David Chisnall
2014-Jan-11 14:13 UTC
[LLVMdev] Memcpy expansion: InstCombine vs SelectionDAG
Hi all, We currently have code in InstCombine that tries to expand memcpy / memmove intrinsics that are copying something that fits in a single register to a load and store in the IR. We then have other code in SelectionDAG that expands small memcpy intrinsics to sequences of loads and stores. The InstCombine one is useful, as it exposes more optimisation opportunities, but unfortunately it is not semantically equivalent to the SelectionDAG expansion. We're seeing this in our back end where a 32-bit aligned 64-bit memcpy is generates efficient code when expanded near the back end (2 loads, 2 stores), but quite suboptimal code when the load is a result of InstCombine expansion. In the second case, it becomes two loads, a shift and an or (to construct the 64-bit value) and then a This simple program shows the problem: struct A { int a, b; } global; int mv(const struct A *b) { global = *b; return 1; } Compiled with clang at -O1 or above, the function becomes: define i32 @mv(%struct.A* nocapture readonly %b) #0 { entry: %0 = bitcast %struct.A* %b to i64* %1 = load i64* %0, align 4 store i64 %1, i64* bitcast (%struct.A* @global to i64*), align 4 ret i32 1 } The back end now pattern matches on the load, but doesn't know that the load is only ever used for a corresponding store. I can think of several ways of fixing this, but the ideal one is probably for InstCombine to know whether unaligned loads and stores are expensive for this target and emit a short sequence of loads and stores at the desired alignment if so. This code in InstCombine could actually benefit a lot from some target knowledge, as this transform may be applicable on some systems to vector or other types (for example, I bet there are a lot of C++ structures that are copied around a lot that would fit happily in an AVX register). As a small extension, if the operation can be more efficient if the alignment is better, and either the source or destination is on the stack, then it may make sense to increase the alignment of the alloca. A somewhat hacky alternative that would address this case would be to have a codegen prepare pass that would try to turn load-store sequences back into memcopy / memmove intrinsics (although the aliasing information present in the decision to use memmove may be lost from the load / store sequence), which we can then expand more efficiently in SelectionDAG. Do we have a sensible way of exposing this kind of target-specific information to optimisers other than the vector cost model? David
Quentin Colombet
2014-Jan-14 20:42 UTC
[LLVMdev] Memcpy expansion: InstCombine vs SelectionDAG
Hi David, On Jan 11, 2014, at 6:13 AM, David Chisnall <David.Chisnall at cl.cam.ac.uk> wrote:> Hi all, > > We currently have code in InstCombine that tries to expand memcpy / memmove intrinsics that are copying something that fits in a single register to a load and store in the IR. We then have other code in SelectionDAG that expands small memcpy intrinsics to sequences of loads and stores. The InstCombine one is useful, as it exposes more optimisation opportunities, but unfortunately it is not semantically equivalent to the SelectionDAG expansion.Could you be more specific? In particular, what is the problem?> > We're seeing this in our back end where a 32-bit aligned 64-bit memcpy is generates efficient code when expanded near the back end (2 loads, 2 stores), but quite suboptimal code when the load is a result of InstCombine expansion. In the second case, it becomes two loads, a shift and an or (to construct the 64-bit value) and then aWe are missing the interesting bits here :). What is next? A 64-bit store?> > This simple program shows the problem: > > struct A { > int a, b; > } global; > > int mv(const struct A *b) > { > global = *b; > return 1; > } > > Compiled with clang at -O1 or above, the function becomes: > > define i32 @mv(%struct.A* nocapture readonly %b) #0 { > entry: > %0 = bitcast %struct.A* %b to i64* > %1 = load i64* %0, align 4 > store i64 %1, i64* bitcast (%struct.A* @global to i64*), align 4 > ret i32 1 > } > > The back end now pattern matches on the load, but doesn't know that the load is only ever used for a corresponding store.I am not sure I follow what you are saying. In this example, the load (%1) is used only once and in the same basic block. You should have every information you need in SelectionDAG. That said, as you may know, SelectionDAG is a bottom-up approach. Thus, store is matched before the load gets a change to be matched. It is hard to give direction on how to fix that since we do not know anything about the target. In particular, is i64 a legal type or not? Based on your problem description, it is not clear since we are missing what the 2 loads are feeding but it sounds like they are feeding a 64-bit store. In short, what is the best lowering you would expect for your target? 1. 2 loads - 2 store? 2. A memcpy intrinsic? For #1, I am wondering if we could extend the slicing that takes place during DAG combine (look for SliceUpLoad in DAGCombiner) to slice the store part too if that is the problem.> I can think of several ways of fixing this, but the ideal one is probably for InstCombine to know whether unaligned loads and stores are expensive for this target and emit a short sequence of loads and stores at the desired alignment if so. This code in InstCombine could actually benefit a lot from some target knowledge, as this transform may be applicable on some systems to vector or other types (for example, I bet there are a lot of C++ structures that are copied around a lot that would fit happily in an AVX register).As far as I can tell, InstCombine is target independent and its purpose is to produce a canonical form. In that case, memcpy/memmove of chunk that are less that 8 bytes are canonicalized to 64-bit load/store.> As a small extension, if the operation can be more efficient if the alignment is better, and either the source or destination is on the stack, then it may make sense to increase the alignment of the alloca. > > A somewhat hacky alternative that would address this case would be to have a codegen prepare pass that would try to turn load-store sequences back into memcopy / memmove intrinsics (although the aliasing information present in the decision to use memmove may be lost from the load / store sequence), which we can then expand more efficiently in SelectionDAG. > > Do we have a sensible way of exposing this kind of target-specific information to optimisers other than the vector cost model?We usually use the TargetLowering class, you can look into the CodeGenPrepare pass :). Hope that helps. -Quentin> > David > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
David Chisnall
2014-Jan-14 20:54 UTC
[LLVMdev] Memcpy expansion: InstCombine vs SelectionDAG
On 14 Jan 2014, at 12:42, Quentin Colombet <qcolombet at apple.com> wrote:> Hi David, > > On Jan 11, 2014, at 6:13 AM, David Chisnall <David.Chisnall at cl.cam.ac.uk> wrote: > >> Hi all, >> >> We currently have code in InstCombine that tries to expand memcpy / memmove intrinsics that are copying something that fits in a single register to a load and store in the IR. We then have other code in SelectionDAG that expands small memcpy intrinsics to sequences of loads and stores. The InstCombine one is useful, as it exposes more optimisation opportunities, but unfortunately it is not semantically equivalent to the SelectionDAG expansion. > Could you be more specific? > In particular, what is the problem?That, for the reasons outlined in my mail, we generate suboptimal code for small structure copies on architectures where unaligned loads and stores are expensive.>> We're seeing this in our back end where a 32-bit aligned 64-bit memcpy is generates efficient code when expanded near the back end (2 loads, 2 stores), but quite suboptimal code when the load is a result of InstCombine expansion. In the second case, it becomes two loads, a shift and an or (to construct the 64-bit value) and then a > We are missing the interesting bits here :). > What is next? > A 64-bit store?Then two 32-bit stores (because unaligned 64-bit stores are not permitted).> This simple program shows the problem: >> >> struct A { >> int a, b; >> } global; >> >> int mv(const struct A *b) >> { >> global = *b; >> return 1; >> } >> >> Compiled with clang at -O1 or above, the function becomes: >> >> define i32 @mv(%struct.A* nocapture readonly %b) #0 { >> entry: >> %0 = bitcast %struct.A* %b to i64* >> %1 = load i64* %0, align 4 >> store i64 %1, i64* bitcast (%struct.A* @global to i64*), align 4 >> ret i32 1 >> } >> >> The back end now pattern matches on the load, but doesn't know that the load is only ever used for a corresponding store. > I am not sure I follow what you are saying. > In this example, the load (%1) is used only once and in the same basic block. You should have every information you need in SelectionDAG.Yes, I'd expect SelectionDAG should be able to do this.> That said, as you may know, SelectionDAG is a bottom-up approach. Thus, store is matched before the load gets a change to be matched. > > It is hard to give direction on how to fix that since we do not know anything about the target. In particular, is i64 a legal type or not?Yes, i64 is a legal type, but unaligned loads and stores are not supported by the architecture and so SelectionDAG will expand them to a more complex sequence.> Based on your problem description, it is not clear since we are missing what the 2 loads are feeding but it sounds like they are feeding a 64-bit store. > > In short, what is the best lowering you would expect for your target? > 1. 2 loads - 2 store? > 2. A memcpy intrinsic?A memcpy intrinsic will expand to 2 loads, 2 stores, if the alignment is less than 8, so they are equivalent.> For #1, I am wondering if we could extend the slicing that takes place during DAG combine (look for SliceUpLoad in DAGCombiner) to slice the store part too if that is the problem. > >> I can think of several ways of fixing this, but the ideal one is probably for InstCombine to know whether unaligned loads and stores are expensive for this target and emit a short sequence of loads and stores at the desired alignment if so. This code in InstCombine could actually benefit a lot from some target knowledge, as this transform may be applicable on some systems to vector or other types (for example, I bet there are a lot of C++ structures that are copied around a lot that would fit happily in an AVX register). > As far as I can tell, InstCombine is target independent and its purpose is to produce a canonical form. In that case, memcpy/memmove of chunk that are less that 8 bytes are canonicalized to 64-bit load/store.The canonical form, however, is inefficient for targets where unaligned 64-bit stores are expensive. The same applies to unaligned memory operations of other widths. SelectionDAG already has sensible code for handling unaligned memcpy intrinsics, delegating to the target the choice of the most efficient set of loads and stores. If InstCombine is replacing the memcpy with a load + store that is wider than something that the target can handle for the desired alignment, then this is a problem, because we then lack information about how to sensibly lower it.>> As a small extension, if the operation can be more efficient if the alignment is better, and either the source or destination is on the stack, then it may make sense to increase the alignment of the alloca. >> >> A somewhat hacky alternative that would address this case would be to have a codegen prepare pass that would try to turn load-store sequences back into memcopy / memmove intrinsics (although the aliasing information present in the decision to use memmove may be lost from the load / store sequence), which we can then expand more efficiently in SelectionDAG. >> >> Do we have a sensible way of exposing this kind of target-specific information to optimisers other than the vector cost model? > We usually use the TargetLowering class, you can look into the CodeGenPrepare pass :).Unless I miss something, TargetLowering is not available to InstCombine. David
Possibly Parallel Threads
- [LLVMdev] Memset/memcpy: user control of loop-idiom recognizer
- [RFC] Generalize llvm.memcpy / llvm.memmove intrinsics.
- [LLVMdev] Memset/memcpy: user control of loop-idiom recognizer
- [RFC] Generalize llvm.memcpy / llvm.memmove intrinsics.
- [RFC] Generalize llvm.memcpy / llvm.memmove intrinsics.