So this change did indeed have an effect! :) I’m seeing regressions in a number of benchmarks mainly due to a host of extra bitcasts that get introduced. Here’s the problem I’m seeing in a nutshell: 1) There is a Phi with input type double 2) Polly demotes the phi into a load/store of type double 3) InstCombine canonicalizes the load/store to use i64 instead of double 4) SROA removes the load/store & inserts a phi back in, using i64 as the type. Inserts bitcast to get to double. 5) The bitcast sticks around and eventually get translated into FMOVs (for AArch64 at least). The function findCommonType() in SROA.cpp is used to obtain the type that should be used for the new alloca that SROA wants to create. It’s decision process is essentially – if all loads/stores of alloca are the same, use that type; else use the corresponding integer type. This causes bitcasts to be inserted in a number of places, most all of which stick around. I’ve copied a reduced version of an instance of the problem below. I’m looking for comments on what others think is the right solution here. Make SROA more intelligent about picking the type? The code is below with all unnecessary code removed for easy consumption. Daniel Before Polly – Prepare code for polly we have code that looks like: while.cond473: ; preds = %while.cond473.outer78, %while.body475 %p_j_x452.0 = phi double [ %105, %while.body475 ], [ %p_j_x452.0.ph82, %while.cond473.outer78 ] while.body475: ; preds = %while.cond473 %sub480 = fsub fast double %64, %p_j_x452.0 %105 = load double* %x485, align 8, !tbaa !25 After Polly – Prepare code for polly we have: while.cond473: ; preds = %while.cond473.outer78, %while.body475 %p_j_x452.0.reload = load double* %p_j_x452.0.reg2mem while.body475: ; preds = %while.cond473 %sub480 = fsub fast double %64, %p_j_x452.0.reload %110 = load double* %x485, align 8, !tbaa !25 store double %110, double* %p_j_x452.0.reg2mem After Combine redundant instructions : while.cond473: ; preds = %while.cond473.outer78, %while.body475 %p_j_x452.0.reload = load double* %p_j_x452.0.reg2mem, align 8 while.body475: ; preds = %while.cond473 %sub480 = fsub fast double %74, %p_j_x452.0.reload %x485 = getelementptr inbounds %struct.CompAtom* %15, i64 %idxprom482, i32 0, i32 0 %194 = bitcast double* %x485 to i64* %195 = load i64* %194, align 8, !tbaa !25 %200 = bitcast double* %p_j_x452.0.reg2mem to i64* store i64 %195, i64* %200, align 8 After SROA : while.cond473: ; preds = %while.cond473.outer78, %while.body475 %p_j_x452.0.reg2mem.sroa.0.0.p_j_x452.0.reload362 = phi i64 [ %p_j_x452.0.ph73.reg2mem.sroa.0.0.load368, %while.cond473.outer78 ], [ %178, %while.body475 ] %173 = bitcast i64 %p_j_x452.0.reg2mem.sroa.0.0.p_j_x452.0.reload362 to double while.body475: ; preds = %while.cond473 %sub480 = fsub fast double %78, %173 %x485 = getelementptr inbounds %struct.CompAtom* %15, i64 %idxprom482, i32 0, i32 0 %177 = bitcast double* %x485 to i64* %178 = load i64* %177, align 8, !tbaa !25 From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Chandler Carruth Sent: Wednesday, January 21, 2015 8:32 PM To: Pete Cooper Cc: LLVM Developers Mailing List Subject: Re: [LLVMdev] RFC: Missing canonicalization in LLVM On Wed, Jan 21, 2015 at 3:06 PM, Pete Cooper <peter_cooper at apple.com <mailto:peter_cooper at apple.com> > wrote: Sounds good to me. Integers it is then. FYI, thanks, I'm just going to commit this then. It seems we're all in essential agreement. We can revert it and take a more cautious approach if something terrible happens. =] -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150421/efc76068/attachment.html>
Hi Daniel Thanks for the excellent breakdown of whats going on here. Earlier in the thread on this I made this comment: "The first thing that springs to mind is that I don’t trust the backend to get this right. I don’t think it will understand when an i32 load/store would have been preferable to a float one or vice versa. I have no evidence of this, but given how strongly typed tablegen is, I don’t think it can make a good choice here. So I think we probably need to teach the backend how to undo whatever canonical form we choose if it has a reason to” Without seeing the machine instructions, its hard to be 100% certain, but the case you’ve found may be simple enough that the backend can actually fix this. However, such a fixup would be quite target specific (such a target would need different register classes for integers and doubles in this case), and we’d need such a pass for all targets which isn’t ideal. So i wouldn’t rule out a backend solution, but i have a preference for your suggestion to improve SROA. In this particular case, it makes sense for SROA to do effectively the same analysis InstCombine did here and work out when a load is just raw data vs when its data is used as a specific type. The relevant piece of InstCombine is this: // Try to canonicalize loads which are only ever stored to operate over // integers instead of any other type. We only do this when the loaded type // is sized and has a size exactly the same as its store size and the store // size is a legal integer type. if (!Ty->isIntegerTy() && Ty->isSized() && DL.isLegalInteger(DL.getTypeStoreSizeInBits(Ty)) && DL.getTypeStoreSizeInBits(Ty) == DL.getTypeSizeInBits(Ty)) { if (std::all_of(LI.user_begin(), LI.user_end(), [&LI](User *U) { auto *SI = dyn_cast<StoreInst>(U); return SI && SI->getPointerOperand() != &LI; })) { ... After ignoring load/stores which satisfy something like the above code, you can always fallback to the current code of choosing an integer type, so in the common case there won’t be any behavior difference. Cheers Pete> On Apr 21, 2015, at 9:18 AM, Daniel Stewart <stewartd at codeaurora.org> wrote: > > So this change did indeed have an effect! J > > I’m seeing regressions in a number of benchmarks mainly due to a host of extra bitcasts that get introduced. Here’s the problem I’m seeing in a nutshell: > > 1) There is a Phi with input type double > 2) Polly demotes the phi into a load/store of type double > 3) InstCombine canonicalizes the load/store to use i64 instead of double > 4) SROA removes the load/store & inserts a phi back in, using i64 as the type. Inserts bitcast to get to double. > 5) The bitcast sticks around and eventually get translated into FMOVs (for AArch64 at least). > > The function findCommonType() in SROA.cpp is used to obtain the type that should be used for the new alloca that SROA wants to create. It’s decision process is essentially – if all loads/stores of alloca are the same, use that type; else use the corresponding integer type. This causes bitcasts to be inserted in a number of places, most all of which stick around. > > I’ve copied a reduced version of an instance of the problem below. I’m looking for comments on what others think is the right solution here. Make SROA more intelligent about picking the type? > > The code is below with all unnecessary code removed for easy consumption. > > Daniel > Before Polly – Prepare code for polly we have code that looks like: > > while.cond473: ; preds = %while.cond473.outer78, %while.body475 > %p_j_x452.0 = phi double [ %105, %while.body475 ], [ %p_j_x452.0.ph82, %while.cond473.outer78 ] > > while.body475: ; preds = %while.cond473 > %sub480 = fsub fast double %64, %p_j_x452.0 > %105 = load double* %x485, align 8, !tbaa !25 > > After Polly – Prepare code for polly we have: > > while.cond473: ; preds = %while.cond473.outer78, %while.body475 > %p_j_x452.0.reload = load double* %p_j_x452.0.reg2mem > > while.body475: ; preds = %while.cond473 > %sub480 = fsub fast double %64, %p_j_x452.0.reload > %110 = load double* %x485, align 8, !tbaa !25 > store double %110, double* %p_j_x452.0.reg2mem > > After Combine redundant instructions : > > while.cond473: ; preds = %while.cond473.outer78, %while.body475 > %p_j_x452.0.reload = load double* %p_j_x452.0.reg2mem, align 8 > > while.body475: ; preds = %while.cond473 > %sub480 = fsub fast double %74, %p_j_x452.0.reload > %x485 = getelementptr inbounds %struct.CompAtom* %15, i64 %idxprom482, i32 0, i32 0 > %194 = bitcast double* %x485 to i64* > %195 = load i64* %194, align 8, !tbaa !25 > %200 = bitcast double* %p_j_x452.0.reg2mem to i64* > store i64 %195, i64* %200, align 8 > > After SROA : > > while.cond473: ; preds = %while.cond473.outer78, %while.body475 > %p_j_x452.0.reg2mem.sroa.0.0.p_j_x452.0.reload362 = phi i64 [ %p_j_x452.0.ph73.reg2mem.sroa.0.0.load368, %while.cond473.outer78 ], [ %178, %while.body475 ] > %173 = bitcast i64 %p_j_x452.0.reg2mem.sroa.0.0.p_j_x452.0.reload362 to double > > while.body475: ; preds = %while.cond473 > %sub480 = fsub fast double %78, %173 > %x485 = getelementptr inbounds %struct.CompAtom* %15, i64 %idxprom482, i32 0, i32 0 > %177 = bitcast double* %x485 to i64* > %178 = load i64* %177, align 8, !tbaa !25 > > > From: llvmdev-bounces at cs.uiuc.edu <mailto:llvmdev-bounces at cs.uiuc.edu> [mailto:llvmdev-bounces at cs.uiuc.edu <mailto:llvmdev-bounces at cs.uiuc.edu>] On Behalf Of Chandler Carruth > Sent: Wednesday, January 21, 2015 8:32 PM > To: Pete Cooper > Cc: LLVM Developers Mailing List > Subject: Re: [LLVMdev] RFC: Missing canonicalization in LLVM > > > On Wed, Jan 21, 2015 at 3:06 PM, Pete Cooper <peter_cooper at apple.com <mailto:peter_cooper at apple.com>> wrote: >> Sounds good to me. Integers it is then. > > > FYI, thanks, I'm just going to commit this then. It seems we're all in essential agreement. We can revert it and take a more cautious approach if something terrible happens. =]-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150421/327165e4/attachment.html>
Thanks for the reply Pete. Unfortunately, I don’t think it is going to be as simple as ignoring those loads which only store. In findCommonType(), only one alloca is passed in at a time. So, while you could find those cases where that alloca was loaded from and stored elsewhere, you can’t find those places that store to that alloca from somewhere else (at least not easily that I can see). So in my particular example, I could catch the case of only load -> store. However, there are other stores that use the alloca address, but it is not readily apparent if they come directly & only from a load. Still trying to figure out the best way forward. Daniel From: Pete Cooper [mailto:peter_cooper at apple.com] Sent: Tuesday, April 21, 2015 2:00 PM To: Daniel Stewart Cc: LLVM Developers Mailing List; Chandler Carruth Subject: Re: [LLVMdev] RFC: Missing canonicalization in LLVM Hi Daniel Thanks for the excellent breakdown of whats going on here. Earlier in the thread on this I made this comment: "The first thing that springs to mind is that I don’t trust the backend to get this right. I don’t think it will understand when an i32 load/store would have been preferable to a float one or vice versa. I have no evidence of this, but given how strongly typed tablegen is, I don’t think it can make a good choice here. So I think we probably need to teach the backend how to undo whatever canonical form we choose if it has a reason to” Without seeing the machine instructions, its hard to be 100% certain, but the case you’ve found may be simple enough that the backend can actually fix this. However, such a fixup would be quite target specific (such a target would need different register classes for integers and doubles in this case), and we’d need such a pass for all targets which isn’t ideal. So i wouldn’t rule out a backend solution, but i have a preference for your suggestion to improve SROA. In this particular case, it makes sense for SROA to do effectively the same analysis InstCombine did here and work out when a load is just raw data vs when its data is used as a specific type. The relevant piece of InstCombine is this: // Try to canonicalize loads which are only ever stored to operate over // integers instead of any other type. We only do this when the loaded type // is sized and has a size exactly the same as its store size and the store // size is a legal integer type. if (!Ty->isIntegerTy() && Ty->isSized() && DL.isLegalInteger(DL.getTypeStoreSizeInBits(Ty)) && DL.getTypeStoreSizeInBits(Ty) == DL.getTypeSizeInBits(Ty)) { if (std::all_of(LI.user_begin(), LI.user_end(), [&LI](User *U) { auto *SI = dyn_cast<StoreInst>(U); return SI && SI->getPointerOperand() != &LI; })) { ... After ignoring load/stores which satisfy something like the above code, you can always fallback to the current code of choosing an integer type, so in the common case there won’t be any behavior difference. Cheers Pete On Apr 21, 2015, at 9:18 AM, Daniel Stewart <stewartd at codeaurora.org <mailto:stewartd at codeaurora.org> > wrote: So this change did indeed have an effect! :) I’m seeing regressions in a number of benchmarks mainly due to a host of extra bitcasts that get introduced. Here’s the problem I’m seeing in a nutshell: 1) There is a Phi with input type double 2) Polly demotes the phi into a load/store of type double 3) InstCombine canonicalizes the load/store to use i64 instead of double 4) SROA removes the load/store & inserts a phi back in, using i64 as the type. Inserts bitcast to get to double. 5) The bitcast sticks around and eventually get translated into FMOVs (for AArch64 at least). The function findCommonType() in SROA.cpp is used to obtain the type that should be used for the new alloca that SROA wants to create. It’s decision process is essentially – if all loads/stores of alloca are the same, use that type; else use the corresponding integer type. This causes bitcasts to be inserted in a number of places, most all of which stick around. I’ve copied a reduced version of an instance of the problem below. I’m looking for comments on what others think is the right solution here. Make SROA more intelligent about picking the type? The code is below with all unnecessary code removed for easy consumption. Daniel Before Polly – Prepare code for polly we have code that looks like: while.cond473: ; preds = %while.cond473.outer78, %while.body475 %p_j_x452.0 = phi double [ %105, %while.body475 ], [ %p_j_x452.0.ph82, %while.cond473.outer78 ] while.body475: ; preds = %while.cond473 %sub480 = fsub fast double %64, %p_j_x452.0 %105 = load double* %x485, align 8, !tbaa !25 After Polly – Prepare code for polly we have: while.cond473: ; preds = %while.cond473.outer78, %while.body475 %p_j_x452.0.reload = load double* %p_j_x452.0.reg2mem while.body475: ; preds = %while.cond473 %sub480 = fsub fast double %64, %p_j_x452.0.reload %110 = load double* %x485, align 8, !tbaa !25 store double %110, double* %p_j_x452.0.reg2mem After Combine redundant instructions : while.cond473: ; preds = %while.cond473.outer78, %while.body475 %p_j_x452.0.reload = load double* %p_j_x452.0.reg2mem, align 8 while.body475: ; preds = %while.cond473 %sub480 = fsub fast double %74, %p_j_x452.0.reload %x485 = getelementptr inbounds %struct.CompAtom* %15, i64 %idxprom482, i32 0, i32 0 %194 = bitcast double* %x485 to i64* %195 = load i64* %194, align 8, !tbaa !25 %200 = bitcast double* %p_j_x452.0.reg2mem to i64* store i64 %195, i64* %200, align 8 After SROA : while.cond473: ; preds = %while.cond473.outer78, %while.body475 %p_j_x452.0.reg2mem.sroa.0.0.p_j_x452.0.reload362 = phi i64 [ %p_j_x452.0.ph73.reg2mem.sroa.0.0.load368, %while.cond473.outer78 ], [ %178, %while.body475 ] %173 = bitcast i64 %p_j_x452.0.reg2mem.sroa.0.0.p_j_x452.0.reload362 to double while.body475: ; preds = %while.cond473 %sub480 = fsub fast double %78, %173 %x485 = getelementptr inbounds %struct.CompAtom* %15, i64 %idxprom482, i32 0, i32 0 %177 = bitcast double* %x485 to i64* %178 = load i64* %177, align 8, !tbaa !25 From: <mailto:llvmdev-bounces at cs.uiuc.edu> llvmdev-bounces at cs.uiuc.edu [ <mailto:llvmdev-bounces at cs.uiuc.edu> mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Chandler Carruth Sent: Wednesday, January 21, 2015 8:32 PM To: Pete Cooper Cc: LLVM Developers Mailing List Subject: Re: [LLVMdev] RFC: Missing canonicalization in LLVM On Wed, Jan 21, 2015 at 3:06 PM, Pete Cooper < <mailto:peter_cooper at apple.com> peter_cooper at apple.com> wrote: Sounds good to me. Integers it is then. FYI, thanks, I'm just going to commit this then. It seems we're all in essential agreement. We can revert it and take a more cautious approach if something terrible happens. =] -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150423/059fbc7c/attachment.html>