Sergei Larin
2013-Mar-01 16:24 UTC
[LLVMdev] Interesting post increment situation in DAG combiner
Hal, (and everyone who might care about post increment generation)... I have an interesting question/observation. Consider this vector loop. void vec_add_const(unsigned N, short __attribute__ ((aligned (16))) *A, short __attribute__ ((aligned (16))) val) { unsigned i,j; for (i=0; i<N; i++) { for (j=0; j<N; j++) { A[i*N+j] += val; } } } The innermost loop looks like this right before the DAG selection begins. p.loop_body.us65: ; preds %p.loop_body.lr.ph.us78, %p.loop_body.us65 %p_arrayidx.us69.phi = phi i16* [ %p_arrayidx.us69.gep, %p.loop_body.lr.ph.us78 ], [ %p_arrayidx.us69.inc, %p.loop_body.us65 ] %p.loopiv48.us66 = phi i32 [ 0, %p.loop_body.lr.ph.us78 ], [ %p.next_loopiv.us67, %p.loop_body.us65 ] %vector_ptr.us70 = bitcast i16* %p_arrayidx.us69.phi to <4 x i16>* %p.next_loopiv.us67 = add nsw i32 %p.loopiv48.us66, 4 <<<<<<<<<<<<<<<<<< IV %_p_vec_full.us71 = load <4 x i16>* %vector_ptr.us70, align 16 <<<<<<<<<<<<<<<<<<<Load %add5p_vec.us72 = add <4 x i16> %_p_vec_full.us71, %5 store <4 x i16> %add5p_vec.us72, <4 x i16>* %vector_ptr.us70, align 16 <<<<<<<<<<<<<<<Store %p_arrayidx.us69.inc = getelementptr i16* %p_arrayidx.us69.phi, i32 4 <<<<<<<<<<<<<<< Common Ptr %11 = icmp slt i32 %p.next_loopiv.us67, %leftover_lb br i1 %11, label %p.loop_body.us65, label %p.loop_header38.preheader.us84 When it gets to the DAG Combiner, in CombineToPostIndexedLoadStore() two opportunities for post inc are recognized - the load and the store. Now, you can easily see that in this case you would want the store to get the post inc, not the load, but since the DAG combiner simply scans top-down, the opposite happens. So here is the question - do you recognize this as a deficiency, and can you see the same in PPC? The fix is code trivial, but it would introduce a general concept of a primitive cost function to the PostInc candidacy selection in DAG combine. If you recognize the issue, I will post a patch with more details, but if I am missing the big picture here, please advise. Sergei --- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Hal Finkel
2013-Mar-01 22:52 UTC
[LLVMdev] Interesting post increment situation in DAG combiner
----- Original Message -----> From: "Sergei Larin" <slarin at codeaurora.org> > To: "Hal Finkel" <hfinkel at anl.gov> > Cc: llvmdev at cs.uiuc.edu > Sent: Friday, March 1, 2013 10:24:39 AM > Subject: Interesting post increment situation in DAG combiner > > Hal, (and everyone who might care about post increment generation)...Sergei, Perhaps this is a problem that I wish that I had ;) -- PPC does not have post-increment loads and stores, only pre-increment ones. Nevertheless, I think that the situation is similar... For one thing, I recently committed an enhancement to pre-increment generation that causes later constant offsets to use the new incremented address instead of the old address. I thought that this would not be an issue for post-increment generation, but it seems that I was wrong: you'd have the same problem here: if you post-increment the load and not the store, then you might need an extra register to hold the original base address for the store. In this case, you'd not have a problem if you just chose to post-increment the store instead, but the general problem still exists. Regarding the selection of which load/store to pre/post increment, I think that this is also a general issue. At least for pre-increment generation I've seen it make some odd choices. In short, I do recognize the issue, and I'm curious to see your patch. If it can improve pre-increment selection as well, that would help me too. Thanks again, Hal> > I have an interesting question/observation. Consider this vector > loop. > > void vec_add_const(unsigned N, short __attribute__ ((aligned (16))) > *A, > short __attribute__ ((aligned > (16))) val) > { > unsigned i,j; > for (i=0; i<N; i++) { > for (j=0; j<N; j++) { > A[i*N+j] += val; > } > } > } > > The innermost loop looks like this right before the DAG selection > begins. > > p.loop_body.us65: ; preds > %p.loop_body.lr.ph.us78, %p.loop_body.us65 > %p_arrayidx.us69.phi = phi i16* [ %p_arrayidx.us69.gep, > %p.loop_body.lr.ph.us78 ], [ %p_arrayidx.us69.inc, %p.loop_body.us65 > ] > %p.loopiv48.us66 = phi i32 [ 0, %p.loop_body.lr.ph.us78 ], [ > %p.next_loopiv.us67, %p.loop_body.us65 ] > %vector_ptr.us70 = bitcast i16* %p_arrayidx.us69.phi to <4 x i16>* > %p.next_loopiv.us67 = add nsw i32 %p.loopiv48.us66, 4 > <<<<<<<<<<<<<<<<<< > IV > %_p_vec_full.us71 = load <4 x i16>* %vector_ptr.us70, align 16 > <<<<<<<<<<<<<<<<<<<Load > %add5p_vec.us72 = add <4 x i16> %_p_vec_full.us71, %5 > store <4 x i16> %add5p_vec.us72, <4 x i16>* %vector_ptr.us70, align > 16 > <<<<<<<<<<<<<<<Store > %p_arrayidx.us69.inc = getelementptr i16* %p_arrayidx.us69.phi, i32 > 4 > <<<<<<<<<<<<<<< Common Ptr > %11 = icmp slt i32 %p.next_loopiv.us67, %leftover_lb > br i1 %11, label %p.loop_body.us65, label > %p.loop_header38.preheader.us84 > > When it gets to the DAG Combiner, in CombineToPostIndexedLoadStore() > two > opportunities for post inc are recognized - the load and the store. > Now, you can easily see that in this case you would want the store to > get > the post inc, not the load, but since the DAG combiner simply scans > top-down, the opposite happens. > > So here is the question - do you recognize this as a deficiency, > and can > you see the same in PPC? The fix is code trivial, but it would > introduce a > general concept of a primitive cost function to the PostInc candidacy > selection in DAG combine. If you recognize the issue, I will post a > patch > with more details, but if I am missing the big picture here, please > advise. > > Sergei > > > --- > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, > hosted by > The Linux Foundation > > > >
Sergei Larin
2013-Mar-01 23:16 UTC
[LLVMdev] Interesting post increment situation in DAG combiner
Hal, Here is my patch for the post inc case. I think it is symmetrically applicable to the pre-inc, but I have not tested it for that. I think you can clearly see my intent here - I simply select the "latest" candidate when multiple are available. Who else might be interested in this? Sergei --- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation> -----Original Message----- > From: Hal Finkel [mailto:hfinkel at anl.gov] > Sent: Friday, March 01, 2013 4:53 PM > To: Sergei Larin > Cc: llvmdev at cs.uiuc.edu > Subject: Re: Interesting post increment situation in DAG combiner > > ----- Original Message ----- > > From: "Sergei Larin" <slarin at codeaurora.org> > > To: "Hal Finkel" <hfinkel at anl.gov> > > Cc: llvmdev at cs.uiuc.edu > > Sent: Friday, March 1, 2013 10:24:39 AM > > Subject: Interesting post increment situation in DAG combiner > > > > Hal, (and everyone who might care about post increment generation)... > > Sergei, > > Perhaps this is a problem that I wish that I had ;) -- PPC does not > have post-increment loads and stores, only pre-increment ones. > Nevertheless, I think that the situation is similar... > > For one thing, I recently committed an enhancement to pre-increment > generation that causes later constant offsets to use the new > incremented address instead of the old address. I thought that this > would not be an issue for post-increment generation, but it seems that > I was wrong: you'd have the same problem here: if you post-increment > the load and not the store, then you might need an extra register to > hold the original base address for the store. In this case, you'd not > have a problem if you just chose to post-increment the store instead, > but the general problem still exists. > > Regarding the selection of which load/store to pre/post increment, I > think that this is also a general issue. At least for pre-increment > generation I've seen it make some odd choices. > > In short, I do recognize the issue, and I'm curious to see your patch. > If it can improve pre-increment selection as well, that would help me > too. > > Thanks again, > Hal > > > > > I have an interesting question/observation. Consider this vector > loop. > > > > void vec_add_const(unsigned N, short __attribute__ ((aligned (16))) > > *A, > > short __attribute__ ((aligned > > (16))) val) { unsigned i,j; for > > (i=0; i<N; i++) { > > for (j=0; j<N; j++) { > > A[i*N+j] += val; > > } > > } > > } > > > > The innermost loop looks like this right before the DAG selection > > begins. > > > > p.loop_body.us65: ; preds > > %p.loop_body.lr.ph.us78, %p.loop_body.us65 > > %p_arrayidx.us69.phi = phi i16* [ %p_arrayidx.us69.gep, > > %p.loop_body.lr.ph.us78 ], [ %p_arrayidx.us69.inc, %p.loop_body.us65 > ] > > %p.loopiv48.us66 = phi i32 [ 0, %p.loop_body.lr.ph.us78 ], [ > > %p.next_loopiv.us67, %p.loop_body.us65 ] > > %vector_ptr.us70 = bitcast i16* %p_arrayidx.us69.phi to <4 x i16>* > > %p.next_loopiv.us67 = add nsw i32 %p.loopiv48.us66, 4 > > <<<<<<<<<<<<<<<<<< > > IV > > %_p_vec_full.us71 = load <4 x i16>* %vector_ptr.us70, align 16 > > <<<<<<<<<<<<<<<<<<<Load > > %add5p_vec.us72 = add <4 x i16> %_p_vec_full.us71, %5 > > store <4 x i16> %add5p_vec.us72, <4 x i16>* %vector_ptr.us70, align > > 16 > > <<<<<<<<<<<<<<<Store > > %p_arrayidx.us69.inc = getelementptr i16* %p_arrayidx.us69.phi, i32 > > 4 > > <<<<<<<<<<<<<<< Common Ptr > > %11 = icmp slt i32 %p.next_loopiv.us67, %leftover_lb > > br i1 %11, label %p.loop_body.us65, label > > %p.loop_header38.preheader.us84 > > > > When it gets to the DAG Combiner, in CombineToPostIndexedLoadStore() > > two opportunities for post inc are recognized - the load and the > > store. > > Now, you can easily see that in this case you would want the store to > > get the post inc, not the load, but since the DAG combiner simply > > scans top-down, the opposite happens. > > > > So here is the question - do you recognize this as a deficiency, > > and can > > you see the same in PPC? The fix is code trivial, but it would > > introduce a general concept of a primitive cost function to the > > PostInc candidacy selection in DAG combine. If you recognize the > > issue, I will post a patch with more details, but if I am missing the > > big picture here, please advise. > > > > Sergei > > > > > > --- > > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, > > hosted by The Linux Foundation > > > > > > > >-------------- next part -------------- A non-text attachment was scrubbed... Name: DagCombiner_PostInc.patch Type: application/octet-stream Size: 3444 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130301/97fc5162/attachment.obj>
Possibly Parallel Threads
- [LLVMdev] Interesting post increment situation in DAG combiner
- [LLVMdev] Interesting post increment situation in DAG combiner
- [LLVMdev] parallel loop metadata simplification
- [LLVMdev] parallel loop metadata simplification
- [LLVMdev] How to unroll reduction loop with caching accumulator on register?