Dmitry Mikushin
2013-Mar-11 15:49 UTC
[LLVMdev] How to unroll reduction loop with caching accumulator on register?
Dear all,

Attached notunrolled.ll is a module containing a reduction kernel. What I'm trying to do is to unroll it in such a way that the partial reduction over the unrolled iterations is performed in a register, and stored to memory only once. Currently LLVM's unroller, together with all the standard optimizations, produces code that stores the value to memory after every unrolled iteration, which is much less efficient. Do you have an idea which combination of opt passes may help to cache the unrolled loop's stores in a register?

Many thanks,
- D.

Attachments:
- notunrolled.ll (2454 bytes): <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130311/eba8318b/attachment.obj>
- unrolled.ll (6617 bytes): <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130311/eba8318b/attachment-0001.obj>
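[Editor's note] Since the question is easier to see outside the IR, here is a minimal C sketch (illustrative only, not the attached kernel; names are made up) of the two code shapes being discussed: the store-per-iteration form the unroller currently emits, and the register-accumulator form being asked for:

```c
#include <assert.h>

#define N 512

/* Schematically what the unroller currently produces: the running sum is
 * written back through 'out' on every iteration, so each unrolled step
 * carries a load/store pair to memory. */
static float reduce_store_each_iter(const float *a, const float *b, float *out)
{
    *out = 0.0f;
    for (int i = 0; i < N; ++i)
        *out += a[i] * b[i];   /* store to memory every iteration */
    return *out;
}

/* The desired form: accumulate in a local (a register), store once. */
static float reduce_register_acc(const float *a, const float *b, float *out)
{
    float acc = 0.0f;          /* lives in a register */
    for (int i = 0; i < N; ++i)
        acc += a[i] * b[i];
    *out = acc;                /* single store after the loop */
    return *out;
}
```

Both compute the same dot product whenever `out` does not alias `a` or `b`; the point of the thread is getting opt to rewrite the first form into the second automatically.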
Dmitry Mikushin
2013-Mar-11 18:33 UTC
[LLVMdev] How to unroll reduction loop with caching accumulator on register?
I tried to manually assign each of the 3 arrays a unique TBAA node, but it does not seem to help: alias analysis still considers the arrays as may-alias, which most likely prevents the desired optimization. Below is the sample code with the TBAA metadata inserted. Could you please suggest what might be wrong with it?

Many thanks,
- D.

marcusmae at M17xR4:~/forge/llvm$ opt -time-passes -enable-tbaa -tbaa -print-alias-sets -O3 check.ll -o - -S
Alias Set Tracker: 1 alias sets for 3 pointer values.
  AliasSet[0x39046c0, 3] may alias, Mod/Ref   Pointers: (float* inttoptr (i64 47380979712 to float*), 4), (float* %p_newGEPInst9.cloned, 4), (float* %p_newGEPInst12.cloned, 4)

; ModuleID = 'check.ll'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-unknown-unknown"

@__kernelgen_version = constant [15 x i8] c"0.2/1654:1675M\00"

define ptx_kernel void @__kernelgen_matvec_loop_7(i32* nocapture) #0 {
"Loop Function Root":
  %tid.x = tail call ptx_device i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %ctaid.x = tail call ptx_device i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %PositionOfBlockInGrid.x = shl i32 %ctaid.x, 9
  %BlockLB.Add.ThreadPosInBlock.x = add i32 %PositionOfBlockInGrid.x, %tid.x
  %isThreadLBgtLoopUB.x = icmp sgt i32 %BlockLB.Add.ThreadPosInBlock.x, 65535
  br i1 %isThreadLBgtLoopUB.x, label %CUDA.AfterLoop.x, label %CUDA.LoopHeader.x.preheader

CUDA.LoopHeader.x.preheader:                      ; preds = %"Loop Function Root"
  %1 = sext i32 %BlockLB.Add.ThreadPosInBlock.x to i64
  store float 0.000000e+00, float* inttoptr (i64 47380979712 to float*), align 8192, !tbaa !0
  %p_.moved.to.4.cloned = shl nsw i64 %1, 9
  br label %polly.loop_body

CUDA.AfterLoop.x:                                 ; preds = %polly.loop_body, %"Loop Function Root"
  ret void

polly.loop_body:                                  ; preds = %polly.loop_body, %CUDA.LoopHeader.x.preheader
  %_p_scalar_ = phi float [ 0.000000e+00, %CUDA.LoopHeader.x.preheader ], [ %p_8, %polly.loop_body ]
  %polly.loopiv10 = phi i64 [ 0, %CUDA.LoopHeader.x.preheader ], [ %polly.next_loopiv, %polly.loop_body ]
  %polly.next_loopiv = add i64 %polly.loopiv10, 1
  %p_ = add i64 %polly.loopiv10, %p_.moved.to.4.cloned
  %p_newGEPInst9.cloned = getelementptr float* inttoptr (i64 47246749696 to float*), i64 %p_
  %p_newGEPInst12.cloned = getelementptr float* inttoptr (i64 47380971520 to float*), i64 %polly.loopiv10
  %_p_scalar_5 = load float* %p_newGEPInst9.cloned, align 4, !tbaa !1
  %_p_scalar_6 = load float* %p_newGEPInst12.cloned, align 4, !tbaa !2
  %p_7 = fmul float %_p_scalar_5, %_p_scalar_6
  %p_8 = fadd float %_p_scalar_, %p_7
  store float %p_8, float* inttoptr (i64 47380979712 to float*), align 8192, !tbaa !0
  %exitcond = icmp eq i64 %polly.next_loopiv, 512
  br i1 %exitcond, label %CUDA.AfterLoop.x, label %polly.loop_body
}

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #1

declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #1

attributes #0 = { alwaysinline nounwind }
attributes #1 = { nounwind readnone }

!0 = metadata !{metadata !"output", null}
!1 = metadata !{metadata !"input1", null}
!2 = metadata !{metadata !"input2", null}

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0080 seconds (0.0082 wall clock)

   ---User Time---   --User+System--   ---Wall Time---  --- Name ---
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 ( 24.5%)  Print module to stderr
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0006 (  7.9%)  Induction Variable Simplification
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0006 (  7.7%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0004 (  5.2%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0004 (  5.1%)  Alias Set Printer
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0003 (  3.8%)  Combine redundant instructions
   0.0040 ( 50.0%)   0.0040 ( 50.0%)   0.0003 (  3.8%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0003 (  3.8%)  Global Value Numbering
   0.0040 ( 50.0%)   0.0040 ( 50.0%)   0.0003 (  3.7%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0002 (  2.9%)  Early CSE
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0002 (  2.0%)  Reassociate expressions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  1.7%)  Early CSE
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  1.6%)  Natural Loop Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  1.6%)  Interprocedural Sparse Conditional Constant Propagation
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  1.4%)  Loop Invariant Code Motion
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  1.4%)  Module Verifier
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  1.2%)  Simplify the CFG
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  1.1%)  Value Propagation
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  1.0%)  Sparse Conditional Constant Propagation
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  1.0%)  Canonicalize natural loops
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  1.0%)  Dead Store Elimination
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.9%)  Module Verifier
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.8%)  Value Propagation
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.8%)  Simplify the CFG
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.7%)  Deduce function attributes
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.7%)  Remove unused exception handling info
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.6%)  Simplify the CFG
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.6%)  Jump Threading
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.6%)  Simplify the CFG
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.6%)  Simplify the CFG
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.6%)  Dominator Tree Construction
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.6%)  Function Integration/Inlining
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.5%)  Jump Threading
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.5%)  Canonicalize natural loops
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.5%)  Unswitch loops
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.4%)  MemCpy Optimization
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.4%)  Dominator Tree Construction
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.4%)  Loop-Closed SSA Form Pass
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.3%)  Recognize loop idioms
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.3%)  Dominator Tree Construction
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.3%)  Scalar Evolution Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.3%)  Dominator Tree Construction
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.3%)  Basic CallGraph Construction
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.3%)  Dominator Tree Construction
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.3%)  Dominator Tree Construction
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.3%)  Unroll loops
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.3%)  Aggressive Dead Code Elimination
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.3%)  Global Variable Optimizer
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.3%)  Loop-Closed SSA Form Pass
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.2%)  Loop-Closed SSA Form Pass
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.2%)  Inline Cost Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.2%)  Tail Call Elimination
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.2%)  Lazy Value Information Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.2%)  Lazy Value Information Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.2%)  Dead Argument Elimination
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.2%)  Dead Global Elimination
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.1%)  No target information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.1%)  Target independent code generator's TTI
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.1%)  Merge Duplicate Global Constants
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.1%)  Simplify well-known library calls
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.1%)  Memory Dependence Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.1%)  Delete dead loops
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.1%)  SROA
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.1%)  Memory Dependence Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.1%)  Basic Alias Analysis (stateless AA impl)
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.1%)  SROA
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.1%)  Memory Dependence Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.1%)  Lower 'expect' Intrinsics
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.1%)  Rotate Loops
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.1%)  Promote 'by reference' arguments to scalars
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.1%)  Preliminary module verification
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.1%)  No Alias Analysis (always returns 'may' alias)
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  No target information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Library Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Strip Unused Function Prototypes
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  No Alias Analysis (always returns 'may' alias)
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Type-Based Alias Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Preliminary module verification
   0.0080 (100.0%)   0.0080 (100.0%)   0.0082 (100.0%)  Total

2013/3/11 Dmitry Mikushin <dmitry at kernelgen.org>:
> Attached notunrolled.ll is a module containing a reduction kernel. What I'm
> trying to do is to unroll it in such a way that the partial reduction over
> the unrolled iterations is performed in a register, and stored to memory
> only once. [...]
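[Editor's note] The may-alias verdict above is exactly what blocks the transformation: if the output pointer could really overlap one of the inputs, sinking the store out of the loop would change the result. A small illustrative C program (hypothetical values, not the kernel from the thread) shows why the compiler must first prove no-alias:

```c
#include <assert.h>

#define N 8

/* Reduction that stores the partial sum through 'out' on every iteration,
 * exactly like the store inside polly.loop_body above. If 'out' aliases an
 * element of 'b', later iterations read back the partial sum instead of the
 * original input, so the per-iteration store is observable behavior that a
 * may-alias analysis result forbids the compiler to remove. */
static float reduce(const float *a, const float *b, float *out)
{
    *out = 0.0f;
    for (int i = 0; i < N; ++i)
        *out += a[i] * b[i];
    return *out;
}
```

With `a[i] = b[i] = 1.0f`, the non-aliased call returns 8.0, but calling `reduce(a, b, &b[0])` returns 7.0: the initial `*out = 0` clobbers `b[0]` before iteration 0 reads it. A register-accumulator version would return 8.0 in both cases, which is precisely the semantic difference alias analysis has to rule out.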
Dmitry Mikushin
2013-Mar-27 00:11 UTC
[LLVMdev] How to unroll reduction loop with caching accumulator on register?
Just for the record, here's what I was doing wrong.

!0 = metadata !{metadata !"output", null}
!1 = metadata !{metadata !"input1", null}
!2 = metadata !{metadata !"input2", null}

should be

!0 = metadata !{ }
!1 = metadata !{ metadata !"output", metadata !0 }
!2 = metadata !{ metadata !"input1", metadata !0 }
!3 = metadata !{ metadata !"input2", metadata !0 }

with the corresponding renaming of nodes: the three tags must share a common root node, rather than each having a null parent. With this metadata, opt -O3 successfully pulls the store out of the loop:

; ModuleID = 'check.ll'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-unknown-unknown"

@__kernelgen_version = constant [15 x i8] c"0.2/1654:1675M\00"

define ptx_kernel void @__kernelgen_matvec_loop_7(i32* nocapture) nounwind alwaysinline {
"Loop Function Root":
  %tid.x = tail call ptx_device i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %ctaid.x = tail call ptx_device i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %PositionOfBlockInGrid.x = shl i32 %ctaid.x, 9
  %BlockLB.Add.ThreadPosInBlock.x = add i32 %PositionOfBlockInGrid.x, %tid.x
  %isThreadLBgtLoopUB.x = icmp sgt i32 %BlockLB.Add.ThreadPosInBlock.x, 65535
  br i1 %isThreadLBgtLoopUB.x, label %CUDA.AfterLoop.x, label %CUDA.LoopHeader.x.preheader

CUDA.LoopHeader.x.preheader:                      ; preds = %"Loop Function Root"
  %1 = sext i32 %BlockLB.Add.ThreadPosInBlock.x to i64
  store float 0.000000e+00, float* inttoptr (i64 47380979712 to float*), align 8192, !tbaa !0
  %p_.moved.to.4.cloned = shl nsw i64 %1, 9
  br label %polly.loop_body

CUDA.AfterLoop.x.loopexit:                        ; preds = %polly.loop_body
  store float %p_8, float* inttoptr (i64 47380979712 to float*), align 8192
  br label %CUDA.AfterLoop.x

CUDA.AfterLoop.x:                                 ; preds = %CUDA.AfterLoop.x.loopexit, %"Loop Function Root"
  ret void

polly.loop_body:                                  ; preds = %polly.loop_body, %CUDA.LoopHeader.x.preheader
  %_p_scalar_ = phi float [ 0.000000e+00, %CUDA.LoopHeader.x.preheader ], [ %p_8, %polly.loop_body ]
  %polly.loopiv10 = phi i64 [ 0, %CUDA.LoopHeader.x.preheader ], [ %polly.next_loopiv, %polly.loop_body ]
  %polly.next_loopiv = add i64 %polly.loopiv10, 1
  %p_ = add i64 %polly.loopiv10, %p_.moved.to.4.cloned
  %p_newGEPInst9.cloned = getelementptr float* inttoptr (i64 47246749696 to float*), i64 %p_
  %p_newGEPInst12.cloned = getelementptr float* inttoptr (i64 47380971520 to float*), i64 %polly.loopiv10
  %_p_scalar_5 = load float* %p_newGEPInst9.cloned, align 4, !tbaa !2
  %_p_scalar_6 = load float* %p_newGEPInst12.cloned, align 4, !tbaa !3
  %p_7 = fmul float %_p_scalar_5, %_p_scalar_6
  %p_8 = fadd float %_p_scalar_, %p_7
  %exitcond = icmp eq i64 %polly.next_loopiv, 512
  br i1 %exitcond, label %CUDA.AfterLoop.x.loopexit, label %polly.loop_body
}

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() nounwind readnone

declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() nounwind readnone

!0 = metadata !{metadata !"output", metadata !1}
!1 = metadata !{}
!2 = metadata !{metadata !"input1", metadata !1}
!3 = metadata !{metadata !"input2", metadata !1}

2013/3/11 Dmitry Mikushin <dmitry at kernelgen.org>:
> I tried to manually assign each of the 3 arrays a unique TBAA node, but it
> does not seem to help: alias analysis still considers the arrays as may-alias,
> which most likely prevents the desired optimization. [...]
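[Editor's note] The no-overlap guarantee that the common-root TBAA tags express at the IR level is what `restrict` expresses at the C source level; once the compiler has it, the store in the loop can legally be kept in a register and committed once, exactly as in the optimized IR above. A minimal sketch (illustrative, not the original kernel):

```c
#include <assert.h>

#define N 512

/* 'restrict' promises that out, a and b never overlap, analogous to the
 * distinct same-root TBAA tags in the fixed module: with that guarantee the
 * optimizer may promote *out to a register inside the loop and emit a
 * single store in the loop exit block. */
static float matvec_row(const float *restrict a,
                        const float *restrict b,
                        float *restrict out)
{
    *out = 0.0f;
    for (int i = 0; i < N; ++i)
        *out += a[i] * b[i];   /* eligible for store promotion */
    return *out;
}
```

The transformation itself is done by LICM's scalar promotion, which is why the may-alias result in the earlier message was enough to block it.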