Evan Cheng <evan.cheng <at> apple.com> writes: > Eli is right. We do need to see some benchmark numbers and understand how the pass will fit in the target > independent optimizer. While we encourage contribution, we typically don't commit new passes unless it > introduce new functionalities that have active clients. It would also help if you provide us with compile > time numbers. > > Evan Evan, Let's restart this conversation ... I propose that I test OSR versus LSR compile time performance via opt shell.ll -mem2reg -licm -osr -instcombine -stats -time-passes -disable-output and opt shell.ll -mem2reg -licm -loop-reduce -instcombine -stats -time-passes -disable-output Caveats which may skew any compile time testing: 1. LST is target dependent and was developed to run closer to code generation time. Thus, we are testing LSR out of its normal context. Also as mentioned previously, -instcombine is not normally called after LSR (though LSR does create opportunities for -instcombine) 2. OSR finds loops in the SSA graph and will find more reduction opportunities than LSR. Thus, OSR may spend more time doing strength reduction, since OSR does more work. 3. From Dan Gohman's reply, "LSR's primary goal is to reduce integer register pressure, with strength reduction being just one of its available tools." Thus, LSR may be doing more than just strength reduction. 4. It is nontrivial to compare OSR and LSR optimizations in a given program, especially the larger tests in the test suite where the compile time differences may be the largest. These caveats makes it a bit like comparing apples and oranges. I created a simple test for the shell sort with a triple nested loop. I hand verified that both OSR and LSR performed the same strength reductions. Here are the results: % opt shell.ll -mem2reg -licm -osr -instcombine -stats -time-passes -disable-output ===-------------------------------------------------------------------------=== ... Statistics Collected ... ===-------------------------------------------------------------------------=== 5 instcombine - Number of dead inst eliminated 1 instcombine - Number of instructions sunk 12 instcombine - Number of insts combined 5 mem2reg - Number of PHI nodes inserted 8 mem2reg - Number of alloca's promoted 5 mem2reg - Number of alloca's promoted with a single store 9 osr - Number of operators reduced ===-------------------------------------------------------------------------=== ... Pass execution timing report ... ===-------------------------------------------------------------------------=== Total Execution Time: 0.0058 seconds (0.0058 wall clock) ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 0.0018 ( 34.5%) 0.0001 ( 26.5%) 0.0020 ( 33.8%) 0.0020 ( 34.1%) Combine redundant instructions 0.0007 ( 13.2%) 0.0001 ( 25.5%) 0.0008 ( 14.2%) 0.0008 ( 14.2%) Promote Memory to Register 0.0007 ( 13.8%) 0.0000 ( 6.6%) 0.0008 ( 13.3%) 0.0008 ( 13.2%) Operator Strength Reduction 0.0005 ( 9.2%) 0.0001 ( 16.1%) 0.0006 ( 9.7%) 0.0006 ( 9.6%) Dominator Tree Construction 0.0005 ( 8.7%) 0.0000 ( 5.3%) 0.0005 ( 8.4%) 0.0005 ( 8.4%) Module Verifier 0.0003 ( 6.3%) 0.0000 ( 4.0%) 0.0004 ( 6.2%) 0.0004 ( 6.1%) Loop Invariant Code Motion 0.0003 ( 5.4%) 0.0000 ( 3.6%) 0.0003 ( 5.3%) 0.0003 ( 5.3%) Natural Loop Information 0.0002 ( 4.2%) 0.0000 ( 5.3%) 0.0003 ( 4.3%) 0.0003 ( 4.3%) Dominance Frontier Construction 0.0002 ( 4.3%) 0.0000 ( 4.9%) 0.0003 ( 4.4%) 0.0002 ( 4.3%) Canonicalize natural loops 0.0000 ( 0.2%) 0.0000 ( 1.7%) 0.0000 ( 0.3%) 0.0000 ( 0.3%) No Alias Analysis (always returns 'may' alias) 0.0000 ( 0.2%) 0.0000 ( 0.4%) 0.0000 ( 0.2%) 0.0000 ( 0.2%) Preliminary module verification 0.0053 (100.0%) 0.0005 (100.0%) 0.0058 (100.0%) 0.0058 (100.0%) Total % opt shell.ll -mem2reg -licm -loop-reduce -instcombine -stats -time-passes -disable-output ===-------------------------------------------------------------------------=== ... Statistics Collected ... ===-------------------------------------------------------------------------=== 7 instcombine - Number of dead inst eliminated 2 instcombine - Number of instructions sunk 19 instcombine - Number of insts combined 5 mem2reg - Number of PHI nodes inserted 8 mem2reg - Number of alloca's promoted 5 mem2reg - Number of alloca's promoted with a single store 2 scalar-evolution - Number of loops with predictable loop counts 2 scalar-evolution - Number of loops without predictable loop counts ===-------------------------------------------------------------------------=== ... Pass execution timing report ... ===-------------------------------------------------------------------------=== Total Execution Time: 0.0114 seconds (0.0114 wall clock) ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 0.0050 ( 46.0%) 0.0001 ( 26.6%) 0.0051 ( 45.1%) 0.0051 ( 45.1%) Loop Strength Reduction 0.0020 ( 18.3%) 0.0001 ( 17.2%) 0.0021 ( 18.3%) 0.0021 ( 18.3%) Combine redundant instructions 0.0013 ( 12.0%) 0.0001 ( 12.0%) 0.0014 ( 12.0%) 0.0014 ( 12.0%) Induction Variable Users 0.0005 ( 4.5%) 0.0001 ( 14.8%) 0.0006 ( 5.0%) 0.0006 ( 5.0%) Promote Memory to Register 0.0005 ( 4.8%) 0.0000 ( 5.6%) 0.0005 ( 4.8%) 0.0005 ( 4.8%) Natural Loop Information 0.0003 ( 3.0%) 0.0000 ( 2.6%) 0.0003 ( 3.0%) 0.0003 ( 3.0%) Loop Invariant Code Motion 0.0003 ( 2.4%) 0.0000 ( 3.6%) 0.0003 ( 2.5%) 0.0003 ( 2.5%) Module Verifier 0.0003 ( 2.4%) 0.0000 ( 1.6%) 0.0003 ( 2.4%) 0.0003 ( 2.4%) Induction Variable Users 0.0002 ( 2.1%) 0.0000 ( 2.6%) 0.0002 ( 2.1%) 0.0002 ( 2.2%) Canonicalize natural loops 0.0002 ( 1.7%) 0.0000 ( 7.2%) 0.0002 ( 1.9%) 0.0002 ( 1.9%) Dominator Tree Construction 0.0002 ( 1.6%) 0.0000 ( 1.4%) 0.0002 ( 1.6%) 0.0002 ( 1.6%) Canonicalize natural loops 0.0001 ( 0.8%) 0.0000 ( 1.8%) 0.0001 ( 0.9%) 0.0001 ( 0.9%) Dominance Frontier Construction 0.0000 ( 0.3%) 0.0000 ( 1.2%) 0.0000 ( 0.3%) 0.0000 ( 0.3%) Scalar Evolution Analysis 0.0000 ( 0.1%) 0.0000 ( 1.6%) 0.0000 ( 0.2%) 0.0000 ( 0.2%) No Alias Analysis (always returns 'may' alias) 0.0000 ( 0.0%) 0.0000 ( 0.2%) 0.0000 ( 0.1%) 0.0000 ( 0.0%) Preliminary module verification 0.0109 (100.0%) 0.0005 (100.0%) 0.0114 (100.0%) 0.0114 (100.0%) Total If you think that the above is a fair methodology for comparing OSR and LSR compile times, I can run additional tests. Which tests are an open question. Brian