Riyaz Puthiyapurayil via llvm-dev
2019-Jan-29 19:27 UTC
[llvm-dev] Early Tail Duplication Inefficiency
I have a file for which clang-7 takes over 2 hours to compile with -O3. For the same file, clang-5 takes less than 2 minutes (which is also high IMHO). I will try to create a test case, but the file is pretty simple: it only contains initializations of many arrays of structs, where the structs are of the following form:

    struct Foo {
        EnumType1 e1;      // there are 700+ enum labels
        std::string s1;
        EnumType2 e2;      // 5 possible values for e2
        std::string s2;
        std::string s3;
    };

    // A large array with 10K+ elements
    Foo array1[] = {
        { EnumType1Label1, "some string", EnumType2Label1, "another string", "yet another string" },
        :
        :
    };

    // 11 more arrays like above but most of them have only a few hundred elements
    :
    :

I would like to know if a similar problem has been reported before. A quick search didn't find anything...

Clang-5 -ftime-report shows:

===-------------------------------------------------------------------------==
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------==
  Total Execution Time: 175.9183 seconds (176.1324 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  38.9181 ( 22.4%)   0.0070 (  0.4%)  38.9251 ( 22.1%)  38.9712 ( 22.1%)  Simple Register Coalescing
  34.0788 ( 19.6%)   0.7859 ( 41.5%)  34.8647 ( 19.8%)  34.9069 ( 19.8%)  SROA
  18.8351 ( 10.8%)   0.0070 (  0.4%)  18.8421 ( 10.7%)  18.8652 ( 10.7%)  Function Integration/Inlining
  18.0393 ( 10.4%)   0.0010 (  0.1%)  18.0403 ( 10.3%)  18.0624 ( 10.3%)  Branch Probability Basic Block Placement
  17.5163 ( 10.1%)   0.3060 ( 16.2%)  17.8223 ( 10.1%)  17.8458 ( 10.1%)  Merge disjoint stack slots
  14.4318 (  8.3%)   0.0000 (  0.0%)  14.4318 (  8.2%)  14.4495 (  8.2%)  Control Flow Optimizer
   6.5960 (  3.8%)   0.6219 ( 32.9%)   7.2179 (  4.1%)   7.2315 (  4.1%)  X86 DAG->DAG Instruction Selection
   2.1577 (  1.2%)   0.0040 (  0.2%)   2.1617 (  1.2%)   2.1643 (  1.2%)  Greedy Register Allocator
   0.9539 (  0.5%)   0.0000 (  0.0%)   0.9539 (  0.5%)   0.9543 (  0.5%)  Combine redundant instructions
  :
  :
Clang-7 -ftime-report shows:

===-------------------------------------------------------------------------==
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------==
  Total Execution Time: 7920.0840 seconds (8021.0405 wall clock)

     ---User Time---     --System Time--     --User+System--     ---Wall Time---  --- Name ---
  6660.8174 ( 88.0%)   209.5201 ( 60.3%)  6870.3375 ( 86.7%)  6957.8224 ( 86.7%)  Early Tail Duplication
   674.2655 (  8.9%)     0.0550 (  0.0%)   674.3205 (  8.5%)   675.1329 (  8.4%)  Jump Threading
    89.1534 (  1.2%)     8.1488 (  2.3%)    97.3022 (  1.2%)    97.6368 (  1.2%)  Merge disjoint stack slots
     2.4886 (  0.0%)    73.3249 ( 21.1%)    75.8135 (  1.0%)    79.9594 (  1.0%)  Eliminate PHI nodes for register allocation
     9.2116 (  0.1%)    52.7120 ( 15.2%)    61.9236 (  0.8%)    62.1655 (  0.8%)  Slot index numbering
    34.4118 (  0.5%)     2.5066 (  0.7%)    36.9184 (  0.5%)    44.2757 (  0.6%)  SROA
    35.2266 (  0.5%)     0.0000 (  0.0%)    35.2266 (  0.4%)    35.2803 (  0.4%)  Simple Register Coalescing
    18.0892 (  0.2%)     0.0020 (  0.0%)    18.0912 (  0.2%)    18.1253 (  0.2%)  Function Integration/Inlining
     7.2959 (  0.1%)     0.5739 (  0.2%)     7.8698 (  0.1%)     7.9301 (  0.1%)  X86 DAG->DAG Instruction Selection
     6.5990 (  0.1%)     0.0000 (  0.0%)     6.5990 (  0.1%)     6.6072 (  0.1%)  Branch Probability Basic Block Placement
     2.8736 (  0.0%)     0.0060 (  0.0%)     2.8796 (  0.0%)     2.8831 (  0.0%)  Greedy Register Allocator
     2.0147 (  0.0%)     0.1890 (  0.1%)     2.2037 (  0.0%)     2.2095 (  0.0%)  Global Value Numbering
     2.1347 (  0.0%)     0.0010 (  0.0%)     2.1357 (  0.0%)     2.1456 (  0.0%)  Call-site splitting
     1.5878 (  0.0%)     0.0010 (  0.0%)     1.5888 (  0.0%)     1.6079 (  0.0%)  Combine redundant instructions
     1.4358 (  0.0%)     0.0020 (  0.0%)     1.4378 (  0.0%)     1.4569 (  0.0%)  Two-Address instruction pass
     0.9689 (  0.0%)     0.0000 (  0.0%)     0.9689 (  0.0%)     0.9748 (  0.0%)  Live Interval Analysis
  :
  :
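For anyone who wants something compilable, here is a minimal sketch of the file layout described in this message. It is only an illustration: the extra enum labels (`EnumType1Label2`, `EnumType1Label3`, `EnumType2Label2`), the second and third rows, and the `row` accessor are made up, and the real file has 700+ labels and 10K+ rows. Since `std::string` cannot be constant-initialized from a literal in the language modes these compilers default to, an array like this is built by constructor calls in a start-up initializer; that large initializer is presumably what the passes in the reports above spend their time on, although nothing in this message confirms it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins: the real enums have 700+ and 5 labels respectively.
enum EnumType1 { EnumType1Label1, EnumType1Label2, EnumType1Label3 };
enum EnumType2 { EnumType2Label1, EnumType2Label2 };

struct Foo {
    EnumType1 e1;
    std::string s1;
    EnumType2 e2;
    std::string s2;
    std::string s3;
};

// Three rows for illustration; only the first one appears in the message.
Foo array1[] = {
    { EnumType1Label1, "some string", EnumType2Label1, "another string", "yet another string" },
    { EnumType1Label2, "b1",          EnumType2Label2, "b2",             "b3" },
    { EnumType1Label3, "c1",          EnumType2Label1, "c2",             "c3" },
};

// Element count, computed from the array itself.
constexpr std::size_t array1_size = sizeof(array1) / sizeof(array1[0]);

// Bounds-checked accessor (a helper invented for this sketch).
const Foo& row(std::size_t i) {
    assert(i < array1_size);
    return array1[i];
}
```

Every row contributes three `std::string` constructions, so the amount of start-up code grows with the number of rows.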
Riyaz Puthiyapurayil via llvm-dev
2019-Jan-30 18:00 UTC
[llvm-dev] Early Tail Duplication Inefficiency
I didn't see any response on this. Is there any way to turn off early tail duplication with a clang-7 option (other than completely turning off all optimizations)?

The issue is reproducible with a very simple test case. Clang-7 with optimizations turned on takes hours compared to minutes with clang-5.0. Here is a simple cooked-up test (the real-life example is of course different, but this simple test exposes the same inefficiency):

    // test.cpp
    #include <string>

    struct Foo {
        std::string s1;
        std::string s2;
        std::string s3;
    };

    Foo Array[] = {
        { "0", "0", "0" },
        { "1", "1", "1" },
        :
        :
        :
        { "9999", "9999", "9999" }
    };

Compile:

    % clang++ -c -O3 test.cpp
    ...
    Takes hours!
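Since the test case above elides most of its rows, a small generator may be the easiest way to recreate it. The following is only a sketch: `write_repro`, `make_repro`, and the `count` parameter are names invented here, and the output follows the shape of the test shown in this message (rows "0" through "9999" when `count` is 10000).

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Write a test.cpp-style reproducer with `count` rows to `out`.
// `count` is a parameter introduced for this sketch; the message uses 10000 rows.
void write_repro(std::ostream& out, int count) {
    assert(count > 0);
    out << "// test.cpp\n"
        << "#include <string>\n\n"
        << "struct Foo {\n"
        << "    std::string s1;\n"
        << "    std::string s2;\n"
        << "    std::string s3;\n"
        << "};\n\n"
        << "Foo Array[] = {\n";
    for (int i = 0; i < count; ++i) {
        // Each row repeats its index three times, e.g. { "7", "7", "7" }.
        out << "    { \"" << i << "\", \"" << i << "\", \"" << i << "\" }"
            << (i + 1 < count ? ",\n" : "\n");
    }
    out << "};\n";
}

// Convenience wrapper returning the generated source as a string.
std::string make_repro(int count) {
    std::ostringstream out;
    write_repro(out, count);
    return out.str();
}
```

Calling `write_repro(std::cout, 10000)` from a small `main` (with `<iostream>` included) and redirecting the output to `test.cpp` should give a file of the same shape. Generating smaller counts as well may help show how compile time grows with the number of rows.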
Craig Topper via llvm-dev
2019-Jan-30 18:19 UTC
[llvm-dev] Early Tail Duplication Inefficiency
Try passing "-mllvm -disable-early-taildup=true" to clang.

~Craig
Florian Hahn via llvm-dev
2019-Jan-30 20:02 UTC
[llvm-dev] Early Tail Duplication Inefficiency
> On Jan 30, 2019, at 18:00, Riyaz Puthiyapurayil via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> The issue is reproducible with a very simple test case. Clang-7 with optimizations turned on takes hours compared to minutes with clang-5.0.

I think a bug report would be useful to track the issue. It would be great if you could file one at https://bugs.llvm.org/, ideally with a reproducer.

Thanks,
Florian