Paul C. Anagnostopoulos via llvm-dev
2020-Nov-12 16:23 UTC
[llvm-dev] Musings on the TableGen -emit-dag-isel backend
A rather notorious aspect of TableGen is the time required to run the -emit-dag-isel backend on some targets, including AMDGPU and X86. I added a timing feature to TableGen and timed the AMDGPU run.

===-------------------------------------------------------------------------===
                             TableGen Phase Timing
===-------------------------------------------------------------------------===
  Total Execution Time: 733.6103 seconds (733.8740 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
   645.0017 ( 87.9%)   0.2340 (100.0%)   645.2357 ( 88.0%)   645.2709 ( 87.9%)   Emit matcher table
    70.4501 (  9.6%)   0.0000 (  0.0%)    70.4501 (  9.6%)    70.5510 (  9.6%)   Convert to matchers
    14.6329 (  2.0%)   0.0000 (  0.0%)    14.6329 (  2.0%)    14.7638 (  2.0%)   Parse, build records
     2.1996 (  0.3%)   0.0000 (  0.0%)     2.1996 (  0.3%)     2.1871 (  0.3%)   Sort patterns
     1.0920 (  0.1%)   0.0000 (  0.0%)     1.0920 (  0.1%)     1.0961 (  0.1%)   Optimize matchers
     0.0000 (  0.0%)   0.0000 (  0.0%)     0.0000 (  0.0%)     0.0050 (  0.0%)   Write output
   733.3763 (100.0%)   0.2340 (100.0%)   733.6103 (100.0%)   733.8740 (100.0%)   Total

As you can see, most of the time is spent emitting the C++ code. A simple step toward reducing the time is to use the --omit-comments option. However, I am informed that trying to debug the pattern matching table without comments is a hopeless task.
Here is the timing with --omit-comments:

===-------------------------------------------------------------------------===
                             TableGen Phase Timing
===-------------------------------------------------------------------------===
  Total Execution Time: 162.9274 seconds (162.9173 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
    75.0833 ( 46.1%)   0.0468 ( 42.9%)    75.1301 ( 46.1%)    75.1313 ( 46.1%)   Emit matcher table
    69.7948 ( 42.9%)   0.0000 (  0.0%)    69.7948 ( 42.8%)    69.8050 ( 42.8%)   Convert to matchers
    14.6173 (  9.0%)   0.0468 ( 42.9%)    14.6641 (  9.0%)    14.6668 (  9.0%)   Parse, build records
     2.2308 (  1.4%)   0.0000 (  0.0%)     2.2308 (  1.4%)     2.2191 (  1.4%)   Sort patterns
     1.0920 (  0.7%)   0.0000 (  0.0%)     1.0920 (  0.7%)     1.0921 (  0.7%)   Optimize matchers
     0.0000 (  0.0%)   0.0156 ( 14.3%)     0.0156 (  0.0%)     0.0030 (  0.0%)   Write output
   162.8182 (100.0%)   0.1092 (100.0%)   162.9274 (100.0%)   162.9173 (100.0%)   Total

Emitting the C++ code for most pattern operators is straightforward. However, three operators are more time-consuming: Matcher::Scope, SwitchOpcode, and SwitchType. These operators take a list of child patterns, each of which is emitted as a 1- to 3-byte size followed by the child's pattern bytes. The size is coded as a variable-length sequence of bytes, as is every integer in the matcher table. (Just for interest, the average number of children per Scope operator is about 2.7.)

In order to minimize the length of the size, the backend performs a sort of relaxation algorithm, where it first tries a 1-byte size. If that fails, it tries the actual required number of bytes. Trying involves calculating the offset in the table for the child and then recursively generating the entire child, passing it a string buffer as its output stream. When the size is finally determined, that string buffer is appended to the actual buffer passed to the generation function.

Why does the number of bytes in the size matter to the child?
Because the offset of the child is determined by that size, and the offset is passed to the child and then included in many comments emitted by the child. If the size is wrong, then the offset is wrong, and then the comments are wrong.

So it's clear that repetitive code generation is done because of the relaxation algorithm. How bad is it? It turns out that the algorithm is about O(1.7^(n-1)), where n is the depth of the pattern matching tree. The depth of the AMDGPU tree is 13. Here are some interesting statistics:

  Number of pattern operators at depth 11: 35,000
  Number of times those operators are regenerated: 20 million

I think we have found the problem. So what can be done? I tried a quick and dirty experiment. I forced all the child sizes to occupy a 3-byte length, so that the relaxation algorithm was no longer necessary. The results are shown here.

===-------------------------------------------------------------------------===
                             TableGen Phase Timing
===-------------------------------------------------------------------------===
  Total Execution Time: 90.7302 seconds (90.8242 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
    69.7324 ( 76.9%)   0.0000 (  0.0%)    69.7324 ( 76.9%)    69.7320 ( 76.8%)   Convert to matchers
    14.5705 ( 16.1%)   0.0312 ( 33.3%)    14.6017 ( 16.1%)    14.6958 ( 16.2%)   Parse, build records
     3.0576 (  3.4%)   0.0624 ( 66.7%)     3.1200 (  3.4%)     3.1192 (  3.4%)   Emit matcher table
     2.1840 (  2.4%)   0.0000 (  0.0%)     2.1840 (  2.4%)     2.1891 (  2.4%)   Sort patterns
     1.0920 (  1.2%)   0.0000 (  0.0%)     1.0920 (  1.2%)     1.0831 (  1.2%)   Optimize matchers
     0.0000 (  0.0%)   0.0000 (  0.0%)     0.0000 (  0.0%)     0.0050 (  0.0%)   Write output
    90.6366 (100.0%)   0.0936 (100.0%)    90.7302 (100.0%)    90.8242 (100.0%)   Total

Now the matcher emitter phase is insignificant! Unfortunately, the size of the matcher table increases from 451,430 bytes to 469,612 bytes, an increase of 18,182 bytes or 4%.

So one solution is not to care about the 4% increase and always use 3-byte child sizes.
A second solution is to add an option that specifies the starting number of child size bytes and retains the relaxation algorithm. When building the system, we would specify --child-size-bytes=1. When building for debugging, you could specify --child-size-bytes=3.

Comments and suggestions gratefully accepted.
Krzysztof Parzyszek via llvm-dev
2020-Nov-12 16:43 UTC
[llvm-dev] Musings on the TableGen -emit-dag-isel backend
This is great! Thanks Paul!

I think that the 9x reduction in compile-time is well worth the 4% size increase. TableGen's run-time has been a sore point and a source of complaints for quite some time.

--
Krzysztof Parzyszek  kparzysz at quicinc.com  AI tools development

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Jay Foad via llvm-dev
2020-Nov-13 10:45 UTC
[llvm-dev] Musings on the TableGen -emit-dag-isel backend
I wouldn't want to be too hasty about simply removing the relaxation algorithm. The size and speed of the compiler affects all users, but the time to compile the compiler "only" affects us compiler developers. And I speak as a developer who is heavily affected by the time to compile the AMDGPU backend.

One off-the-cuff idea (I haven't even looked at the code yet): could we pass in a null output stream while doing the relaxation phase, and use that to skip most of the actual textual formatting, followed by a final pass with a real output stream once the offset has been determined?

Jay.
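The null-output-stream idea could look something like the following toy sketch (illustrative Python with invented names; the real emitter is C++). Sizing passes produce no text at all, so the per-byte comment formatting, the expensive part, happens exactly once, in the final pass with a real stream:

```python
import io

def emit_matcher(table_bytes, base_offset, out=None):
    """Illustrative emitter: returns the number of *table* bytes for
    `table_bytes`.  When `out` is a real stream it also writes one C++
    initializer line per byte, with an offset comment; when `out` is
    None (a 'null stream'), all of that string formatting is skipped."""
    if out is not None:
        for i, b in enumerate(table_bytes):
            out.write(f"  {b}, /* offset {base_offset + i} */\n")
    return len(table_bytes)

# Relaxation passes: offsets are still unsettled, so produce no text.
size = emit_matcher(b"\x20\x03\x5a", 0)

# Final pass, once the parent has settled this child's offset.
buf = io.StringIO()
emit_matcher(b"\x20\x03\x5a", 100, buf)
```

The offset-dependent comments are regenerated only in the final pass, so the variable-length sizes (and the smaller table) could be kept without the repeated formatting cost.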