Old bug, but I decided to use some modern hardware to do some analysis on it for fun. I updated the Bugzilla report, but it was suggested that I should also share it with llvmdev for broader exposure. The text from the bug report is copied below, and the PPT is attached to this mail.

Useful for anyone interested in or troubled by code alignment issues on IA.

Thanks,
Zia.

--------------------------------------------------------

Comment 4 Zia Ansari 2015-05-28 17:49:13 CDT

I know this is super old, but I took a quick look at this issue and the test-case attached to pr3120 to see if anything jumped out at me. Mostly for educational purposes, and also to see if there are any opportunities.

Since this report is very old, it's unclear on which architecture the performance swings were reported and, perhaps more importantly, whether we still care about those architectures today.

I chose to play around with it a little on today's hardware to see if there are still any alignment issues. It turned out that with "0 mod 32" vs. "16 mod 32" byte alignment, the benchmark did show significant swings (~50%-70%) on an IVB and HSW.

The reason for the swings wasn't immediately obvious, but some deeper analysis pointed me to the issue being within the DSB (the post-decode uop cache). I wrote up a detailed presentation of what's going on so that I could share it with the rest of my team for educational purposes (attached).

The quick summary is: the DSB caches post-decoded uops that are frequently executed so that front-end pipeline stages and overhead can be bypassed, allowing 32B worth of instructions to be fed per clock instead of 16B. The DSB allows 3 ways (each of which can hold 6 uops) to be allocated to each 32B chunk of instructions (by IP address). Unconditional branches always end a way. If the code is aligned and laid out in such a way as to require more than 3 ways per 32B chunk of instructions, as can happen in tightly packed code with lots of JMP instructions, then we can get into situations where execution keeps flip-flopping between the DSB and the front end. This is inefficient and incurs additional penalties. You can find additional details in the presentation, and also in the public Intel Optimization Manual.

It's tricky to decide whether something can or should be done about this. One option is to pad code whenever we detect multiple jmp instructions in a potential 32B chunk of instructions (specifically, more than 3). This may cause unnecessary code bloat with no payoff, but the pattern could also be rare enough that the padding is insignificant while still boosting performance in those rare cases. I plan on playing around with this a little to see how many cases we can catch in SPEC, for example, and measure bloat vs. perf to see if it's a viable solution.

The other option would be to do nothing, and make do with simply understanding what the problem is so that it can be identified in the future. Architectures change rapidly, and this could be something that goes away soon.

In either case, I'll probably pursue the first option above and report back on what I find.

Regarding the other details reported in this issue, I realize that the slow vs. fast cases both had 0 mod 32 byte alignment. It's hard to analyze what the issue was there without having the exact code and the exact (old) architecture on which it was run. If I had to guess, I would say that it was a case of unfortunate aliasing in the branch prediction buffer, causing differences in the prediction of one of the many branches, particularly the indirect branch, which is known to have prediction issues on some older architectures.

Feel free to contact me if you'd like additional info.

Thanks,
Zia Ansari.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: bugzilla-5615-presentation-public.pptx
Type: application/vnd.openxmlformats-officedocument.presentationml.presentation
Size: 165504 bytes
Desc: bugzilla-5615-presentation-public.pptx
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150610/024ed87f/attachment.pptx>
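As a rough illustration of the DSB rule summarized in the message above (3 ways per 32B chunk, 6 uops per way, an unconditional branch always ends a way), the following Python sketch counts the ways a chunk would need and shows how the same code can fit at one alignment and overflow at another. It is a simplified model, not a description of the hardware: the function names, the instruction tuples, the sample layout, and the assumption that an instruction belongs to the chunk containing its first byte are all made up for illustration. The Intel Optimization Manual lists further restrictions that are ignored here.

```python
from collections import defaultdict

CHUNK_BYTES = 32   # the DSB tracks code in 32B chunks (by IP address)
MAX_WAYS = 3       # ways that can be allocated to one 32B chunk
UOPS_PER_WAY = 6   # uops a single way can hold


def ways_needed(insts):
    """Count the ways needed by the instructions of one 32B chunk.

    insts: list of (uops, is_unconditional_jump) in program order.
    Assumes every instruction decodes to at most UOPS_PER_WAY uops.
    """
    ways = 0
    used = 0  # uops in the currently open way; 0 means no way is open
    for uops, is_jmp in insts:
        if used == 0 or used + uops > UOPS_PER_WAY:
            ways += 1   # open a new way
            used = 0
        used += uops
        if is_jmp:
            used = 0    # an unconditional branch always ends the way
    return ways


def chunks_over_limit(program, base):
    """Return start addresses of 32B chunks needing more than MAX_WAYS ways.

    program: list of (length_bytes, uops, is_unconditional_jump).
    base: address of the first instruction.
    """
    chunks = defaultdict(list)
    addr = base
    for length, uops, is_jmp in program:
        chunks[addr // CHUNK_BYTES].append((uops, is_jmp))
        addr += length
    return [index * CHUNK_BYTES
            for index, insts in sorted(chunks.items())
            if ways_needed(insts) > MAX_WAYS]


# Hypothetical layout: 16B of straight-line code, four small blocks that
# each end in a 5-byte jmp, then 16B of straight-line code again.
straight = [(4, 1, False)] * 4
jmp_block = [(3, 1, False), (5, 1, True)]
program = straight + jmp_block * 4 + straight

print(chunks_over_limit(program, base=0))   # [] : jmps split across two chunks
print(chunks_over_limit(program, base=16))  # [32] : four jmps share one chunk
```

With the code starting at 0 mod 32, the four jmps are split two and two across neighbouring chunks and each chunk fits in 3 ways. Shifting the same code to 16 mod 32 puts all four jmps into the chunk at address 32, which would then need 4 ways.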
Interesting findings, thanks for sharing. I'd be interested in seeing any prototype patches you have for this. My frontend (Java) is likely to be generating code that is more branch heavy than your typical C code, so I'd be curious to see if the tradeoffs are different. I'd be happy to apply a patch locally and report back on the big picture impact.

If it does turn out to be profitable to nop pad in the way you describe, we could potentially apply this only to hot loops. Using profile data to guide when we pad vs. don't pad, we might be able to avoid excessive code bloat while still getting the improvements you describe.

Philip

On 06/10/2015 09:52 AM, Ansari, Zia wrote:
> [...]
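One possible reading of the profile-guided idea is a filter on top of the jmp-density check: pad a 32B chunk only if it contains more than 3 jmp instructions and the profile says it is hot. The Python sketch below shows just that selection step. The `Chunk` record, its `execution_count` field, and the `hot_threshold` value are hypothetical placeholders and do not correspond to anything in LLVM.

```python
from dataclasses import dataclass

MAX_JMPS_PER_CHUNK = 3  # more than 3 jmps in one 32B chunk is the trigger


@dataclass
class Chunk:
    start: int            # start address of the 32B chunk
    jmp_count: int        # unconditional jmps that start in this chunk
    execution_count: int  # hypothetical profile counter for the chunk


def chunks_to_pad(chunks, hot_threshold):
    """Select chunks that are both jmp-dense and hot according to the profile."""
    return [c.start for c in chunks
            if c.jmp_count > MAX_JMPS_PER_CHUNK
            and c.execution_count >= hot_threshold]


# Made-up profile: a cold dense chunk, a hot dense chunk, a hot sparse chunk.
chunks = [
    Chunk(start=0, jmp_count=4, execution_count=10),
    Chunk(start=32, jmp_count=4, execution_count=1_000_000),
    Chunk(start=64, jmp_count=2, execution_count=1_000_000),
]
print(chunks_to_pad(chunks, hot_threshold=1000))  # [32]
```

Only the chunk at address 32 is selected, so the cold chunk at address 0 costs no extra bytes even though it has the same jmp density.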
As someone who has been writing hand optimized assembly for Intel x86 since the 80486 era (I slacked off for years and have only recently started reading the optimization manuals again), this was very interesting to me. Thanks for posting on the list.

Cheers,
Gordon Keiser
Software Development Engineer
Arxan Technologies

-----Original Message-----
From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Ansari, Zia
Sent: Wednesday, June 10, 2015 12:53 PM
To: llvmdev at cs.uiuc.edu
Subject: [LLVMdev] Bugzilla – Bug 5615

[...]