Andrea Di Biagio via llvm-dev
2020-May-10 16:59 UTC
[llvm-dev] [llvm-mca] Resource consumption of ProcResGroups
Hi Alex,

On Sun, May 10, 2020 at 4:00 PM Alex Renda <renda at csail.mit.edu> wrote:

> Thanks, that’s very helpful!
>
> Also, sorry for the miscue on that bug with the 2/4 cycles — I realize now
> that that’s an artifact of a change that I made to not crash when resource
> groups overlap without all atomic subunits being specified:
>
> `echo 'fxrstor (%rsp)' | llvm-mca -mtriple=x86_64-unknown-unknown
> -march=x86-64 -mcpu=haswell`
> crashes (because fxrstor requests
> `HWPort0,HWPort6,HWPort23,HWPort05,HWPort06,HWPort15,HWPort0156`, so
> HWPort0156 ends up asserting because 0, 1, 5, and 6 are all taken), so I
> added:
>
> ```
> --- a/llvm/lib/MCA/HardwareUnits/ResourceManager.cpp
> +++ b/llvm/lib/MCA/HardwareUnits/ResourceManager.cpp
> @@ -292,7 +292,7 @@ void ResourceManager::issueInstruction(
>      ResourceState &RS = *Resources[Index];
>
> -    if (!R.second.isReserved()) {
> +    if (!R.second.isReserved() && RS.isReady()) {
>        ResourceRef Pipe = selectPipe(R.first);
>        use(Pipe);
>        BusyResources[Pipe] += CS.size();
> ```
>
> which is probably the cause of that weird behavior I reported.
>
> I’m also somewhat curious about what “NumUnits” is modeling: I haven’t
> totally worked through the code yet, but it seems that when more (not
> necessarily atomic) sub-resources of a resource group are requested, more
> “NumUnits” of that group are requested. This doesn’t seem particularly
> intuitive to me, at least in my mental model of Haswell scheduling (and it
> also leads to some infinite loops, like `echo 'fldenv (%rsp)' | llvm-mca
> -mtriple=x86_64-unknown-unknown -march=x86-64 -mcpu=haswell`, which
> requests HWPort0, HWPort1, HWPort01, HWPort05, and HWPort015, meaning that
> HWPort015 never schedules because it requests 4 “NumUnits” but only 3 are
> ever available). Is there some particular behavior that this is modeling?

The issue with fxrstor is unfortunately a bug. I'll see if I can fix it.

Strictly speaking, that "NumUnits" quantity is literally the number of
processor resource units consumed. By construction, it should never exceed
the actual number of resource units declared by a processor resource in the
scheduling model (i.e. this particular field of MCProcResourceDesc -
https://llvm.org/doxygen/structllvm_1_1MCProcResourceDesc.html#a9d4d0cc34fcce4779dc4445d8265fffc ).
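To illustrate the invariant with a toy sketch (my own illustration, not the
actual llvm-mca code): a request for N units of a group can only ever become
ready once the group has at least N free units, so asking the 3-unit
HWPort015 for 4 units can never be satisfied, which is exactly the fldenv
infinite loop:

```
// Toy sketch (not the actual llvm-mca code) of why requesting 4 units of a
// 3-unit group like HWPort015 can never be satisfied: readiness requires at
// least NumUnits free units, and the group only declares 3.
#include <bitset>
#include <cassert>
#include <cstddef>

struct ToyResourceGroup {
  std::bitset<8> ReadyMask; // one bit per unit of the group that is free

  bool isReady(std::size_t NumUnits) const {
    return ReadyMask.count() >= NumUnits; // enough free units right now?
  }
};

int main() {
  ToyResourceGroup HWPort015{0b111}; // Port0, Port1, Port5 all free
  assert(HWPort015.isReady(3));      // satisfiable: all three units free
  assert(!HWPort015.isReady(4));     // never satisfiable: only 3 units exist
  return 0;
}
```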
The reason why you see the crash/assert is that the number of resource
units for scheduler resources is incorrectly set for that instruction by
the instruction builder in llvm-mca (lib/MCA/InstrBuilder.cpp - function
`initializeUsedResources()`). I am looking at it.

About what Andy suggested: I really like the idea of having a vector of
delay-cycles for consumed resources. It would fix most of these problems in
Haswell, and it could be used to improve the description of some
instructions in other processor models too. I agree that it should not be
difficult to implement (at least, the tablegen part should be pretty
mechanical); the resource allocation logic in class ResourceManager (in
llvm-mca) would require a bit of refactoring. Other than that, the rest
should be doable.

-Andrea

> -Alex
> On May 10, 2020, 9:32 AM -0400, Andrew Trick <atrick at apple.com>, wrote:
>
>> On May 9, 2020, at 5:12 PM, Andrea Di Biagio via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>
>> The llvm scheduling model is quite simple and doesn't allow mca to
>> accurately simulate the execution of individual uOPs. That limitation is
>> sort-of acceptable if you consider that the scheduling model framework
>> was originally designed with a different goal in mind (i.e. machine
>> scheduling). The lack of expressiveness of the llvm scheduling model
>> unfortunately limits the accuracy of llvm-mca: we know the number of uOPs
>> of an instruction, but we don't know which resources are consumed by
>> which micro-opcodes. So we cannot accurately simulate the independent
>> execution of the individual opcodes of an instruction.
>>
>> Another "problem" is that it is not possible to describe when uOPs
>> effectively start consuming resources. At the moment, the expectation is
>> that resource consumption always starts at relative cycle #0 (relative
>> to the instruction issue cycle).
>> Example: a horizontal add on x86 is usually decoded into a pair of
>> shuffle uOPs and a single (data-dependent) vector ADD uOP.
>> The ADD uOP doesn't execute immediately because it needs to wait for the
>> other two shuffle uOPs. It means that the ALU pipe is still available at
>> relative cycle #0 and is only consumed starting from relative cycle #1
>> (assuming that both shuffles can start execution at relative cycle #0).
>> In practice, the llvm scheduling model only allows us to declare which
>> pipeline resources are consumed, and for how long (in number of cycles).
>> So we cannot accurately describe to mca the delayed consumption of the
>> ALU pipe.
>> Now think about what happens if the first shuffle uOP consumes 1cy of
>> HWPort0, the second shuffle uOP consumes 1cy of HWPort1, and the ADD
>> consumes 1cy of HWPort01. We end up in that "odd" situation you
>> described where HWPort01 is "reserved" for 1cy.
>> In reality, that 1cy of HWPort01 should have started 1cy after the other
>> two opcodes. At that point, both pipelines would have been seen as
>> available.
>>
>> In conclusion, the presence of a "reserved" flag is not ideal, but it is
>> sort-of a consequence of the two limitations mentioned above (plus the
>> way the Haswell and Broadwell models were originally designed).
>>
>> I hope it helps,
>> -Andrea
>
> Food for thought...
>
> It would be easy to add a DelayCycles vector to SchedWriteRes to indicate
> the relative start cycle for each reserved resource. That would
> effectively model dependent uOps.
>
> NumMicroOps is only meant to model any general limitation of the cpu
> frontend to issue/rename/retire micro-ops. So, yes, there's no way to
> associate resources with specific uOps. You can mark any kind of resource
> as "dynamically scheduled" (BufferSize = -1). If an instruction uses
> different kinds of dynamic resources, then those need not be reserved at
> the same time. If we had the DelayCycles vector, it could be interpreted
> as "this resource must be reserved N cycles after prior reservations of
> other resources".
>
> -Andy
Andrea Di Biagio via llvm-dev
2020-May-10 18:56 UTC
[llvm-dev] [llvm-mca] Resource consumption of ProcResGroups
On Sun, May 10, 2020 at 5:59 PM Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:

> [...]
>
> The reason why you see the crash/assert is that the number of resource
> units for scheduler resources is incorrectly set for that instruction by
> the instruction builder in llvm-mca (lib/MCA/InstrBuilder.cpp - function
> `initializeUsedResources()`). I am looking at it.

Hi Alex,

This issue should be fixed by commit 47b95d7cf462.
I now understand why there were so many questions about the meaning of
reserved resources :-).
Could you please check if that commit fixes the issue for you too?

As Andy wrote, in the future we should really look into adding an optional
DelayCycles vector for SchedWriteRes. That would be the ideal improvement;
it would also allow us to get rid of the "reserved" bit.
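Roughly, it could take a shape like the sketch below (all names and fields
here are hypothetical, not existing LLVM API): each consumed-resource entry
would carry a relative start cycle, so HWPort01 in the horizontal-add
example would no longer need to be reserved at issue time:

```
// Hypothetical per-resource record: like today's (resource, cycles) pairs,
// plus a relative cycle at which consumption begins. Names are invented
// for illustration; this is not existing LLVM API.
#include <cstdint>
#include <vector>

struct ToyWriteProcResEntry {
  uint16_t ProcResourceIdx; // index into the model's resource table
  uint16_t Cycles;          // number of cycles the resource is consumed
  uint16_t DelayCycles;     // relative start cycle (today effectively 0)
};

// Horizontal add: two shuffles start on Port0/Port1 at relative cycle 0;
// the dependent ADD consumes HWPort01 starting at relative cycle 1, when
// both pipes would be seen as available again.
const std::vector<ToyWriteProcResEntry> HAddWrites = {
    {/*HWPort0*/ 1, /*Cycles=*/1, /*DelayCycles=*/0},
    {/*HWPort1*/ 2, /*Cycles=*/1, /*DelayCycles=*/0},
    {/*HWPort01*/ 3, /*Cycles=*/1, /*DelayCycles=*/1},
};
```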
Thanks,
-Andrea
Alex Renda via llvm-dev
2020-May-11 17:03 UTC
[llvm-dev] [llvm-mca] Resource consumption of ProcResGroups
Hi Andrea,

Yep, that commit fixes that issue for me too.

One last question, just to make 100% sure I understand the specification
that llvm-mca is solving, and its design constraints: {HWPort01: 3} should
_always_ mean (at the instruction specification level, even if not in the
llvm-mca implementation) that there are 3 cycles dispatched to either of
HWPort0 or HWPort1, regardless of whether HWPort0 or HWPort1 are
additionally specified, right? And so the decision to use the "reserved"
flag is one way of implementing that specification, and not necessarily a
reflection of any desired execution behavior beyond that?
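For instance, under that reading (a toy illustration of the specification,
not of what llvm-mca actually does), each of the 3 group cycles would
simply go to whichever of the two ports frees up first, even if one port is
also consumed by an explicit HWPort0 entry:

```
// Toy reading of "{HWPort01: 3}": three cycles, each dispatched to
// whichever of Port0/Port1 frees up first, independent of cycles the same
// instruction spends on Port0 or Port1 directly. Illustration only; this
// is not the llvm-mca implementation.
#include <array>
#include <cstdio>

int main() {
  // Cycle at which each port becomes free. Pretend Port0 is already held
  // for one cycle by an explicit {HWPort0: 1} entry of the same instruction.
  std::array<int, 2> FreeAt = {1, 0};

  for (int GroupCy = 0; GroupCy < 3; ++GroupCy) {
    int Port = (FreeAt[0] <= FreeAt[1]) ? 0 : 1; // pick the freer port
    std::printf("HWPort01 cycle %d -> Port%d at cycle %d\n",
                GroupCy, Port, FreeAt[Port]);
    ++FreeAt[Port];
  }
  return 0;
}
```

In this toy model the three group cycles would land on Port1 at cycle 0,
Port0 at cycle 1, and Port1 at cycle 1.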
Thanks!
-Alex

On May 10, 2020, 2:56 PM -0400, Andrea Di Biagio
<andrea.dibiagio at gmail.com>, wrote:

> This issue should be fixed by commit 47b95d7cf462.
> [...]
> As Andy wrote, in the future we should really look into adding an optional
> DelayCycles vector for SchedWriteRes. That would be the ideal improvement;
> it would also allow us to get rid of the "reserved" bit.