Tom Chen via llvm-dev
2019-Jun-07 11:42 UTC
[llvm-dev] [llvm-mca] What's the difference between Rthroughput and "total cycles" in llvm-mca
Hi Andrea,

So does this definition make sense for basic blocks with more than one instruction? E.g. how should one interpret a basic block with an RThroughput of 2.3?

On Fri, Jun 7, 2019 at 7:39 AM Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
> Hi Tom,
>
> Field 'Total Cycles' from the summary view simply reports the elapsed
> number of cycles for the entire simulation.
>
> RThroughput (from the "Instruction Info" view) is the reciprocal of the
> instruction throughput.
> Throughput is computed as the maximum number of instructions of the same
> type that can be executed per clock cycle in the absence of operand
> dependencies.
>
> Example (x86 - AMD Jaguar):
>   ADD EAX, ESI
>
> The integer unit in Jaguar has two ALU pipelines. An ADD instruction can
> issue to any of those pipelines. That means two independent ADDs can be
> issued during the same cycle. Therefore, throughput is 2 (instructions per
> cycle), and RThroughput (1/throughput) is 0.5.
>
> I hope it helps,
> -Andrea
>
> On Thu, Jun 6, 2019 at 10:11 PM Tom Chen via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> What is the difference between the two? I thought "Rthroughput" is
>> basically the number of cycles required to execute a single iteration at
>> steady state, but this does not seem to match with the schedule/timeline
>> generated by llvm-mca.
>> Thanks in advance,
>> Tom
Andrea Di Biagio via llvm-dev
2019-Jun-07 13:30 UTC
[llvm-dev] [llvm-mca] What's the difference between Rthroughput and "total cycles" in llvm-mca
In the absence of data dependencies, the throughput of a block of code is bounded above by the dispatch rate (i.e. our DispatchWidth) and by the availability of hardware resources.

DispatchWidth is the maximum number of micro-opcodes that can be dispatched to the out-of-order backend every cycle. That value inevitably affects the block throughput. Example: if an input block decodes to 4 micro-opcodes in total, and the processor can only dispatch up to 2 micro-opcodes per cycle, then the block throughput cannot exceed 0.5 (i.e. one block every two cycles).

Block throughput is also constrained by the availability of hardware resources.
Example: if we have 4 ADD micro-opcodes, and each micro-opcode consumes 1cy of ALU pipeline, then the block throughput is bounded above by N/4, where N is the number of ALU pipelines available on the target, and 4 is the number of ALU cycles consumed. So, if there is only 1 ALU pipeline, then the block throughput is limited to 1/4 = 0.25 (blocks per cycle).

Back to the computation of the "Block Throughput".
It is statically computed as the reciprocal of the block throughput. As for the normal instruction throughput, the computation doesn't take into account operand dependencies. Therefore, we could say that it is computed as the MAX of:
 - #MicroOpcodes of a block / DispatchWidth
 - #Consumed resource cycles / #Resources [ for every resource kind ].

In the absence of loop-carried dependencies between different iterations, the observed ‘uOps Per Cycle’ tends to a theoretical maximum throughput which can be computed by dividing the total number of uOps of a block by the Block RThroughput.

You can find more information about it in the llvm-mca docs under section "How LLVM-MCA works".

I hope it helps!
-Andrea
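The MAX-of-bounds rule described in this message can be tried out with a few lines of Python. This is only a sketch of the arithmetic as stated in the message, not llvm-mca's actual code: the function names block_rthroughput and max_uops_per_cycle, the dictionary layout, and the "ALU" resource label are all made up for illustration.

```python
from fractions import Fraction

def block_rthroughput(num_uops, dispatch_width, resource_cycles, resource_units):
    """Hypothetical helper: static reciprocal throughput of a block.

    resource_cycles maps a resource kind to the cycles the block consumes on it;
    resource_units maps the same kind to the number of units of that resource.
    Operand dependencies are ignored, as in the description above.
    """
    # Dispatch bound: #MicroOpcodes / DispatchWidth
    bounds = [Fraction(num_uops, dispatch_width)]
    # Resource bounds: #Consumed resource cycles / #Resources, per resource kind
    for kind, cycles in resource_cycles.items():
        bounds.append(Fraction(cycles, resource_units[kind]))
    return max(bounds)

def max_uops_per_cycle(num_uops, rthroughput):
    """Hypothetical helper: upper limit that 'uOps Per Cycle' tends to."""
    return Fraction(num_uops) / rthroughput

# The example from the message: 4 ADD micro-opcodes, 1cy of ALU each,
# DispatchWidth of 2, and a single ALU pipeline.
rt = block_rthroughput(4, 2, {"ALU": 4}, {"ALU": 1})
print(rt, 1 / rt)                 # 4 1/4
print(max_uops_per_cycle(4, rt))  # 1
```

With a single ALU pipeline the resource bound (4 cycles) dominates the dispatch bound (2 cycles), which matches the 0.25 blocks per cycle quoted above. Read the same way, a Block RThroughput of 2.3 would mean at best one iteration of the block every 2.3 cycles on average, provided no dependencies get in the way.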
Andrea Di Biagio via llvm-dev
2019-Jun-07 13:33 UTC
[llvm-dev] [llvm-mca] What's the difference between Rthroughput and "total cycles" in llvm-mca
On Fri, Jun 7, 2019 at 2:30 PM Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
> Back to the computation of the "Block Throughput".

Sorry, I should have written "Block RThroughput" here.