thr3ads.net - llvm dev - [llvm-dev] [RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops [Mar 2018]

Guillaume Chatelet via llvm-dev

2018-Mar-15 15:04 UTC

[llvm-dev] [RFC] llvm-exegesis: Automatic Measurement of Instruction Latency/Uops

[You can find an easier to read and more complete version of this RFC here
<https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>.]


Knowing instruction scheduling properties (latency, uops) is the basis for
all scheduling work done by LLVM.

Unfortunately, vendors usually release only partial (and sometimes
incorrect) information. Updating the information is painful and requires
careful guesswork and analysis. As a result, scheduling information is
incomplete for most X86 models (this bug
<https://bugs.llvm.org/show_bug.cgi?id=32325> tracks some of these
issues).

The goal of the tool presented here is to automatically (in)validate the
TableGen scheduling models. In the long run we envision automatic
generation of the models.

At Google, we have developed a tool that, given an instruction mnemonic,
uses the data in `MCInstrInfo` to generate a code snippet that makes
execution as serial (resp. as parallel) as possible so that we can measure
the latency (resp. uop decomposition) of the instruction. The code snippet
is jitted and executed on the host subtarget. The time taken (resp.
resource usage) is measured using hardware performance counters. More
details can be found in the ‘implementation’ section of the RFC.
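
To illustrate the serial versus parallel idea described above, here is a
minimal sketch in Python. It only builds assembly text; the function names,
the register choices and the operand order are assumptions made for this
illustration and are not taken from the llvm-exegesis sources.

```python
# Hypothetical sketch: it builds assembly text only and does not execute it.
# Names and register choices are illustrative, not llvm-exegesis internals.

def serial_snippet(mnemonic: str, reg: str, imm: int, repetitions: int) -> list[str]:
    """Every instance reads the register written by the previous one, so the
    instances form a dependency chain and run time is bound by latency."""
    return [f"{mnemonic} {reg}, {reg}, {imm}" for _ in range(repetitions)]


def parallel_snippet(mnemonic: str, src: str, dst_regs: list[str], imm: int,
                     repetitions: int) -> list[str]:
    """Instances read a register that none of them writes and cycle through
    several destinations, so no instance waits for the result of another."""
    return [
        f"{mnemonic} {dst_regs[i % len(dst_regs)]}, {src}, {imm}"
        for i in range(repetitions)
    ]


# Latency-style chain: each imul depends on the previous result in ax.
for line in serial_snippet("imul", "ax", 3, 3):
    print(line)

# Uops-style snippet: independent instances that can be issued in parallel.
for line in parallel_snippet("imul", "ax", ["bx", "cx", "dx"], 3, 3):
    print(line)
```

In the serial case, total cycles divided by the number of repetitions
approximates the latency. In the parallel case, per-port counters divided by
the number of repetitions approximate the uop decomposition.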

For people familiar with the work of Agner Fog, this is essentially an
automation of the process of building the code snippets using instruction
descriptions from LLVM.

Results

   - Solving this bug <https://bugs.llvm.org/show_bug.cgi?id=36084>
     (sandybridge):

     > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode latency
     ---
     asm_template:
       name:            latency IMUL16rri8
     cpu_name:        sandybridge
     llvm_triple:     x86_64-grtev4-linux-gnu
     num_repetitions: 10000
     measurements:
       - { key: latency, value: 4.0115, debug_string: '' }
     error:           ''
     ...

     > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode uops
     ---
     asm_template:
       name:            uops IMUL16rri8
     cpu_name:        sandybridge
     llvm_triple:     x86_64-grtev4-linux-gnu
     num_repetitions: 10000
     measurements:
       - { key: '2', value: 0.5232, debug_string: SBPort0 }
       - { key: '3', value: 1.0039, debug_string: SBPort1 }
       - { key: '4', value: 0.0024, debug_string: SBPort4 }
       - { key: '5', value: 0.3693, debug_string: SBPort5 }
     error:           ''
     ...

     Running both these commands took ~0.2 seconds including printing.

   - List of measured latencies
     <https://docs.google.com/spreadsheets/d/11_vFQRpiPHQ3zLcx8cVYYCqR5N5PCa4IvMyKHwF7Op4/edit?usp=sharing>
     for sandybridge, haswell and skylake processors, including diffs with
     LLVM latencies. Excerpt:

                 sandybridge              haswell                  skylake
     mnemonic    llvm-exegesis  TD file   llvm-exegesis  TD file   llvm-exegesis  TD file
     SHR32r1     1.01           1.00      1.00           1.00      1.01           1.00
     IMUL16rri   4.02           3.00      4.01           3.00      4.01           3.00

   - Some instructions have different implementations depending on which
     registers are assigned. This is well known for cases like `xor eax, eax`
     and `xor eax, ebx`: the former emits no uops (this happens during
     register renaming, see Agner Fog’s “Register Allocation and Renaming”
     in microarchitecture.pdf
     <http://www.agner.org/optimize/microarchitecture.pdf>). But we found out
     that this can go further. For example, SHLD64rri8 takes one cycle and
     runs on P06 in the `shld rax, rax, 0x1` case, but takes 3 cycles and
     runs on P1 in the `shld rbx, rax, 0x1` case. To the best of our
     knowledge, this has not yet been described.
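
As a rough illustration of how the numbers in this section can be read, the
following Python sketch sums the per-port values of the uops run and flags
the latency entries that disagree with the TD file. The data is copied from
the output and the excerpt above. The helper names, the regular expression
and the 0.25 cycle tolerance are assumptions made for this sketch and are
not part of the tool.

```python
import re

# Measurement lines copied from the uops run shown above.
UOPS_OUTPUT = """\
measurements:
  - { key: '2', value: 0.5232, debug_string: SBPort0 }
  - { key: '3', value: 1.0039, debug_string: SBPort1 }
  - { key: '4', value: 0.0024, debug_string: SBPort4 }
  - { key: '5', value: 0.3693, debug_string: SBPort5 }
"""

# (mnemonic, cpu) -> (latency measured by llvm-exegesis, latency in the TD file),
# copied from the excerpt above.
LATENCIES = {
    ("SHR32r1", "sandybridge"): (1.01, 1.00),
    ("SHR32r1", "haswell"): (1.00, 1.00),
    ("SHR32r1", "skylake"): (1.01, 1.00),
    ("IMUL16rri", "sandybridge"): (4.02, 3.00),
    ("IMUL16rri", "haswell"): (4.01, 3.00),
    ("IMUL16rri", "skylake"): (4.01, 3.00),
}

# Matches one measurement entry that names a port in debug_string.
MEASUREMENT = re.compile(r"value:\s*([0-9.]+),\s*debug_string:\s*(\w+)")


def port_pressure(text: str) -> dict[str, float]:
    """Map each port name to the measured uops per instruction on that port."""
    return {port: float(value) for value, port in MEASUREMENT.findall(text)}


def latency_mismatches(rows, tolerance=0.25):
    """Return the (mnemonic, cpu) pairs whose measured latency differs from the
    TD file by more than `tolerance` cycles (an arbitrary threshold)."""
    return sorted(
        key for key, (measured, modeled) in rows.items()
        if abs(measured - modeled) > tolerance
    )


pressure = port_pressure(UOPS_OUTPUT)
print(f"uops per instruction over all ports: {sum(pressure.values()):.4f}")
for mnemonic, cpu in latency_mismatches(LATENCIES):
    measured, modeled = LATENCIES[(mnemonic, cpu)]
    print(f"{mnemonic} on {cpu}: measured {measured}, TD file {modeled}")
```

With the sample data, the per-port values add up to about 1.9, which is
consistent with roughly two uops per instruction, and only the IMUL16rri
rows are flagged.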

Future Work

   - [easy] Fix Intel Scheduling Models.
   - [easy] Extend to memory operands.
   - [easy] Make the tool work reliably for x87 instructions.
   - [medium] A tool that automatically creates patches to TD files.
   - [medium] Measure the effect of immediate/register values: some
     instructions have performance characteristics that depend on the values
     they operate on. We should explore the value space (0, 1, ~1,
     2^{8,16,32,64}, inf, nan, denorm...).
   - [medium] Measure the effect of changing registers on instruction
     implementation (see results section
     <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#bookmark=kix.q6a0imw9qn1n>
     above). Model this in LLVM TD schema.
   - [hard] Make the tool work for instructions that have side effects (e.g.
     PUSH/POP, JMP, ...). This might involve extending the TD schema with
     information on how to set up measurements for specific instructions.
   - [??] Make the tool work for other CPUs. This mainly depends on the
     presence of performance counters.
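
The value space mentioned in the list above (0, 1, ~1, powers of two, inf,
nan, denorm) could be enumerated along the following lines. This is only a
sketch: the function names are hypothetical, and keeping only the powers of
two that fit in the operand width is an assumption of this example.

```python
import math
import sys


def integer_values(width: int) -> list[int]:
    """Candidate immediate/register values for an operand of `width` bits."""
    mask = (1 << width) - 1
    values = [0, 1, ~1 & mask]  # ~1 has every bit set except the lowest one
    # Assumption: only the powers of two representable in `width` bits are kept.
    values += [1 << p for p in (8, 16, 32, 64) if p < width]
    return values


def float_values() -> list[float]:
    """Candidate double-precision values, including the special cases."""
    # Smallest positive subnormal (denormal) double, i.e. 2**-1074.
    smallest_denormal = sys.float_info.min * sys.float_info.epsilon
    return [0.0, 1.0, math.inf, math.nan, smallest_denormal]


for width in (16, 32, 64):
    print(width, [hex(v) for v in integer_values(width)])
print(float_values())
```

Each candidate value would then be loaded into the operands of the snippet
before measuring, so that value-dependent latencies show up as differences
between runs.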

Open Questions

We depend on libpfm <http://perfmon2.sourceforge.net/docs_v4.html>. How do
we handle the dependency?

--
Guillaume Chatelet (gchatelet at google.com), Clement Courbet
(courbet at google.com) for the Google Compiler Research Team

Guillaume Chatelet via llvm-dev

2018-Mar-15 15:30 UTC

Patch for this RFC is available at https://reviews.llvm.org/D44519.


Hal Finkel via llvm-dev

2018-Mar-15 15:41 UTC

On 03/15/2018 10:04 AM, Guillaume Chatelet via llvm-dev
wrote:
>
>  * Some instructions have different implementations depending on which
>    registers are assigned. This is well known for cases like `xor
>    eax, eax` and `xor eax, ebx`, which emits no uops in the first case
>    (this happens during register renaming, see Agner Fog’s “Register
>    Allocation and Renaming”, in microarchitecture.pdf
>    <http://www.agner.org/optimize/microarchitecture.pdf>). But we
>    found out that this can go further. For example, SHLD64rri8 takes
>    one cycle and runs on P06 in the `shld rax, rax, 0x1` case, but
>    takes 3 cycles and runs on P1 in the `shld rbx, rax, 0x1` case. To
>    the best of our knowledge, this has not yet been described.
>
This is great!
>
>
>   Open Questions
>
> We depend on libpfm <http://perfmon2.sourceforge.net/docs_v4.html>.
> How do we handle the dependency ?
Are there options that you have in mind? It's an external MIT-licensed
dependency. Wouldn't CMake just detect it when it's available?

 -Hal
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Clement Courbet via llvm-dev

2018-Mar-15 15:49 UTC

On Thu, Mar 15, 2018 at 4:41 PM, Hal Finkel via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Are there options that you have in mind? It's an external MIT-licensed
> dependency. Wouldn't CMake just detect it when it's available?
>
That's what we've done for now (see code here
<https://reviews.llvm.org/differential/changeset/?ref=1002469&whitespace=ignore-most>).
We're not sure what the policy is wrt external deps. Right now, if the tool
is enabled and libpfm is not on the system, we die with an error message.
The other option would be to disable the tool in that case (I'm not sure
how to do that). Opinions?


Philip Reames via llvm-dev

2018-Mar-15 16:30 UTC

Sounds like a very useful tool.  Thank you for contributing.

Taking a step back and looking at the big picture, combining this with 
the recently contributed llvm-mca dramatically improves our scheduling 
and performance analysis story.  Being able to take a snippet of code on 
a particular machine, measure latency/throughput/ports for each 
instruction (this tool), and then analyze the entire code sequence in an 
actionable way using the measured information (llvm-mca), leads to a 
very powerful performance analysis workflow.



Clement Courbet via llvm-dev

2018-Mar-16 08:10 UTC

Thanks Philip,

This project is part of a global effort to better understand and simulate
the performance of our code, so it's not a coincidence that it
complements llvm-mca. As a matter of fact, when llvm-mca was announced we
were about to release a similar tool, and we're now working with Andrea et
al. to integrate our work into llvm-mca. We're happy that many people seem
to share the same goals and vision.

BTW, we'll be presenting two lightning talks about this work at EuroLLVM;
happy to discuss if you happen to attend.


On Thu, Mar 15, 2018 at 5:30 PM, Philip Reames via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Sounds like a very useful tool.  Thank you for contributing.
>
> Taking a step back and looking at the big picture, combining this with the
> recently contributed llvm-mca dramatically improves our scheduling and
> performance analysis story.  Being able to take a snippet of code on a
> particular machine, measure latency/throughput/ports for each instruction
> (this tool), and then analyze the entire code sequence in an actionable way
> using the measured information (llvm-mca), leads to a very powerful
> performance analysis workflow.
>
> On 03/15/2018 08:04 AM, Guillaume Chatelet via llvm-dev wrote:
>
> [You can find an easier to read and more complete version of this RFC here
> <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#>.]
>
> Knowing instruction scheduling properties (latency, uops) is the basis for
> all scheduling work done by LLVM.
>
> Unfortunately, vendors usually release only partial (and sometimes
> incorrect) information.  Updating the information is painful and requires
> careful guesswork and analysis. As a result, scheduling information is
> incomplete for most X86 models (this bug
> <https://bugs.llvm.org/show_bug.cgi?id=32325> tracks some of these
> issues). The goal of the tool presented here is to automatically
> (in)validate the TableGen scheduling models. In the long run we envision
> automatic generation of the models.
>
> At Google, we have developed a tool that, given an instruction mnemonic,
> uses the data in `MCInstrInfo` to generate a code snippet that makes
> execution as serial (resp. as parallel) as possible so that we can measure
> the latency (resp. uop decomposition) of the instruction. The code snippet
> is jitted and executed on the host subtarget. The time taken (resp.
> resource usage) is measured using hardware performance counters. More
> details can be found in the ‘implementation’ section of the RFC.
>
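To illustrate the serial versus parallel idea described above, here is a minimal sketch that only builds assembly text. It is not how llvm-exegesis is implemented: the helper names, the choice of `imul` with 64-bit registers, and the register lists are all assumptions made for illustration, and real snippet generation must also handle implicit operands, flags and partial-register dependencies.

```python
def serial_snippet(mnemonic, reg, imm, repetitions):
    """Latency-style snippet: every instruction reads the register that the
    previous one wrote, so execution forms a single dependency chain."""
    return [f"{mnemonic} {reg}, {reg}, {imm}"] * repetitions


def parallel_snippet(mnemonic, dst_regs, src_reg, imm, repetitions):
    """Uops-style snippet: destinations rotate and the source is never
    written, so no instruction waits on the result of another."""
    if src_reg in dst_regs:
        raise ValueError("the source register must not be overwritten")
    return [
        f"{mnemonic} {dst_regs[i % len(dst_regs)]}, {src_reg}, {imm}"
        for i in range(repetitions)
    ]


# Hypothetical operands, chosen only to show the shape of each snippet.
serial = serial_snippet("imul", "rax", "0x1", 4)
parallel = parallel_snippet("imul", ["rcx", "rdx", "rsi"], "rbx", "0x1", 4)

print("\n".join(serial))
print()
print("\n".join(parallel))
```

In the serial form the measured time per instruction approximates its latency; in the parallel form the per-port counters approximate the uop decomposition.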
> For people familiar with the work of Agner Fog, this is essentially an
> automation of the process of building the code snippets using instruction
> descriptions from LLVM.
> Results
>
>    - Solving this bug <https://bugs.llvm.org/show_bug.cgi?id=36084>
>      (sandybridge):
>
> > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode latency
> ---
> asm_template:
>  name:            latency IMUL16rri8
> cpu_name:        sandybridge
> llvm_triple:     x86_64-grtev4-linux-gnu
> num_repetitions: 10000
> measurements:
>  - { key: latency, value: 4.0115, debug_string: '' }
> error:           ''
> ...
>
> > llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode uops
> ---
> asm_template:
>  name:            uops IMUL16rri8
> cpu_name:        sandybridge
> llvm_triple:     x86_64-grtev4-linux-gnu
> num_repetitions: 10000
> measurements:
>  - { key: '2', value: 0.5232, debug_string: SBPort0 }
>  - { key: '3', value: 1.0039, debug_string: SBPort1 }
>  - { key: '4', value: 0.0024, debug_string: SBPort4 }
>  - { key: '5', value: 0.3693, debug_string: SBPort5 }
> error:           ''
> ...
>
> Running both these commands took ~0.2 seconds including printing.
>
>
>
>    - List of measured latencies
>      <https://docs.google.com/spreadsheets/d/11_vFQRpiPHQ3zLcx8cVYYCqR5N5PCa4IvMyKHwF7Op4/edit?usp=sharing>
>      for sandybridge, haswell and skylake processors including diffs with
>      LLVM latencies. Excerpt:
>
>
>
>               sandybridge             haswell                 skylake
> mnemonic      llvm-exegesis  TD file  llvm-exegesis  TD file  llvm-exegesis  TD file
> SHR32r1       1.01           1.00     1.00           1.00     1.01           1.00
> IMUL16rri     4.02           3.00     4.01           3.00     4.01           3.00
>
>
>    - Some instructions have different implementations depending on which
>      registers are assigned. This is well known for cases like `xor eax,
>      eax` versus `xor eax, ebx`: the first emits no uops (this happens
>      during register renaming, see Agner Fog’s “Register Allocation and
>      Renaming”, in microarchitecture.pdf
>      <http://www.agner.org/optimize/microarchitecture.pdf>). But we found
>      out that this can go further. For example, SHLD64rri8 takes one cycle
>      and runs on P06 in the `shld rax, rax, 0x1` case, but takes 3 cycles
>      and runs on P1 in the `shld rbx, rax, 0x1` case. To the best of our
>      knowledge, this has not yet been described.
>
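The comparison in the latency excerpt above can be sketched in a few lines. The numbers are copied from the excerpt; the function name and the 0.25-cycle tolerance are assumptions for illustration, not anything the tool itself defines.

```python
# (mnemonic, cpu) -> (latency measured by llvm-exegesis, latency in the TD file),
# copied from the excerpt in the RFC.
LATENCIES = {
    ("SHR32r1", "sandybridge"): (1.01, 1.00),
    ("SHR32r1", "haswell"): (1.00, 1.00),
    ("SHR32r1", "skylake"): (1.01, 1.00),
    ("IMUL16rri", "sandybridge"): (4.02, 3.00),
    ("IMUL16rri", "haswell"): (4.01, 3.00),
    ("IMUL16rri", "skylake"): (4.01, 3.00),
}


def mismatches(latencies, tolerance=0.25):
    """Return the entries whose measured latency differs from the TD value
    by more than `tolerance` cycles (tolerance absorbs measurement noise)."""
    return sorted(
        (mnemonic, cpu, measured, td)
        for (mnemonic, cpu), (measured, td) in latencies.items()
        if abs(measured - td) > tolerance
    )


for mnemonic, cpu, measured, td in mismatches(LATENCIES):
    print(f"{mnemonic:10} {cpu:12} measured={measured:.2f} td={td:.2f}")
```

With this data only the IMUL16rri rows are flagged, on all three processors, which matches the bug referenced above.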
> Future Work
>
>    - [easy] Fix Intel Scheduling Models.
>    - [easy] Extend to memory operands.
>    - [easy] Make the tool work reliably for x87 instructions.
>    - [medium] A tool that automatically creates patches to TD files.
>    - [medium] Measure the effect of immediate/register values: some
>      instructions have performance characteristics that depend on the
>      values they operate on. We should explore the value space (0, 1, ~1,
>      2^{8,16,32,64}, inf, nan, denorm...).
>    - [medium] Measure the effect of changing registers on instruction
>      implementation (see results section
>      <https://docs.google.com/document/d/1QidaJMJUyQdRrFKD66vE1_N55whe0coQ3h1GpFzz27M/edit?ts=5aaa84ee#bookmark=kix.q6a0imw9qn1n>
>      above). Model this in LLVM TD schema.
>    - [hard] Make the tool work for instructions that have side effects (e.g.
>      PUSH/POP, JMP, ...). This might involve extending the TD schema with
>      information on how to set up measurements for specific instructions.
>    - [??] Make the tool work for other CPUs. This mainly depends on the
>      presence of performance counters.
>
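As a rough sketch of the value space mentioned in the list above (0, 1, ~1, powers of two, inf, nan, denorm), the candidates could be enumerated as follows. The function names are hypothetical, and dropping powers of two that do not fit in the operand width is an assumption of this sketch, not something the RFC specifies.

```python
import math


def integer_candidates(width):
    """Integer operand values for a register or immediate of `width` bits."""
    mask = (1 << width) - 1
    values = [0, 1, ~1 & mask]  # ~1 is all ones except the lowest bit
    # Keep only the powers of two that are representable in `width` bits.
    values += [1 << k for k in (8, 16, 32, 64) if k < width]
    return values


def float_candidates():
    """Double-precision operand values, including the special cases."""
    denorm = 5e-324  # smallest positive subnormal double
    return [0.0, 1.0, math.inf, math.nan, denorm]


print([hex(v) for v in integer_candidates(16)])
print(float_candidates())
```

Each candidate would then be used as an operand value when generating the benchmark snippet, one measurement per value.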
> Open Questions
>
> We depend on libpfm <http://perfmon2.sourceforge.net/docs_v4.html>. How do
> we handle the dependency?
> --
> Guillaume Chatelet (gchatelet at google.com), Clement Courbet
> (courbet at google.com) for the Google Compiler Research Team
