Paparo Bivas, Omer via llvm-dev
2016-Nov-17 09:55 UTC
[llvm-dev] RFC: Insertion of nops for performance stability
Hi all,

These days I am working on a feature that inserts nops for IA code generation in order to provide performance improvements and performance stability. This feature will not affect other architectures. It will, however, set up an infrastructure for other architectures to do the same, if ever needed.

Here are some examples of cases in which nops can improve performance:

1. DSB (Decoded Stream Buffer) thrashing. The DSB is a cache for uops that have already been decoded. It is limited to 3 ways per 32-byte window; 4 or more ways will cause a performance degradation. This can be avoided by aligning with nops. See Zia Ansari's presentation for more information: http://schd.ws/hosted_files/llvmdevelopersmeetingbay2016/d9/LLVM-Conference-US-2016-Code-Alignment.pdf

2. Two or more branches with the same target in a single instruction window may cause poor branch prediction. Our recommendation to programmers is to keep two branches out of the same 16-byte chunk if they both go to the same target. From the manual: "Avoid putting two conditional branch instructions in a loop so that both have the same branch target address and, at the same time, belong to (i.e. have their last bytes' addresses within) the same 16-byte aligned code block."

Let's look at an example of the second case. Consider the following code:

int y;
int x;

int foo(int i, int m, int p, int q, int *p1, int *p2)
{
  if (i+*p1 == p || i-*p2 > m || i == q) {
    y++;
    x++;
  }

  return 0;
}

The last two || clauses of the if will be translated into two conditional jumps to the same target. The generated code will look like so:

foo:
   0: 48 8b 44 24 28        movq  40(%rsp), %rax
   5: 8b 00                 movl  (%rax), %eax
   7: 01 c8                 addl  %ecx, %eax
   9: 44 39 c0              cmpl  %r8d, %eax
   c: 75 0f                 jne   15 <foo+0x1D>
   e: ff 05 00 00 00 00     incl  (%rip)
  14: ff 05 00 00 00 00     incl  (%rip)
  1a: 31 c0                 xorl  %eax, %eax
  1c: c3                    retq
  1d: 44 39 c9              cmpl  %r9d, %ecx
  20: 74 ec                 je    -20 <foo+0xE>
  22: 48 8b 44 24 30        movq  48(%rsp), %rax
  27: 2b 08                 subl  (%rax), %ecx
  29: 39 d1                 cmpl  %edx, %ecx
  2b: 7f e1                 jg    -31 <foo+0xE>
  2d: 31 c0                 xorl  %eax, %eax
  2f: c3                    retq

Note: the first if clause jump is the jne instruction at 0x0C, the second if clause jump is the jg instruction at 0x2B, and the third if clause jump is the je instruction at 0x20. Also note that the jg and je share a 16-byte window, which is exactly the situation we wish to avoid (consider the case in which foo is called from inside a loop; this will cause a performance penalty).

The generated code after my changes will look like so:

foo:
   0: 48 8b 44 24 28        movq  40(%rsp), %rax
   5: 8b 00                 movl  (%rax), %eax
   7: 01 c8                 addl  %ecx, %eax
   9: 44 39 c0              cmpl  %r8d, %eax
   c: 75 0f                 jne   15 <foo+0x1D>
   e: ff 05 00 00 00 00     incl  (%rip)
  14: ff 05 00 00 00 00     incl  (%rip)
  1a: 31 c0                 xorl  %eax, %eax
  1c: c3                    retq
  1d: 44 39 c9              cmpl  %r9d, %ecx
  20: 74 ec                 je    -20 <foo+0xE>
  22: 48 8b 44 24 30        movq  48(%rsp), %rax
  27: 2b 08                 subl  (%rax), %ecx
  29: 39 d1                 cmpl  %edx, %ecx
  2b: 0f 1f 40 00           nopl  (%rax)
  2f: 7f dd                 jg    -35 <foo+0xE>
  31: 31 c0                 xorl  %eax, %eax
  33: c3                    retq

The only change is the nopl at 0x2B, before the jg, which pushes the jg instruction to the next 16-byte window and thus avoids the undesired situation. Note that, in this case, in order to push an instruction to the next window it is sufficient to push its last byte into the next window.

Due to the nature of these performance nops, which often rely on layout properties of the code (e.g. the number of instructions in a 16/32-byte window) that are only known at assembly time, the insertion is divided into two phases:
1. Marking "suspicious" places during encoding, i.e. hotspots that may cause a performance penalty.
2. Computing the actual number (zero or greater) of nops required in order to avoid the penalty.

In order to achieve that, I am introducing a new kind of MCFragment named MCPerfNopFragment. It will be the connecting link between the two phases:
1. During encoding, the object streamer (MCObjectStreamer) will query the backend for the need of an MCPerfNopFragment before every encoded instruction. If required, an MCPerfNopFragment will be created and inserted into the MCSection.
2. During the assembly phase:
   a. When computing the fragment size in MCAssembler::computeFragmentSize, the assembler will consult the backend and provide it the layout in order to determine the required size of the fragment. Note that the computed size may change from call to call, similarly to the process of computing the size of an MCAlignFragment (a rough sketch of this computation for the two-branch example above follows this list).
   b. When writing the fragment, the assembler will again use the backend in order to write the required number of nops.
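To make that concrete for the two-branch example above, here is a minimal sketch of the window computation such a backend hook could perform. It assumes only the 16-byte rule quoted above and the offsets from the listing; the helper name and its parameters are hypothetical and are not part of any existing LLVM interface.

// Sketch: number of pad bytes needed so that an upcoming conditional branch's
// last byte no longer shares a 16-byte aligned block with the last byte of an
// earlier conditional branch to the same target. All offsets are section
// offsets that are already fixed, i.e. they come from fragments laid out
// earlier.
#include <cstdint>

static uint64_t perfNopPaddingForSameTargetBranch(uint64_t PrevBranchLastByte,
                                                  uint64_t BranchOffset,
                                                  uint64_t BranchSize) {
  const uint64_t Window = 16;
  uint64_t BranchLastByte = BranchOffset + BranchSize - 1;
  if (PrevBranchLastByte / Window != BranchLastByte / Window)
    return 0; // The last bytes already fall in different 16-byte blocks.
  // Pad just enough to move the upcoming branch's last byte into the next block.
  return (BranchLastByte / Window + 1) * Window - BranchLastByte;
}

For the listing above, the je's last byte is at 0x21 and the jg would occupy 0x2B-0x2C, so the helper returns 0x30 - 0x2C = 4, matching the 4-byte nopl that was inserted.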
Comments and questions are welcome.

Thanks,
Omer
Stephen Checkoway via llvm-dev
2016-Nov-17 16:50 UTC
[llvm-dev] RFC: Insertion of nops for performance stability
Hi Omer,

> On Nov 17, 2016, at 03:55, Paparo Bivas, Omer via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> The last two || clauses of the if will be translated into two conditional jumps to the same target. The generated code will look like so:
>
> foo:
>    0: 48 8b 44 24 28        movq  40(%rsp), %rax
>    5: 8b 00                 movl  (%rax), %eax
>    7: 01 c8                 addl  %ecx, %eax
>    9: 44 39 c0              cmpl  %r8d, %eax
>    c: 75 0f                 jne   15 <foo+0x1D>
>    e: ff 05 00 00 00 00     incl  (%rip)
>   14: ff 05 00 00 00 00     incl  (%rip)
>   1a: 31 c0                 xorl  %eax, %eax
>   1c: c3                    retq
>   1d: 44 39 c9              cmpl  %r9d, %ecx
>   20: 74 ec                 je    -20 <foo+0xE>
>   22: 48 8b 44 24 30        movq  48(%rsp), %rax
>   27: 2b 08                 subl  (%rax), %ecx
>   29: 39 d1                 cmpl  %edx, %ecx
>   2b: 7f e1                 jg    -31 <foo+0xE>
>   2d: 31 c0                 xorl  %eax, %eax
>   2f: c3                    retq
>
> Note: the first if clause jump is the jne instruction at 0x0C, the second if clause jump is the jg instruction at 0x2B, and the third if clause jump is the je instruction at 0x20. Also note that the jg and je share a 16-byte window, which is exactly the situation we wish to avoid (consider the case in which foo is called from inside a loop; this will cause a performance penalty).

Rather than inserting a nop, would it be better to change the instruction encoding to use a different form? The JE at offset 0x20 could use the JE rel32 (0f 84 0f 00 00 00) form. Similarly, the MOV at offset 0x22 could use MOV r64, r/m64 with a 32-bit offset (48 8b 84 24 30 00 00 00). The latter adds 3 bytes, which is insufficient in this case, but the former adds the required 4 bytes.

I have no idea if it's better to insert extra instructions rather than increase the length of existing ones, but my intuition is that it's better to decode and retire fewer instructions. I'd assume the same is true when trying to align basic blocks.

--
Stephen Checkoway
Hal Finkel via llvm-dev
2016-Nov-18 18:57 UTC
[llvm-dev] RFC: Insertion of nops for performance stability
----- Original Message -----
> From: "Paparo Bivas, Omer via llvm-dev" <llvm-dev at lists.llvm.org>
> To: llvm-dev at lists.llvm.org
> Sent: Thursday, November 17, 2016 3:55:43 AM
> Subject: [llvm-dev] RFC: Insertion of nops for performance stability
>
> Hi all,
>
> These days I am working on a feature that inserts nops for IA code generation in order to provide performance improvements and performance stability.
>
> [...]
>
> Due to the nature of these performance nops, which often rely on layout properties of the code (e.g. the number of instructions in a 16/32-byte window) that are only known at assembly time, the insertion is divided into two phases:
> 1. Marking "suspicious" places during encoding, i.e. hotspots that may cause a performance penalty.
> 2. Computing the actual number (zero or greater) of nops required in order to avoid the penalty.
>
> In order to achieve that, I am introducing a new kind of MCFragment named MCPerfNopFragment. It will be the connecting link between the two phases:
> 1. During encoding, the object streamer (MCObjectStreamer) will query the backend for the need of an MCPerfNopFragment before every encoded instruction. If required, an MCPerfNopFragment will be created and inserted into the MCSection.
> 2. During the assembly phase:
>    a. When computing the fragment size in MCAssembler::computeFragmentSize, the assembler will consult the backend and provide it the layout in order to determine the required size of the fragment. Note that the computed size may change from call to call, similarly to the process of computing the size of an MCAlignFragment.
>    b. When writing the fragment, the assembler will again use the backend in order to write the required number of nops.
>
> Comments and questions are welcome.

Do you need to insert the nops at such a late stage? We have other components that insert nops during post-RA scheduling, for example. Even later, nops could be inserted during a pre-emit pass (similar to where other backends do branch relaxation/selection). The branch relaxation process also relies on knowing the layout of the code, but gathers this information from the available information on instruction sizes, block alignments, etc.
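For concreteness, a rough sketch of the kind of pre-emit pass being suggested here. Everything in it is hypothetical (no such pass or helper exists in-tree), it assumes the target implements the instruction-size hook, and on X86 the sizes available at this stage are only estimates, since the final encodings are chosen later:

#include <cstdint>
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineOperand.h"
#include "llvm/Target/TargetInstrInfo.h"
#include "llvm/Target/TargetSubtargetInfo.h"

using namespace llvm;

namespace {
// Hypothetical pre-emit pass: walk each function with running byte offsets
// estimated from the target's instruction-size hook, and flag pairs of
// same-target conditional branches whose last bytes would land in the same
// 16-byte window. Block alignment padding is ignored here for brevity.
class PerfNopInsertionSketch : public MachineFunctionPass {
public:
  static char ID;
  PerfNopInsertionSketch() : MachineFunctionPass(ID) {}

  bool runOnMachineFunction(MachineFunction &MF) override {
    const TargetInstrInfo &TII = *MF.getSubtarget().getInstrInfo();
    bool Changed = false;
    uint64_t Offset = 0;                          // estimated offset within the function
    const MachineBasicBlock *PrevTarget = nullptr; // target of the previous cond. branch
    uint64_t PrevLastByte = 0;
    for (MachineBasicBlock &MBB : MF)
      for (MachineInstr &MI : MBB) {
        uint64_t Size = TII.getInstSizeInBytes(MI); // an estimate only, on X86
        if (MI.isConditionalBranch()) {
          const MachineBasicBlock *Target = branchTarget(MI);
          uint64_t LastByte = Offset + Size - 1;
          if (Target && Target == PrevTarget && PrevLastByte / 16 == LastByte / 16) {
            // A real pass would insert the target-specific nop(s) before MI
            // here and account for their size in Offset.
            Changed = true;
          }
          PrevTarget = Target;
          PrevLastByte = LastByte;
        }
        Offset += Size;
      }
    return Changed;
  }

private:
  // First basic-block operand of a branch, i.e. its target block.
  static const MachineBasicBlock *branchTarget(const MachineInstr &MI) {
    for (const MachineOperand &MO : MI.operands())
      if (MO.isMBB())
        return MO.getMBB();
    return nullptr;
  }
};
} // end anonymous namespace

char PerfNopInsertionSketch::ID = 0;

The trade-off relative to the MC-level proposal is visible here: the pass runs earlier and fits the existing pre-emit infrastructure, but it works from size estimates rather than from the final layout.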
-Hal

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
Paparo Bivas, Omer via llvm-dev
2016-Nov-20 13:54 UTC
[llvm-dev] RFC: Insertion of nops for performance stability
Hi Stephen,

I am actually not sure myself whether it is better to insert extra instructions or to increase the length of existing ones. In terms of the current design, at least, widening existing instructions can complicate things.

In my solution the insertion of the nops integrates with the layout process in the assembler, in which every MCFragment is processed in turn and its layout is determined. When it is time for an MCFragment to be laid out, we can rely on the layout of the MCFragments before it, but we cannot rely on the layout of the MCFragments that follow it, since they have not been laid out yet and their layout is therefore not yet known. For an MCFragment of the new kind, MCPerfNopFragment, "laying out" means computing the number of nops it will contain. Luckily, all the cases that need nops require only earlier instruction data, which has already been laid out, and no later instruction data (keep in mind that an MCPerfNopFragment will always appear right before a "potentially hazardous" instruction and will be aware of that instruction).

In order to change instructions into versions of themselves that take up more bytes, we would have to do one of two things:
1. Rely on later instruction data when laying out the fragments that contain those instructions. As mentioned above, we cannot do this accurately, since the layout has not yet been determined for the fragments that follow.
2. Retroactively change the layout of the fragments that contain those instructions once we encounter an MCPerfNopFragment. This can cause complications, as it requires changing the layout while a layout is already being performed. Some fragments between the MCPerfNopFragment currently being computed and the MCFragment whose layout we wish to change may rely on the already-determined layout; MCAlignFragments, for example, strongly rely on the layout of their predecessors. The entire fragment layout process would then become very complex.

Please tell me if you have any thoughts on how to get around these issues.
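To make the ordering constraint concrete, here is a minimal sketch of what the size computation for an MCPerfNopFragment could look like when MCAssembler::computeFragmentSize hands it off to the backend. MCAsmLayout and MCFragment are existing MC classes; everything else (the hook itself, its parameters, and how the earlier branch is remembered) is hypothetical and only illustrates that nothing after the fragment is consulted:

#include <cstdint>
#include "llvm/MC/MCAsmLayout.h"
#include "llvm/MC/MCFragment.h"

using namespace llvm;

// Hypothetical backend hook for sizing an MCPerfNopFragment that sits
// immediately before a conditional branch. PrevBranchFrag and
// PrevBranchLastByteInFrag identify the last byte of the earlier branch to the
// same target (recorded while encoding, phase 1). FragOffset is the offset of
// the MCPerfNopFragment itself, obtained the same way MCAlignFragment sizing
// obtains it (it depends only on fragments that precede it). BranchSize is the
// size of the branch that follows the fragment, known at encoding time.
static uint64_t computePerfNopFragmentSize(const MCAsmLayout &Layout,
                                           const MCFragment &PrevBranchFrag,
                                           uint64_t PrevBranchLastByteInFrag,
                                           uint64_t FragOffset,
                                           uint64_t BranchSize) {
  // Valid because PrevBranchFrag precedes the MCPerfNopFragment and has
  // therefore already been laid out.
  uint64_t PrevLastByte =
      Layout.getFragmentOffset(&PrevBranchFrag) + PrevBranchLastByteInFrag;
  // Last byte of the upcoming branch if the fragment stays empty.
  uint64_t CurLastByte = FragOffset + BranchSize - 1;
  if (PrevLastByte / 16 != CurLastByte / 16)
    return 0;                                       // already in different 16-byte windows
  return (CurLastByte / 16 + 1) * 16 - CurLastByte; // nop bytes needed
}

Note that only data that is already fixed at this point is used, which is why inserting bytes here does not force revisiting earlier fragments, whereas widening an earlier instruction would.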
Thanks,
Omer

-----Original Message-----
From: Stephen Checkoway [mailto:s at pahtak.org]
Sent: Thursday, November 17, 2016 18:51
To: Paparo Bivas, Omer <omer.paparo.bivas at intel.com>
Cc: llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] RFC: Insertion of nops for performance stability

[...]

Rather than inserting a nop, would it be better to change the instruction encoding to use a different form? The JE at offset 0x20 could use the JE rel32 (0f 84 0f 00 00 00) form. Similarly, the MOV at offset 0x22 could use MOV r64, r/m64 with a 32-bit offset (48 8b 84 24 30 00 00 00). The latter adds 3 bytes, which is insufficient in this case, but the former adds the required 4 bytes.

[...]
Paparo Bivas, Omer via llvm-dev
2016-Nov-20 14:08 UTC
[llvm-dev] RFC: Insertion of nops for performance stability
Hi Hal,

A pre-emit pass would indeed be preferable. I originally thought of that too; however, I could not figure out how such a pass can get access to information on instruction sizes and block alignments. I know that for X86, at least, branch relaxation happens during the layout phase in the assembler, which is where I plan to integrate the nop insertion, so that the new MCPerfNopFragment will just be yet another kind of fragment to handle when laying out all the fragments: laying out a relaxable branch means relaxing it if necessary, and laying out an MCPerfNopFragment will mean computing the number of nops it will contain.

Can you please refer me to a pre-emit pass that does something similar?

Thanks,
Omer

From: Hal Finkel [mailto:hfinkel at anl.gov]
Sent: Friday, November 18, 2016 20:57
To: Paparo Bivas, Omer <omer.paparo.bivas at intel.com>
Cc: llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] RFC: Insertion of nops for performance stability

[...]
Do you need to insert the nops at such a late stage? We have other components that insert nops during post-RA scheduling, for example. Even later, nops could be inserted during a pre-emit pass (similar to where other backends do branch relaxation/selection). The branch relaxation process also relies on knowing the layout of the code, but gathers this information from the available information on instruction sizes, block alignments, etc.

-Hal