thr3ads.net - llvm dev - [LLVMdev] RFC: [AArtch64] Removal of spurious mask instructions introduced during legalization [Jul 2014]

If this information is useful, please help other people find it:
Share via:

Louis Gerbarg

2014-Jul-01 22:11 UTC

[LLVMdev] RFC: [AArtch64] Removal of spurious mask instructions introduced during legalization

We have seen a series of performance issues where we are emitting unnecessary
masks on ARM64 that all have a similar form. Here is an example:

define zeroext i1 @test(i8 zeroext %theChar)  align 2 {                         
  entry:                                                                        
    %theChar.off = add i8 %theChar, -97                                         
    %0 = icmp ult i8 %theChar.off, 26                                           
    br i1 %0, label %return, label %lor.lhs.false                               
                                                                                
    lor.lhs.false:                                    ; preds = %entry          
    ret i1 false                                                                
                                                                                
  return:                                           ; preds = %entry            
    ret i1 true                                                                 
}

compiles to:

	.globl	_test
	.align	2
_test:                                  ; @test
	.cfi_startproc
; BB#0:                                 ; %entry
	sub	w8, w0, #97             ; =97
	and	w8, w8, #0xff
	cmp	 w8, #26                ; =26
	cset	 w0, lo
	ret
	.cfi_endproc

but the “and w8, w8, 0xff” is superfluous and could be removed, yielding:

	.globl	_test
	.align	2
_test:                                  ; @test
	.cfi_startproc
; BB#0:                                 ; %entry
	sub	w8, w0, #97             ; =97
	cmp	 w8, #26                ; =26
	cset	 w0, lo
	ret
	.cfi_endproc

The AND is just a single extra instruction, but it is an additional data
dependency and it sometimes shows up in the middle of things like string
processing loops and several hot loops in SPEC, so we would really like to fix
it.

After looking into it I have determined what is going on is that when an i8 or
i16 is legalized up to an i32 the extra masks are introduced at intermediary
steps of the calculation to preserve the same overflow behavior that would have
occurred if there were native 8 or 16 bit data types on the processor. In some
cases that is necessary, but in many cases (such as the one above) based on the
type of compares involved and the provenance of the inputs it is possible to
know that behavior is the same regardless of whether or not the AND instruction
is there, and thus it is safe to remove it.

While at first glance it seems like it would be reasonable to handle this with a
few DAG patterns in the TableGen files it turns out that is a less than
satisfactory approach. Whether the AND is removable is constrained by the values
of 2-3 of the inputs and the compare, which ends up with a combinatorial
explosion of patterns, especially once one considers the inputs values may come
from slightly different nodes such AssertZexts or extending loads (which we need
to capture since we need to know the intended extension semantics). Attached is
a file (AndDAG.pdf) that shows what I believe to be a canonical form of the
sub-DAG we are looking at, the following details may vary:

1) The orange subtree code be a different form of extension, such a
load<LD1[%arrayidx], anyext from i8> instead of an AssertZext
2) The blue and purple elements could be different constants
3) The green element (which is the condition code) could be different
4) It could all be feeding a different instruction (in particular an
AArch64ISD::BRCOND) at the end.
5) The inputs code be i16 instead i8

Rather than try respond reactively to individual cases we see in bugs I would
like to solve the entire problem. To that end I propose introducing an
architecture specific DAG combine that recognizes this basic tree, and performs
the and elimination if legal. Recognizing the DAG and recording the value is
fairly straight forward, but determining when the and is legal is somewhat more
complicated. Each case has taken me a few minutes, and that is just looking at
constant values.

In order to speed of the process I wrote a small command line tool that runs
over all inputs for both assembly patterns and generates 2 dimensional charts of
the cases where the code is the same with and without the mask. Using these
charts it is possible to write a series of equations to cover all the legal
cases, not unlike a Carnaugh map. Since these equations need to work for inputs
that are both i8 and i16 they have to be written symbolically in terms of the
input types (max representable, min representable, positive, negative, 0, etc)
in order to be generally applicable. Using that fact it is possible to reduce
the test case down to i4’s using 0x0f, making the tables much more tractable.
The tables run from -8 to +15 in both dimensions expressing all the 32 bit
representations of zero and sign extended i4 values. I am attaching an example
chart (ZeroExtendedLO.png), but I actually generate 24 of them (the al, nv, vc
and vs condition codes are not relevant leaving us with 12 codes, and then we
need 2 charts for each, one when the input is sign extended and when when it is
zero extended).

Fox example, a set of covering equations for a zero extended variable using for
the LO cond code is:

(AddConstant >= 0 && SubsConstant <= 0)
|| (AddConstant <= 0 && SubsConstant >= 0 && SubsConstant
<= int_max)
|| (Subs2 >= Subs2 + uint_max))

Which would result in an equivalent chart to the one attached when using 4 bit
integers, but representing it symbolically it also expands to 8 and 16 bit
integers.

Attached is a WIP patch that I have made. It is incomplete in several ways, but
I believe it should be enough to discuss whether or not this is reasonable path
to pursue. In particular it:

1) It is not quite LLVM coding style
2) It only works against a AssertZext input
3) It only has equations for LO
4) It only matches on CSEL, not BRCOND et al
6) It pass the CC around as an unsigned 
7) It makes also several simplifying assumptions that I believe are true but
need to confirm:
        1) By the time the DAG gets to me it has been canonicalized such that it
is of the form:
             (AArch64ISD::SUBS (and (+ variable constant1) mask) constant2)
         2) The mask is a 2^8-1 or 2^16-1 (though it is possible to extend this
to arbitrary powers of 2 it doesn’t
             seem like any fronted would generate those)

If there is consensus this is a reasonable path to pursue then my intent is to
clean up the patch fill in the other sets of equations, modifying the script I
used to generate the tables to output JSON or something similar, and write a
validator to ensure the equations 100% accurately match the generated output,
then choose a suitable subset of test cases to include with the patch.

Louis
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AndDAG.pdf
Type: application/pdf
Size: 22478 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140701/1e68961d/attachment.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ZeroExtendedLO.png
Type: image/png
Size: 36817 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140701/1e68961d/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: DAG-AND.patch
Type: application/octet-stream
Size: 5939 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140701/1e68961d/attachment.obj>

Hal Finkel

2014-Jul-01 22:33 UTC

head link

[LLVMdev] RFC: [AArtch64] Removal of spurious mask instructions introduced during legalization

Hi Louis,

This is interesting. Moreover, it reminds me of the problem that I addressed in
PPCTargetLowering::DAGCombineExtBoolTrunc and
PPCTargetLowering::DAGCombineTruncBoolExt for i1 (and i32) <-> i64
extensions in the PPC backend. I think you could apply a similar methodology to
finding larger clusters of instructions that can benefit from this kind of
transformation.

 -Hal

----- Original Message -----> From: "Louis Gerbarg" <lgg at apple.com>
> To: llvmdev at cs.uiuc.edu
> Sent: Tuesday, July 1, 2014 5:11:11 PM
> Subject: [LLVMdev] RFC: [AArtch64] Removal of spurious mask instructions
introduced during legalization
> 
> 
> 
> We have seen a series of performance issues where we are emitting
> unnecessary masks on ARM64 that all have a similar form. Here is an
> example:
> 
> define zeroext i1 @test(i8 zeroext %theChar) align 2 {
> entry:
> %theChar.off = add i8 %theChar, -97
> %0 = icmp ult i8 %theChar.off, 26
> br i1 %0, label %return, label %lor.lhs.false
> 
> lor.lhs.false: ; preds = %entry
> ret i1 false
> 
> return: ; preds = %entry
> ret i1 true
> }
> 
> compiles to:
> 
> .globl _test
> .align 2
> _test: ; @test
> .cfi_startproc
> ; BB#0: ; %entry
> sub w8, w0, #97 ; =97
> and w8, w8, #0xff
> cmp w8, #26 ; =26
> cset w0, lo
> ret
> .cfi_endproc
> 
> but the “and w8, w8, 0xff” is superfluous and could be removed,
> yielding:
> 
> .globl _test
> .align 2
> _test: ; @test
> .cfi_startproc
> ; BB#0: ; %entry
> sub w8, w0, #97 ; =97
> cmp w8, #26 ; =26
> cset w0, lo
> ret
> .cfi_endproc
> 
> The AND is just a single extra instruction, but it is an additional
> data dependency and it sometimes shows up in the middle of things
> like string processing loops and several hot loops in SPEC, so we
> would really like to fix it.
> 
> After looking into it I have determined what is going on is that when
> an i8 or i16 is legalized up to an i32 the extra masks are
> introduced at intermediary steps of the calculation to preserve the
> same overflow behavior that would have occurred if there were native
> 8 or 16 bit data types on the processor. In some cases that is
> necessary, but in many cases (such as the one above) based on the
> type of compares involved and the provenance of the inputs it is
> possible to know that behavior is the same regardless of whether or
> not the AND instruction is there, and thus it is safe to remove it.
> 
> While at first glance it seems like it would be reasonable to handle
> this with a few DAG patterns in the TableGen files it turns out that
> is a less than satisfactory approach. Whether the AND is removable
> is constrained by the values of 2-3 of the inputs and the compare,
> which ends up with a combinatorial explosion of patterns, especially
> once one considers the inputs values may come from slightly
> different nodes such AssertZexts or extending loads (which we need
> to capture since we need to know the intended extension semantics).
> Attached is a file (AndDAG.pdf) that shows what I believe to be a
> canonical form of the sub-DAG we are looking at, the following
> details may vary:
> 
> 1) The orange subtree code be a different form of extension, such a
> load<LD1[%arrayidx], anyext from i8> instead of an AssertZext
> 2) The blue and purple elements could be different constants
> 3) The green element (which is the condition code) could be different
> 4) It could all be feeding a different instruction (in particular an
> AArch64ISD::BRCOND) at the end.
> 5) The inputs code be i16 instead i8
> 
> Rather than try respond reactively to individual cases we see in bugs
> I would like to solve the entire problem. To that end I propose
> introducing an architecture specific DAG combine that recognizes
> this basic tree, and performs the and elimination if legal.
> Recognizing the DAG and recording the value is fairly straight
> forward, but determining when the and is legal is somewhat more
> complicated. Each case has taken me a few minutes, and that is just
> looking at constant values.
> 
> In order to speed of the process I wrote a small command line tool
> that runs over all inputs for both assembly patterns and generates 2
> dimensional charts of the cases where the code is the same with and
> without the mask. Using these charts it is possible to write a
> series of equations to cover all the legal cases, not unlike a
> Carnaugh map. Since these equations need to work for inputs that are
> both i8 and i16 they have to be written symbolically in terms of the
> input types (max representable, min representable, positive,
> negative, 0, etc) in order to be generally applicable. Using that
> fact it is possible to reduce the test case down to i4’s using 0x0f,
> making the tables much more tractable. The tables run from -8 to +15
> in both dimensions expressing all the 32 bit representations of zero
> and sign extended i4 values. I am attaching an example chart
> (ZeroExtendedLO.png), but I actually generate 24 of them (the al,
> nv, vc and vs condition codes are not relevant leaving us with 12
> codes, and then we need 2 charts for each, one when the input is
> sign extended and when when it is zero extended).
> 
> Fox example, a set of covering equations for a zero extended variable
> using for the LO cond code is:
> 
> (AddConstant >= 0 && SubsConstant <= 0)
> || (AddConstant <= 0 && SubsConstant >= 0 &&
SubsConstant <= int_max)
> || (Subs2 >= Subs2 + uint_max))
> 
> Which would result in an equivalent chart to the one attached when
> using 4 bit integers, but representing it symbolically it also
> expands to 8 and 16 bit integers.
> 
> Attached is a WIP patch that I have made. It is incomplete in several
> ways, but I believe it should be enough to discuss whether or not
> this is reasonable path to pursue. In particular it:
> 
> 1) It is not quite LLVM coding style
> 2) It only works against a AssertZext input
> 3) It only has equations for LO
> 4) It only matches on CSEL, not BRCOND et al
> 6) It pass the CC around as an unsigned
> 7) It makes also several simplifying assumptions that I believe are
> true but need to confirm:
> 1) By the time the DAG gets to me it has been canonicalized such that
> it is of the form:
> (AArch64ISD::SUBS (and (+ variable constant1) mask) constant2)
> 2) The mask is a 2^8-1 or 2^16-1 (though it is possible to extend
> this to arbitrary powers of 2 it doesn’t
> seem like any fronted would generate those)
> 
> If there is consensus this is a reasonable path to pursue then my
> intent is to clean up the patch fill in the other sets of equations,
> modifying the script I used to generate the tables to output JSON or
> something similar, and write a validator to ensure the equations
> 100% accurately match the generated output, then choose a suitable
> subset of test cases to include with the patch.
> 
> Louis
> 
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

llvm dev - Jul 2014 - [LLVMdev] RFC: [AArtch64] Removal of spurious mask instructions introduced during legalization

[LLVMdev] RFC: [AArtch64] Removal of spurious mask instructions introduced during legalization

[LLVMdev] RFC: [AArtch64] Removal of spurious mask instructions introduced during legalization