Jakob Stoklund Olesen
2011-Feb-27 02:12 UTC
[LLVMdev] TableGen syntax for matching a constant load
On Feb 26, 2011, at 4:50 PM, Joerg Sonnenberger wrote:

> On Sun, Feb 27, 2011 at 01:29:25AM +0100, Joerg Sonnenberger wrote:
>> +let Predicates = [OptForSize] in {
>> +def : Pat<(store (i32 0), addr:$dst), (AND32mi8 addr:$dst, 0)>;
>> +def : Pat<(store (i32 -1), addr:$dst), (OR32mi8 addr:$dst, -1)>;
>> +def : Pat<(store (i64 0), addr:$dst), (AND64mi8 addr:$dst, 0)>;
>> +def : Pat<(store (i64 -1), addr:$dst), (OR64mi8 addr:$dst, -1)>;
>> +}
>
> All these patterns have one important downside. They are suboptimal if
> more than one store happens in a row. E.g. the 0 store is better
> expressed as xor followed by two register moves, if a register is
> available... This is most noticeable when memset() gets inlined.

Note that LLVM's -Os option does not quite mean the same as GCC's flag. It disables optimizations that increase code size without a clear performance gain; it does not try to minimize code size at any cost. When you said you weren't concerned about performance, I assumed you wouldn't be submitting patches. Sorry about the confusion.

Implementing constant stores as load+bitop+store is almost certainly not worth the small size win.

As for materializing (i32 -1) in 3 bytes instead of 5, but with 2 µ-ops instead of 1, I would like to see some performance numbers first. It might be cheap enough that it is worth it.

The MOV32ri instruction can be rematerialized and is considered to be as cheap as a move. That is not true for xorl+decl, and unfortunately the register allocator currently doesn't know how to rematerialize multiple instructions, which means that the register containing -1 could get spilled. We really don't want that to happen.

Until the register allocator learns how to rematerialize multiple instructions, you would need to use a pseudo-instruction representing the xorl+decl pair. That instruction could be marked as rematerializable and even as cheap as a move. Then you can start measuring the performance impact ;-)

Thanks,
/jakob
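For concreteness, a minimal sketch of what such a pseudo-instruction could look like, modeled on the existing MOV32r0 pseudo in the X86 .td files. The name MOV32r_M1 is made up here, and the late expansion to the real xorl+decl sequence would live in target C++ code that runs after register allocation (not shown):

    // Sketch: a rematerializable stand-in for the xorl+decl pair.  To the
    // register allocator it is one cheap instruction, so the -1 value gets
    // recomputed instead of spilled.  Size comparison for 32-bit mode:
    //   xorl %eax, %eax    31 c0             (2 bytes)
    //   decl %eax          48                (1 byte)
    // versus:
    //   movl $-1, %eax     b8 ff ff ff ff    (5 bytes)
    let Defs = [EFLAGS], isReMaterializable = 1, isAsCheapAsAMove = 1 in
    def MOV32r_M1 : I<0, Pseudo, (outs GR32:$dst), (ins), "",
                      [(set GR32:$dst, -1)]>;

Marking Defs = [EFLAGS] is essential, since the eventual expansion clobbers the flags.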
Chris Lattner
2011-Feb-27 02:40 UTC
[LLVMdev] TableGen syntax for matching a constant load
On Feb 26, 2011, at 6:12 PM, Jakob Stoklund Olesen wrote:

>> All these patterns have one important downside. They are suboptimal if
>> more than one store happens in a row. E.g. the 0 store is better
>> expressed as xor followed by two register moves, if a register is
>> available... This is most noticeable when memset() gets inlined.
>
> Note that LLVM's -Os option does not quite mean the same as GCC's flag.
> It disables optimizations that increase code size without a clear
> performance gain. It does not try to minimize code size at any cost.

Jakob is right, but there is a clear market for "smallest at any cost". The FreeBSD folks would really like to build their bootloader with clang, for example :).

It should be reasonably easy to add a new "optsize2" function attribute to LLVM IR, and have that be set with -Oz, the "optimize for size at any cost" flag, which could then enable stuff like this.

There are lots of other cases where this would be useful, such as forced use of "rep; stosb" on x86, which is much smaller than a call to memset, but also much slower :).

-Chris
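On the backend side, such an attribute could be consumed by mirroring the existing OptForSize predicate in X86InstrInfo.td. A sketch under that assumption; the OptForSize2 name and the flag it tests are made up here:

    // Companion to the existing OptForSize predicate.  The string is a C++
    // expression evaluated during instruction selection; the OptForSize2
    // flag it tests would be derived from the new IR attribute, just as
    // OptForSize is derived from the optsize attribute today.
    def OptForSize2 : Predicate<"OptForSize2">;

    // Joerg's store patterns could then be gated on the stronger predicate:
    let Predicates = [OptForSize2] in
    def : Pat<(store (i32 0), addr:$dst), (AND32mi8 addr:$dst, 0)>;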
Jakob Stoklund Olesen
2011-Feb-27 03:03 UTC
[LLVMdev] TableGen syntax for matching a constant load
On Feb 26, 2011, at 6:40 PM, Chris Lattner wrote:

> It should be reasonably easy to add a new "optsize2" function attribute
> to LLVM IR, and have that be set with -Oz, the "optimize for size at
> any cost" flag, which could then enable stuff like this.
>
> There are lots of other cases where this would be useful, such as
> forced use of "rep; stosb" on x86, which is much smaller than a call
> to memset, but also much slower :).

Such a mode would also:

- Lower the inlining threshold to 0, so functions are only inlined when it would decrease the overall code size.
- Disable the SSEExecutionDomain pass, which selects equivalent, but larger, vector instructions.
- Disable tail duplication and be more aggressive about tail merging.
- Perhaps select x86 instructions with 16-bit immediates? (See the sketch below.)

You could also tweak the heuristics used by register coalescing and spill code placement, but that involves more work.

/jakob
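To put rough numbers on the 16-bit-immediate idea, a sketch of the encoding trade-off; it is only valid when the upper 16 bits of the value provably do not matter, and the decode penalty is why this belongs in "size at any cost" territory:

    # 32-bit compare against a constant too large for a signed imm8
    # (5 bytes, using the short %eax-only form):
        cmpl $300, %eax        # 3d 2c 01 00 00
    # 16-bit compare of the low half, when that is known to suffice
    # (4 bytes):
        cmpw $300, %ax         # 66 3d 2c 01
    # The 66 operand-size prefix in front of an imm16 is a
    # length-changing prefix, which many x86 decoders handle slowly.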
Joerg Sonnenberger
2011-Feb-27 03:10 UTC
[LLVMdev] TableGen syntax for matching a constant load
On Sat, Feb 26, 2011 at 06:40:19PM -0800, Chris Lattner wrote:
> On Feb 26, 2011, at 6:12 PM, Jakob Stoklund Olesen wrote:
>
>>> All these patterns have one important downside. They are suboptimal if
>>> more than one store happens in a row. E.g. the 0 store is better
>>> expressed as xor followed by two register moves, if a register is
>>> available... This is most noticeable when memset() gets inlined.
>>
>> Note that LLVM's -Os option does not quite mean the same as GCC's flag.
>> It disables optimizations that increase code size without a clear
>> performance gain. It does not try to minimize code size at any cost.
>
> Jakob is right, but there is a clear market for "smallest at any cost".
> The FreeBSD folks would really like to build their bootloader with
> clang, for example :).

Yes, I have the same problem for NetBSD. All but two of the boot loaders are working; one is currently off by less than 300 bytes, the other by 800.

> It should be reasonably easy to add a new "optsize2" function attribute
> to LLVM IR, and have that be set with -Oz, the "optimize for size at
> any cost" flag, which could then enable stuff like this.
>
> There are lots of other cases where this would be useful, such as
> forced use of "rep; stosb" on x86, which is much smaller than a call
> to memset, but also much slower :).

Agreed. From studying GCC's peephole optimisation list and the generated assembler code, I see the following candidates for space savings:

setcc followed by movzbl can be turned into xor followed by setcc. This is #8785 and a general optimisation; I have seen enough code that would profit from it.

The optimised memory stores from this thread. The question of assigning a scratch register when multiple instructions want to use the same 32-bit immediate would be useful in other cases too.

Function prologue and epilogue can often be optimised by adjusting %esp or %rsp with push/pop. E.g. in 32-bit mode, "addl $4, %esp" and "addl $8, %esp" are more compact as one or two pops to a scratch register. This is also a hot path on most CPUs. The same applies to subtracting and to 64-bit mode. Generally, using push/pop for stack manipulation would be much nicer for code size, but would require extensive changes to the code generator. I think this accounts for the majority of why GCC creates smaller binaries.

Using cmp/test against a constant before a conditional branch can often be optimised if the register is dead. For cmp, a check against -1 or 1 can be replaced with inc/dec and inverting the condition, saving 2 bytes in 32-bit mode and 1 byte in 64-bit mode; this applies at all optimiser levels. Compares against 8-bit signed immediates for 32-bit/64-bit registers can be expressed as add or sub, saving 2 bytes in all cases. (A sketch of two of these candidates follows below.)

Joerg
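For concreteness, a minimal sketch of the setcc and the cmp-against-1 candidates, with encodings for 32-bit mode; the liveness preconditions (EFLAGS free at the point of the xor, %eax dead after the branch) are assumed rather than shown:

    # setcc + movzbl (6 bytes) -> xor + setcc (5 bytes).  The xor must be
    # placed before the flag-setting comparison, since it clobbers EFLAGS:
        sete   %al             # 0f 94 c0
        movzbl %al, %eax       # 0f b6 c0
    # becomes:
        xorl   %eax, %eax      # 31 c0    (emitted before the compare)
        sete   %al             # 0f 94 c0

    # cmp against 1 plus branch -> dec, when %eax is otherwise dead:
        cmpl   $1, %eax        # 83 f8 01 (3 bytes)
        je     .Lwas_one
    # becomes:
        decl   %eax            # 48       (1 byte in 32-bit mode)
        je     .Lwas_one       # ZF is set exactly when %eax was 1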