Jakob Stoklund Olesen
2011-Feb-27 02:12 UTC
[LLVMdev] TableGen syntax for matching a constant load
On Feb 26, 2011, at 4:50 PM, Joerg Sonnenberger wrote:

> On Sun, Feb 27, 2011 at 01:29:25AM +0100, Joerg Sonnenberger wrote:
>> +let Predicates = [OptForSize] in {
>> +def : Pat<(store (i32 0), addr:$dst), (AND32mi8 addr:$dst, 0)>;
>> +def : Pat<(store (i32 -1), addr:$dst), (OR32mi8 addr:$dst, -1)>;
>> +def : Pat<(store (i64 0), addr:$dst), (AND64mi8 addr:$dst, 0)>;
>> +def : Pat<(store (i64 -1), addr:$dst), (OR64mi8 addr:$dst, -1)>;
>> +}
>
> All these patterns have one important downside. They are suboptimal if
> more than one store happens in a row. E.g. the 0 store is better
> expressed as xor followed by two register moves, if a register is
> available... This is most noticeable when memset() gets inlined.

Note that LLVM's -Os option does not quite mean the same as GCC's flag. It disables optimizations that increase code size without a clear performance gain; it does not try to minimize code size at any cost. When you said you weren't concerned about performance, I assumed you wouldn't be submitting patches. Sorry about the confusion.

Implementing constant stores as load+bitop+store is almost certainly not worth the small size win.

As for materializing (i32 -1) in 3 bytes instead of 5, but with 2 µ-ops instead of 1, I would like to see some performance numbers first. It might be cheap enough that it is worth it.

The MOV32ri instruction can be rematerialized and is considered to be as cheap as a move. That is not true for xorl+decl, and unfortunately the register allocator currently doesn't know how to rematerialize multiple instructions, which means that the register containing -1 could get spilled. We really don't want that to happen.

Until the register allocator learns how to rematerialize multiple instructions, you would need to use a pseudo-instruction representing the xorl+decl pair. That instruction could be marked as rematerializable and even as cheap as a move. Then you can start measuring the performance impact ;-)

Thanks,
/jakob
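For concreteness, a minimal sketch of what such a pseudo-instruction could look like, modeled on the existing MOV32r0 pseudo in the X86 .td files. The name MOV32r_M1 is made up here, and the late expansion to the real xorl+decl sequence would live in target C++ code that runs after register allocation (not shown):

    // Sketch: a rematerializable stand-in for the xorl+decl pair.  To the
    // register allocator it is one cheap instruction, so the -1 value gets
    // recomputed instead of spilled.  Size comparison for 32-bit mode:
    //   xorl %eax, %eax    31 c0             (2 bytes)
    //   decl %eax          48                (1 byte)
    // versus:
    //   movl $-1, %eax     b8 ff ff ff ff    (5 bytes)
    let Defs = [EFLAGS], isReMaterializable = 1, isAsCheapAsAMove = 1 in
    def MOV32r_M1 : I<0, Pseudo, (outs GR32:$dst), (ins), "",
                      [(set GR32:$dst, -1)]>;

Marking Defs = [EFLAGS] is essential, since the eventual expansion clobbers the flags.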
Chris Lattner
2011-Feb-27 02:40 UTC
[LLVMdev] TableGen syntax for matching a constant load
On Feb 26, 2011, at 6:12 PM, Jakob Stoklund Olesen wrote:

>> All these patterns have one important downside. They are suboptimal if
>> more than one store happens in a row. E.g. the 0 store is better
>> expressed as xor followed by two register moves, if a register is
>> available... This is most noticeable when memset() gets inlined.
>
> Note that LLVM's -Os option does not quite mean the same as GCC's flag.
> It disables optimizations that increase code size without a clear
> performance gain. It does not try to minimize code size at any cost.

Jakob is right, but there is a clear market for "smallest at any cost". The FreeBSD folks would really like to build their bootloader with clang, for example :).

It should be reasonably easy to add a new "optsize2" function attribute to LLVM IR, and have that be set with -Oz, the "optimize for size at any cost" flag, which could then enable stuff like this.

There are lots of other cases where this would be useful, such as forced use of "rep; stosb" on x86, which is much smaller than a call to memset, but also much slower :).

-Chris
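On the backend side, such an attribute could be consumed by mirroring the existing OptForSize predicate in X86InstrInfo.td. A sketch under that assumption; the OptForSize2 name and the flag it tests are made up here:

    // Companion to the existing OptForSize predicate.  The string is a C++
    // expression evaluated during instruction selection; the OptForSize2
    // flag it tests would be derived from the new IR attribute, just as
    // OptForSize is derived from the optsize attribute today.
    def OptForSize2 : Predicate<"OptForSize2">;

    // Joerg's store patterns could then be gated on the stronger predicate:
    let Predicates = [OptForSize2] in
    def : Pat<(store (i32 0), addr:$dst), (AND32mi8 addr:$dst, 0)>;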
Jakob Stoklund Olesen
2011-Feb-27 03:03 UTC
[LLVMdev] TableGen syntax for matching a constant load
On Feb 26, 2011, at 6:40 PM, Chris Lattner wrote:

> It should be reasonably easy to add a new "optsize2" function attribute
> to LLVM IR, and have that be set with -Oz, the "optimize for size at
> any cost" flag, which could then enable stuff like this.
>
> There are lots of other cases where this would be useful, such as
> forced use of "rep; stosb" on x86, which is much smaller than a call
> to memset, but also much slower :).

Such a mode would also:

- Lower the inlining threshold to 0, so functions are only inlined when it would decrease the overall code size.
- Disable the SSEExecutionDomain pass, which selects equivalent, but larger, vector instructions.
- Disable tail duplication and be more aggressive about tail merging.
- Perhaps select x86 instructions with 16-bit immediates? (See the sketch below.)

You could also tweak the heuristics used by register coalescing and spill code placement, but that involves more work.

/jakob
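To put rough numbers on the 16-bit-immediate idea, a sketch of the encoding trade-off; it is only valid when the upper 16 bits of the value provably do not matter, and the decode penalty is why this belongs in "size at any cost" territory:

    # 32-bit compare against a constant too large for a signed imm8
    # (5 bytes, using the short %eax-only form):
        cmpl $300, %eax        # 3d 2c 01 00 00
    # 16-bit compare of the low half, when that is known to suffice
    # (4 bytes):
        cmpw $300, %ax         # 66 3d 2c 01
    # The 66 operand-size prefix in front of an imm16 is a
    # length-changing prefix, which many x86 decoders handle slowly.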
Joerg Sonnenberger
2011-Feb-27 03:10 UTC
[LLVMdev] TableGen syntax for matching a constant load
On Sat, Feb 26, 2011 at 06:40:19PM -0800, Chris Lattner wrote:
> On Feb 26, 2011, at 6:12 PM, Jakob Stoklund Olesen wrote:
>
>>> All these patterns have one important downside. They are suboptimal if
>>> more than one store happens in a row. E.g. the 0 store is better
>>> expressed as xor followed by two register moves, if a register is
>>> available... This is most noticeable when memset() gets inlined.
>>
>> Note that LLVM's -Os option does not quite mean the same as GCC's flag.
>> It disables optimizations that increase code size without a clear
>> performance gain. It does not try to minimize code size at any cost.
>
> Jakob is right, but there is a clear market for "smallest at any cost".
> The FreeBSD folks would really like to build their bootloader with
> clang, for example :).

Yes, I have the same problem for NetBSD. All but two of the boot loaders are working; one is currently off by less than 300 bytes, the other by 800.

> It should be reasonably easy to add a new "optsize2" function attribute
> to LLVM IR, and have that be set with -Oz, the "optimize for size at
> any cost" flag, which could then enable stuff like this.
>
> There are lots of other cases where this would be useful, such as
> forced use of "rep; stosb" on x86, which is much smaller than a call
> to memset, but also much slower :).

Agreed. From studying GCC's peephole optimisation list and the generated assembler code, I see the following candidates for space savings:

setcc followed by movzbl can be turned into xor followed by setcc. This is #8785 and a general optimisation; I have seen enough code that would profit from it.

The optimised memory stores from this thread. The question of assigning a scratch register when multiple instructions want to use the same 32-bit immediate would be useful in other cases too.

Function prologue and epilogue can often be optimised by adjusting %esp or %rsp with push/pop. E.g. in 32-bit mode, "addl $4, %esp" and "addl $8, %esp" are more compact as one or two pops to a scratch register. This is also a hot path on most CPUs. The same applies to subtracting and to 64-bit mode. Generally, using push/pop for stack manipulation would be much nicer for code size, but would require extensive changes to the code generator. I think this accounts for the majority of why GCC creates smaller binaries.

Using cmp/test against a constant before a conditional branch can often be optimised if the register is dead. For cmp, a check against -1 or 1 can be replaced with inc/dec and inverting the condition, saving 2 bytes in 32-bit mode and 1 byte in 64-bit mode; this applies at all optimiser levels. Compares against 8-bit signed immediates for 32-bit/64-bit registers can be expressed as add or sub, saving 2 bytes in all cases. (A sketch of two of these candidates follows below.)

Joerg
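For concreteness, a minimal sketch of the setcc and the cmp-against-1 candidates, with encodings for 32-bit mode; the liveness preconditions (EFLAGS free at the point of the xor, %eax dead after the branch) are assumed rather than shown:

    # setcc + movzbl (6 bytes) -> xor + setcc (5 bytes).  The xor must be
    # placed before the flag-setting comparison, since it clobbers EFLAGS:
        sete   %al             # 0f 94 c0
        movzbl %al, %eax       # 0f b6 c0
    # becomes:
        xorl   %eax, %eax      # 31 c0    (emitted before the compare)
        sete   %al             # 0f 94 c0

    # cmp against 1 plus branch -> dec, when %eax is otherwise dead:
        cmpl   $1, %eax        # 83 f8 01 (3 bytes)
        je     .Lwas_one
    # becomes:
        decl   %eax            # 48       (1 byte in 32-bit mode)
        je     .Lwas_one       # ZF is set exactly when %eax was 1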