thr3ads.net - llvm dev - [LLVMdev] [RFC] LegalizeDAG support for targets without subword load/store instructions [Jul 2011]


Matt Johnson

2011-Jul-16 02:34 UTC

[LLVMdev] [RFC] LegalizeDAG support for targets without subword load/store instructions

Hi All,
     Some targets don't provide subword (e.g., i8 and i16 for a 32-bit 
machine) load and store instructions, so currently we have to 
custom-lower Load- and StoreSDNodes in our backends.  For examples, see 
LowerLOAD() and LowerSTORE() in {XCore,CellSPU}ISelLowering.cpp.  I 
believe it's possible to support this lowering in a target-agnostic 
fashion in LegalizeDAG.cpp, similar to what is done for 
non-naturally-aligned loads and stores using the 
allowsUnalignedMemoryAccesses() target hook.

     I wanted to see if there was any interest in something like this 
for mainline before writing up something more detailed.  Here are a few 
supporting details for now:

* Several existing machines don't provide loads and stores for every 
power-of-2-sized datatype down to i8.  For example, Cell's SPUs only 
support 16-byte, 16-byte-aligned memory ops (this restriction is found 
for other SIMD processors as well), and some GPUs don't support i8 or 
i16 loads/stores.
* Even when short memory operations are possible, sometimes they are 
implemented in a very conservative way (e.g., trapping to a software 
routine) such that, if the compiler can expand the operation statically 
and make use of whatever alignment information it does have, it should 
do so.
* The current expansion of unaligned loads and stores in LegalizeDAG.cpp 
doesn't work for machines that don't support these datatypes.  The 
reason is that ExpandUnaligned*() splits the X-bit load/store into 2 
*independent* (parallel in the SelectionDAG, with a TokenFactor "beneath" 
them) X/2-bit load/stores.  In machines without subword stores, you have 
to be careful about the ordering of the constituent operations, such 
that two stores don't clobber one another.  Here's an example:

Say I have a 32-bit target that only supports i32 loads and stores, and 
I have a word of memory at address 0x1000, initialized to 0x00000000, 
holding two adjacent i16's: s1 at 0x1000 (the upper half of the word) and 
s2 at 0x1002 (the lower half).  I want to write 0x1234 to s1, and 0xABCD 
to s2.  I thus need to do (pseudocode):

r1 = mem[(0x1000 & ~0x3)]  #Load word containing s1
r2 = 0x1234 << 16          #Shift s1 value into place
r3 = r1 & 0x0000FFFF       #Mask out s1 bits
r4 = r3 | r2               #OR in s1 value
mem[(0x1000 & ~0x3)] = r4  #Store back word containing new s1 value   *****
r5 = mem[(0x1002 & ~0x3)]  #Load word containing s2                   *****
r6 = 0xABCD                #s2 value doesn't need to be shifted
r7 = r5 & 0xFFFF0000       #Mask out s2 bits
r8 = r7 | r6               #OR in s2 value
mem[(0x1002 & ~0x3)] = r8  #Store back word containing new s2 value

If all goes well, the word at mem[0x1000] should read 0x1234ABCD after 
we're done.

NOTE: The two starred instructions (the store for s1 and the load for 
s2) *must* be executed in that order.  Otherwise, the s2 
read-modify-write will see the old value of s1, and will clobber it when 
it writes back (yielding an incorrect mem[0x1000] value of 0x0000ABCD).
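
To make the hazard concrete, here is a minimal C++ sketch of the two orderings. It is only a simulation under stated assumptions: WordMemory, insertHalf, storeSequential and storeReordered are hypothetical names, the memory model is a single word at 0x1000, and the halfword placement follows the example above (the i16 at the lower address sits in the upper 16 bits).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical word-only memory: a single aligned 32-bit word at 0x1000.
struct WordMemory {
    uint32_t word = 0x00000000;
    uint32_t load(uint32_t addr) const { assert(addr == 0x1000); return word; }
    void store(uint32_t addr, uint32_t v) { assert(addr == 0x1000); word = v; }
};

// Replace the i16 at byte address `addr` inside `word`.
// Assumption: the halfword at the lower address occupies bits 31..16.
uint32_t insertHalf(uint32_t word, uint32_t addr, uint16_t value) {
    const unsigned shift = (addr & 0x2) ? 0 : 16;
    const uint32_t mask = 0xFFFFu << shift;
    return (word & ~mask) | (uint32_t(value) << shift);
}

// Correct order: the s1 store completes before the s2 load begins.
uint32_t storeSequential() {
    WordMemory mem;
    uint32_t r1 = mem.load(0x1000 & ~0x3u);
    mem.store(0x1000 & ~0x3u, insertHalf(r1, 0x1000, 0x1234));
    uint32_t r5 = mem.load(0x1002 & ~0x3u);
    mem.store(0x1002 & ~0x3u, insertHalf(r5, 0x1002, 0xABCD));
    return mem.word;
}

// Hazardous order: both loads are hoisted above both stores, which is
// what two independent read-modify-write chains would permit.
uint32_t storeReordered() {
    WordMemory mem;
    uint32_t r1 = mem.load(0x1000 & ~0x3u);
    uint32_t r5 = mem.load(0x1002 & ~0x3u);  // sees the old s1 bits
    mem.store(0x1000 & ~0x3u, insertHalf(r1, 0x1000, 0x1234));
    mem.store(0x1002 & ~0x3u, insertHalf(r5, 0x1002, 0xABCD));
    return mem.word;
}
```

Under these assumptions storeSequential() leaves 0x1234ABCD in the word, while storeReordered() leaves 0x0000ABCD, which is the clobbering described in the note above.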

I'm not experienced enough with LLVM to figure out the most precise way 
to express this dependence in my lowering function.  My current solution 
is to mark all loads and stores in these cases as volatile.  This is too 
heavy-handed for my taste, and disallows reordering loads and stores 
that are to completely separate parts of memory, but it works for now.  
I think we can do a more precise job here, but I'm not exactly sure how.

Comments, questions, or requests for clarifications are welcome.  
Basically, I think we could obviate the logic in CellSPU, XCore, and 
future backends, as well as do a better job of optimizing based on 
available alignment information, by moving subword load/store lowering 
into LegalizeDAG, and adding another target hook along the lines of 
allowsUnalignedMemoryAccesses().

I'm interested in working on this and integrating it into mainline if 
people think it's worthwhile and not contrary to project goals.  
Otherwise, I can hack what I need into my own backend.

Best,
Matt

Richard Osborne

2011-Jul-16 21:01 UTC


On 16 Jul 2011, at 03:34, Matt Johnson wrote:
> Hi All,
>     Some targets don't provide subword (e.g., i8 and i16 for a 32-bit 
> machine) load and store instructions, so currently we have to 
> custom-lower Load- and StoreSDNodes in our backends.  For examples, see 
> LowerLOAD() and LowerSTORE() in {XCore,CellSPU}ISelLowering.cpp.  I 
> believe it's possible to support this lowering in a target-agnostic 
> fashion in LegalizeDAG.cpp, similar to what is done for 
> non-naturally-aligned loads and stores using the 
> allowsUnalignedMemoryAccesses() target hook.
The XCore does support i8 and i16 loads and stores. As far as I can remember the
standard lowering produced functionally correct code for us. We custom lower
misaligned loads and stores because we want to produce code that is better
optimized for our target.

In particular, if an i32 load is from an address known to be a constant offset
away from being word aligned, it is quicker to load the two 32-bit values at
aligned addresses which overlap the data and then shift and OR these values to
form the result.
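
A rough C++ sketch of that two-load technique follows. It is not the XCore lowering itself: loadMisaligned is a hypothetical helper, memory is modelled as an array of aligned words, and the byte order is assumed to be big-endian purely to match the example earlier in the thread (a little-endian target would swap the shift directions).

```cpp
#include <cassert>
#include <cstdint>

// Load a 32-bit value that starts `offset` bytes (1..3) past the aligned
// word words[wordIndex], using only two aligned word loads.
// Assumption: big-endian byte order within each word.
uint32_t loadMisaligned(const uint32_t *words, unsigned wordIndex, unsigned offset) {
    assert(offset >= 1 && offset <= 3);
    const uint32_t first = words[wordIndex];       // holds the leading bytes
    const uint32_t second = words[wordIndex + 1];  // holds the trailing bytes
    const unsigned shift = 8 * offset;
    // Drop the bytes before the value, then OR in the top bytes of the next word.
    return (first << shift) | (second >> (32 - shift));
}
```

When the offset is a known constant, both shift amounts become immediates, so no address arithmetic or masking is needed at run time. For instance, with words {0x11223344, 0x55667788} and offset 1, the function returns 0x22334455.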

Also, i32 loads/stores not known to be 32-bit or 16-bit aligned are expanded to a
call to a library function. This can be a big code size win as these operations
would otherwise expand to a significant number of instructions.

I'm not sure how this fits in with the changes you want to make. It does
sound like the kind of thing that would be good to add to the target independent
lowering code,  but I suspect it won't help the XCore backend.

Regards,

Richard

Matt Johnson

2011-Jul-16 23:09 UTC


On 07/16/2011 04:01 PM, Richard Osborne wrote:
> On 16 Jul 2011, at 03:34, Matt Johnson wrote:
>
>> Hi All,
>>      Some targets don't provide subword (e.g., i8 and i16 for a 32-bit
>> machine) load and store instructions, so currently we have to
>> custom-lower Load- and StoreSDNodes in our backends.  For examples, see
>> LowerLOAD() and LowerSTORE() in {XCore,CellSPU}ISelLowering.cpp.  I
>> believe it's possible to support this lowering in a target-agnostic
>> fashion in LegalizeDAG.cpp, similar to what is done for
>> non-naturally-aligned loads and stores using the
>> allowsUnalignedMemoryAccesses() target hook.
> The XCore does support i8 and i16 loads and stores. As far as I can
> remember the standard lowering produced functionally correct code for us. We
> custom lower misaligned loads and stores because we want to produce code that is
> better optimized for our target.
Thanks for the clarification!  I didn't do enough homework on the XCore 
backend; I initially patterned unaligned support in my own backend after 
the stuff in CellSPU (which *is* for correctness) and grepped around for 
any other backends with similar constructs as I wrote my previous post.
> In particular, if an i32 load is from an address known to be a constant
> offset away from being word aligned, it is quicker to load the two 32-bit values
> at aligned addresses which overlap the data and then shift and OR these values
> to form the result.
Smart; this allows you to elide some ADD and AND instructions that you'd 
need if you didn't know where the i32 fell w.r.t. word boundaries.
> Also, i32 loads/stores not known to be 32-bit or 16-bit aligned are expanded
> to a call to a library function. This can be a big code size win as these
> operations would otherwise expand to a significant number of instructions.
Also very smart; my initial sketch results in 26 instructions for a 
worst-case i32 store.  Processors that omit subword ops also tend to 
have small i-caches, so the library function seems preferable.
> I'm not sure how this fits in with the changes you want to make. It
> does sound like the kind of thing that would be good to add to the target
> independent lowering code, but I suspect it won't help the XCore backend.
To oversimplify a bit, what I'd like to do is support a superset of what 
XCore does (I would say "CellSPU and XCore", but realistically I think
just tackling scalar types would be a good first step, and CellSPU is 
vector-centric), and allow the Target to tune the codegen behavior for 
certain types, certain alignments, certain subtargets, etc. to get the 
best performance.  I agree that the benefit to XCore would probably just 
be that the existing lowering code would go away, unless we can find 
some more cases that are currently handled by the libcall that might be 
more efficient to expand inline.

I'd like to allow a target to specify an action ('Legal', 'Expand', 
'Custom', and 'Libcall' (like XCore)) for (type, base pointer alignment, 
base+offset alignment) tuples, I think.  Targets could implement special 
lowerings that don't make sense to put in target-independent codegen, 
but I'd imagine you could handle most of the common cases in a 
target-independent way.
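
As a rough illustration of what such a hook might decide, here is a small C++ sketch. Everything in it is hypothetical: MemAction and chooseAction are made-up names rather than existing LLVM APIs, and the policy is just one plausible choice for a 32-bit target with only word-sized, word-aligned memory operations.

```cpp
#include <cassert>

// Hypothetical action set, mirroring the four options named above.
enum class MemAction { Legal, Expand, Custom, Libcall };

// Hypothetical per-target policy. Alignments are in bytes:
//   baseAlign   - known alignment of the base pointer
//   accessAlign - known alignment of base+offset
MemAction chooseAction(unsigned sizeBits, unsigned baseAlign, unsigned accessAlign) {
    assert(sizeBits == 8 || sizeBits == 16 || sizeBits == 32);
    const unsigned sizeBytes = sizeBits / 8;
    // A naturally aligned word access is directly supported.
    if (sizeBits == 32 && accessAlign >= 4) return MemAction::Legal;
    // A naturally aligned subword access fits inside one word, so an
    // inline read-modify-write expansion is cheap.
    if (sizeBits < 32 && accessAlign >= sizeBytes) return MemAction::Expand;
    // Assumption: with a word-aligned base and a constant offset, the
    // position within the word is known statically, so expand inline.
    if (baseAlign >= 4) return MemAction::Expand;
    // Otherwise nothing useful is known; fall back to a library call.
    return MemAction::Libcall;
}
```

A target with a special trick for some tuple could return MemAction::Custom there instead. Under this example policy, chooseAction(32, 1, 1) yields Libcall while chooseAction(32, 4, 1) yields Expand.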

The main thing that I think could make this feature very hard to do well 
is enforcing dependencies properly between loads and stores that happen 
to map onto the same word, even when they can be shown to be to 
different source-level variables.  I wonder if you'd end up having to 
insert a bunch of extra edges between all loads/stores you can't prove 
to be to different words.

> Regards,
>
> Richard
>
-Matt
