thr3ads.net - llvm dev - [LLVMdev] Replacing Platform Specific IR Codes with Generic Implementation and Introducing Macro Facilities [May 2014]

David Chisnall

2014-May-10 14:47 UTC

[LLVMdev] Replacing Platform Specific IR Codes with Generic Implementation and Introducing Macro Facilities

On 10 May 2014, at 13:53, Tim Northover <t.p.northover at gmail.com> wrote:
> It doesn't make sense for everything though, particularly if you want
> target-specific IR to simply not exist. What would you map ARM's
> "ldrex" to on x86? 
This isn't a great example.  Having load-linked / store-conditional in the
IR would make a number of transforms related to atomics easier.  We currently
can't correctly model the weak compare-and-exchange from the C[++]11 memory
model and we generate terrible code for a number of common atomic idioms on
non-x86 platforms as a result.

David

Tim Northover

2014-May-10 15:18 UTC

> This isn't a great example.  Having load-linked / store-conditional in the
> IR would make a number of transforms related to atomics easier.  We
> currently can't correctly model the weak compare-and-exchange from
> the C[++]11 memory model and we generate terrible code for a number
> of common atomic idioms on non-x86 platforms as a result.
Actually, I really agree there. I considered it recently, but decided
to leave it as an intrinsic for now (the new IR expansion pass happens
after most optimisations so there wouldn't be much benefit, but if we
did it earlier and the mid-end understood what an ldrex/strex meant, I
could see code getting much better).

Load linked would be fairly easy (perhaps even written as "load
linked", a minor extension to "load atomic"). Store conditional would
be a bigger change since stores don't return anything at the moment;
passes may not be expecting to have to ReplaceAllUses on them.

I'm hoping to have some more time to spend on atomics soon, after this
merge business is done. Perhaps then.

I don't suppose you have any plans to port Mips to the IR-level LL/SC
expansion? Now that the infrastructure is present it's quite a
simplification (r206490 in ARM64 for example, though you need existing
target-specific intrinsics at the moment). It would be good to iron
out any ARM-specific assumptions I've made.

But it would still be a construct that probably just couldn't be used
on x86 efficiently, not really a step towards a target independent IR.

Cheers.

Tim.

Suminda Dharmasena

2014-May-10 15:19 UTC

What I meant by a macro facility is something along these lines. I must
think more about it, though.

http://luajit.org/dynasm_features.html

David Chisnall

2014-May-10 16:38 UTC

On 10 May 2014, at 16:18, Tim Northover <t.p.northover at gmail.com> wrote:
> Actually, I really agree there. I considered it recently, but decided
> to leave it as an intrinsic for now (the new IR expansion pass happens
> after most optimisations so there wouldn't be much benefit, but if we
> did it earlier and the mid-end understood what an ldrex/strex meant, I
> could see code getting much better).
> 
> Load linked would be fairly easy (perhaps even written as "load
> linked", a minor extension to "load atomic"). Store conditional would
> be a bigger change since stores don't return anything at the moment;
> passes may not be expecting to have to ReplaceAllUses on them.
The easiest solution would be to extend the cmpxchg instruction with a weak
variant.  It is then trivial to map load, modify, weak-cmpxchg to load-linked,
modify, store-conditional (that is what weak cmpxchg was intended for in the
C[++]11 memory model).
> I'm hoping to have some more time to spend on atomics soon, after this
> merge business is done. Perhaps then.
> 
> I don't suppose you have any plans to port Mips to the IR-level LL/SC
> expansion? Now that the infrastructure is present it's quite a
> simplification (r206490 in ARM64 for example, though you need existing
> target-specific intrinsics at the moment). It would be good to iron
> out any ARM-specific assumptions I've made.
I'd rather avoid it, because doing it that late precludes a lot of
optimisations that we're interested in.  I'd much rather extend the IR
to support them at a generic level.

We have a couple of plans for variations of atomic operations in our
architecture, so we'll likely end up trying and throwing away a few
approaches over the next couple of years.
> But it would still be a construct that probably just couldn't be used
> on x86 efficiently, not really a step towards a target independent IR.
On x86, we could map weak cmpxchg to the same thing as a strong cmpxchg, so it
would still generate the same code.  The same is true for all architectures with
a non-blocking compare and exchange operation.
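
To make the point concrete in C11 terms: a weak compare-and-exchange is allowed to fail spuriously, but it is never required to, so an operation that only fails on a real mismatch already satisfies the weak contract. The sketch below is only an illustration of that reasoning, not anything LLVM emits; the wrapper name `cas_weak_via_strong` is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical wrapper (name invented for illustration): a "weak" CAS
 * implemented by forwarding to the strong one.  A weak CAS may fail
 * spuriously but does not have to, so the strong operation is a valid
 * implementation of it. */
bool cas_weak_via_strong(_Atomic int *obj, int *expected, int desired)
{
    /* On a real mismatch, *expected is updated with the current value,
     * exactly as the weak variant would do. */
    return atomic_compare_exchange_strong(obj, expected, desired);
}
```

A caller written against the weak contract already has a retry loop, so on a target with a native compare-and-exchange that loop simply never sees a spurious failure.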

David

Andrew Trick

2014-May-10 18:25 UTC

On May 10, 2014, at 7:47 AM, David Chisnall <David.Chisnall at cl.cam.ac.uk> wrote:
> On 10 May 2014, at 13:53, Tim Northover <t.p.northover at gmail.com> wrote:
> 
>> It doesn't make sense for everything though, particularly if you want
>> target-specific IR to simply not exist. What would you map ARM's
>> "ldrex" to on x86?
> 
> This isn't a great example.  Having load-linked / store-conditional in the
> IR would make a number of transforms related to atomics easier.  We currently
> can't correctly model the weak compare-and-exchange from the C[++]11 memory
> model and we generate terrible code for a number of common atomic idioms on
> non-x86 platforms as a result.
The IR is missing a weak variant of cmpxchg. But is there anything else missing
at IR level? My understanding was that LLVM’s atomic memory ordering constraints
are complete, but that codegen is not highly optimized, and maybe conservative
for some targets. Which idiom do you have trouble with on non-x86?

-Andy

David Chisnall

2014-May-10 18:35 UTC

On 10 May 2014, at 19:25, Andrew Trick <atrick at apple.com> wrote:
> The IR is missing a weak variant of cmpxchg. But is there anything else
> missing at IR level? My understanding was that LLVM’s atomic memory ordering
> constraints are complete, but that codegen is not highly optimized, and maybe
> conservative for some targets. Which idiom do you have trouble with on non-x86?
The example from our EuroLLVM talk was this:

_Atomic(int) a; a *= b;

This is (according to the spec) equivalent to this (simplified slightly):

 
    int expected = a;
    int desired;
    do {
      desired = expected * b;
    } while (!compare_swap_weak(&a, &expected, desired));
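
For reference, the same loop can be written against the real C11 `<stdatomic.h>` interface. This is a minimal sketch, assuming sequentially consistent ordering throughout (the default for the non-`_explicit` functions); the function name `atomic_mul` is made up for the example and is not part of any library.

```c
#include <stdatomic.h>

/* Hypothetical helper: atomically performs *a *= b and returns the
 * value that was stored. */
int atomic_mul(_Atomic int *a, int b)
{
    int expected = atomic_load(a);
    int desired;
    do {
        desired = expected * b;
        /* If the exchange fails (for a real mismatch or spuriously),
         * expected is overwritten with the current value of *a, so the
         * next iteration recomputes the product from fresh data. */
    } while (!atomic_compare_exchange_weak(a, &expected, desired));
    return desired;
}
```

Because the loop already retries, a spurious failure of the weak exchange costs one extra iteration and nothing else.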

What clang generates is almost this, but with a strong compare and swap:

define void @mul(i32* %a, i32 %b) #0 {
entry:
 %atomic-load = load atomic i32* %a seq_cst, align 4, !tbaa !1
 br label %atomic_op

atomic_op:                                        ; preds = %atomic_op, %entry
 %0 = phi i32 [ %atomic-load, %entry ], [ %1, %atomic_op ]
 %mul = mul nsw i32 %0, %b
 %1 = cmpxchg i32* %a, i32 %0, i32 %mul seq_cst
 %2 = icmp eq i32 %1, %0
 br i1 %2, label %atomic_cont, label %atomic_op

atomic_cont:                                      ; preds = %atomic_op
 ret void
}

This maps trivially to x86:

LBB0_1:
	movl	%ecx, %edx
	imull	%esi, %edx
	movl	%ecx, %eax
	lock
	cmpxchgl	%edx, (%rdi)
	cmpl	%ecx, %eax
	movl	%eax, %ecx
	jne	LBB0_1

For MIPS, what we *should* be generating is:

	sync 0            # Ensure all other loads / stores are globally visible
retry:
	ll   $t4, 0($a0)  # Load the current value of the atomic int
	mult $t4, $a1     # Multiply by the other argument
	mflo $t4          # Get the result
	sc   $t4, 0($a0)  # Try to write it back atomically
	beqz $t4, retry   # sc leaves 0 in $t4 on failure: try the whole thing again
	sync 0            # branch delay slot - ensure seqcst behaviour here

What we actually generate is this:

# BB#0:                                 # %entry
	daddiu	$sp, $sp, -16
	sd	$fp, 8($sp)             # 8-byte Folded Spill
	move	 $fp, $sp
	addiu	$3, $zero, 0
$BB0_1:                                 # %entry
                                       # =>This Inner Loop Header: Depth=1
	ll	$2, 0($4)
	bne	$2, $3, $BB0_3
	nop
# BB#2:                                 # %entry
                                       #   in Loop: Header=BB0_1 Depth=1
	addiu	$6, $zero, 0
	sc	$6, 0($4)
	beqz	$6, $BB0_1
	nop
$BB0_3:                                 # %entry
	sync 0
$BB0_4:                                 # %atomic_op
                                       # =>This Loop Header: Depth=1
                                       #     Child Loop BB0_5 Depth 2
	move	 $3, $2
	mul	$6, $3, $5
	sync 0
$BB0_5:                                 # %atomic_op
                                       #   Parent Loop BB0_4 Depth=1
                                       # =>  This Inner Loop Header: Depth=2
	ll	$2, 0($4)
	bne	$2, $3, $BB0_7
	nop
# BB#6:                                 # %atomic_op
                                       #   in Loop: Header=BB0_5 Depth=2
	move	 $7, $6
	sc	$7, 0($4)
	beqz	$7, $BB0_5
	nop
$BB0_7:                                 # %atomic_op
                                       #   in Loop: Header=BB0_4 Depth=1
	sync 0
	bne	$2, $3, $BB0_4
	nop
# BB#8:                                 # %atomic_cont
	move	 $sp, $fp
	ld	$fp, 8($sp)             # 8-byte Folded Reload
	jr	$ra
	daddiu	$sp, $sp, 16

For correctness, we *have* to implement the cmpxchg in the IR as a ll/sc loop,
and so we end up with a nested loop for something that is a single line in the
source.

The idiom of the weak compare and exchange loop is a fairly common one, but we
generate spectacularly bad code for it.
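
The shape of the problem can be reproduced at the source level. The sketch below is an illustration only: it assumes a target whose sole primitive is a weak exchange (standing in for ll/sc) and builds a strong exchange on top of it, which is roughly what the backend has to do for the IR `cmpxchg`. Both function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical: a strong CAS emulated with a weak one.  The inner loop
 * retries only while the failure was spurious, i.e. the value read
 * still equals the one the caller expected. */
static bool strong_cas_from_weak(_Atomic int *obj, int *expected, int desired)
{
    const int wanted = *expected;
    for (;;) {                                   /* inner loop */
        if (atomic_compare_exchange_weak(obj, expected, desired))
            return true;
        if (*expected != wanted)
            return false;                        /* real mismatch */
    }
}

/* Hypothetical: the a *= b loop written on top of the emulated strong
 * CAS, giving the nested-loop structure seen in the MIPS output. */
int atomic_mul_nested(_Atomic int *a, int b)
{
    int expected = atomic_load(a);
    int desired;
    do {                                         /* outer loop */
        desired = expected * b;
    } while (!strong_cas_from_weak(a, &expected, desired));
    return desired;
}
```

With a weak exchange available directly, the inner loop disappears and only the outer retry remains.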

David

Tim Northover

2014-May-10 18:35 UTC

> The IR is missing a weak variant of cmpxchg. But is there anything else
> missing at IR level? My understanding was that LLVM’s atomic memory
> ordering constraints are complete, but that codegen is not highly optimized,
> and maybe conservative for some targets. Which idiom do you have trouble
> with on non-x86?
For myself, I don't like the fact that LLVM's atomicrmw & cmpxchg
instructions are so beholden to C. With suitable constraints, an
"atomicrmw [](int x) { ... }" isn't unreasonable; but this can only be
mapped to a cmpxchg loop with the current IR.
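
As a rough source-level picture of that idea, here is a sketch of a read-modify-write with an arbitrary update function, expressed the only way C11 allows today: as a compare-and-exchange loop. The names `update_fn`, `atomic_update` and `clamp_increment` are invented for the example, and the update function is assumed to be pure and cheap, since it may run more than once.

```c
#include <stdatomic.h>

/* Hypothetical callback type: computes the new value from the old one. */
typedef int (*update_fn)(int);

/* Hypothetical generalised read-modify-write.  Returns the value that
 * was in *obj before the update, as atomicrmw does. */
int atomic_update(_Atomic int *obj, update_fn f)
{
    int expected = atomic_load(obj);
    int desired;
    do {
        desired = f(expected);   /* may be re-evaluated on each retry */
    } while (!atomic_compare_exchange_weak(obj, &expected, desired));
    return expected;
}

/* Example update with no built-in atomicrmw operation: increment,
 * saturating at 10. */
int clamp_increment(int x)
{
    return x < 10 ? x + 1 : x;
}
```

On an ll/sc target the body of `f` could in principle sit directly between the load-linked and the store-conditional, which is the case the current IR cannot express.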

Tim.
