Hi Tom, Matt,

I'm working on a project that needs a few coherent atomic operations
(HSA mode: load, store, compare-and-swap) for std::atomic_uint in HCC.

The attached patch implements atomic compare-and-swap for SI+
(untested). I tried to stay within what was already available, but
there are a few issues I was unsure how to address:

1.) It currently uses v2i32 for both input and output. This needlessly
clobbers a register, but I haven't found a way to express 'output = the
first subreg of input' or 'output = input' in pattern constraints.

2.) I think these instructions will need the SLC bit set to work
correctly on HSA targets, and I'm not sure of the best way to do that.
I considered:
 * adding it as an operand to the newly introduced node
   (AMDGPUISD::CMP_SWAP) and setting it to 0/1 in the lowering pass.
   Can this be done without changing the order of the inputs? SLC is
   always the last operand, but the operand count is variable.
 * introducing HSA variants of the instructions with the SLC bit set
 * setting the bit in a DAG combine (is there a way to know whether I'm
   combining an atomic op, or do I need to split the SelectMUBUF
   functions?)

3.) Depending on 2), atomic load/store can either be (preferably) a
MUBUF load/store with GLC and SLC set, or hacked to be swp_noret
(store) or cmp_swap 0,0 (load); a sketch of the latter follows this
message.

thanks,
--
Jan Vesely <jan.vesely at rutgers.edu>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-AMDGPU-SI-Implement-atomic-compare-and-swap.patch
Type: text/x-patch
Size: 12265 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160325/6f91d7ab/attachment.bin>
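[A rough illustration of the cmp_swap 0,0 route from question 3: an
atomic load lowered through the AMDGPUISD::CMP_SWAP node the patch
introduces. This is an untested sketch, not part of the attached patch;
the hook name lowerAtomicLoadAsCmpSwap is invented for illustration.]

  // Untested sketch: lower an atomic load as cmp_swap(ptr, 0, 0).
  // With src == cmp == 0 the operation stores 0 only when the location
  // already holds 0, so memory never changes and the returned old
  // value is exactly the loaded value.
  SDValue SITargetLowering::lowerAtomicLoadAsCmpSwap(SDValue Op,
                                                     SelectionDAG &DAG) const {
    SDLoc DL(Op);
    AtomicSDNode *Load = cast<AtomicSDNode>(Op);
    EVT MemVT = Load->getMemoryVT();
    EVT VecVT = EVT::getVectorVT(*DAG.getContext(), MemVT, 2);

    // Both halves of the data pair (src and cmp) are zero.
    SDValue Zero = DAG.getConstant(0, DL, MemVT);
    SDValue Val = DAG.getNode(ISD::BUILD_VECTOR, DL, VecVT, Zero, Zero);

    SDValue Ops[] = {Load->getChain(), Load->getBasePtr(), Val};
    SDVTList VTList = DAG.getVTList(MVT::v2i32, MVT::Other);
    SDValue CmpSwap = DAG.getMemIntrinsicNode(AMDGPUISD::CMP_SWAP, DL,
                                              VTList, Ops, MemVT,
                                              Load->getMemOperand());

    // The old (loaded) value comes back in the first element of the pair.
    SDValue Old = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i32,
                              CmpSwap.getValue(0),
                              DAG.getConstant(0, DL, MVT::i32));
    return DAG.getMergeValues({Old, CmpSwap.getValue(1)}, DL);
  }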
On Fri, Mar 25, 2016 at 02:22:11PM -0400, Jan Vesely wrote:
> Hi Tom, Matt,
>
> I'm working on a project that needs a few coherent atomic operations
> (HSA mode: load, store, compare-and-swap) for std::atomic_uint in HCC.
>
> The attached patch implements atomic compare-and-swap for SI+
> (untested). I tried to stay within what was already available, but
> there are a few issues I was unsure how to address:
>
> 1.) It currently uses v2i32 for both input and output. This
> needlessly clobbers a register, but I haven't found a way to express
> 'output = the first subreg of input' or 'output = input' in pattern
> constraints.
>
> 2.) I think these instructions will need the SLC bit set to work
> correctly on HSA targets, and I'm not sure of the best way to do
> that. I considered:
>  * adding it as an operand to the newly introduced node
>    (AMDGPUISD::CMP_SWAP) and setting it to 0/1 in the lowering pass.
>    Can this be done without changing the order of the inputs? SLC is
>    always the last operand, but the operand count is variable.
>  * introducing HSA variants of the instructions with the SLC bit set
>  * setting the bit in a DAG combine (is there a way to know whether
>    I'm combining an atomic op, or do I need to split the SelectMUBUF
>    functions?)
>
> 3.) Depending on 2), atomic load/store can either be (preferably) a
> MUBUF load/store with GLC and SLC set, or hacked to be swp_noret
> (store) or cmp_swap 0,0 (load).

Have you seen this patch: http://reviews.llvm.org/D17280

-Tom

> thanks,
> --
> Jan Vesely <jan.vesely at rutgers.edu>

> From ffd2c28fb68e958682dc60090a2b1fb669897423 Mon Sep 17 00:00:00 2001
> From: Jan Vesely <jan.vesely at rutgers.edu>
> Date: Tue, 22 Mar 2016 13:54:55 -0400
> Subject: [PATCH 1/1] AMDGPU,SI: Implement atomic compare and swap
>
> ---
>  lib/Target/AMDGPU/AMDGPUISelLowering.cpp |  1 +
>  lib/Target/AMDGPU/AMDGPUISelLowering.h   |  1 +
>  lib/Target/AMDGPU/CIInstructions.td      |  3 +-
>  lib/Target/AMDGPU/SIISelLowering.cpp     | 41 +++++++++++++++
>  lib/Target/AMDGPU/SIISelLowering.h       |  1 +
>  lib/Target/AMDGPU/SIInstrInfo.td         |  9 ++++
>  lib/Target/AMDGPU/SIInstructions.td      |  2 +-
>  test/CodeGen/AMDGPU/global_atomics.ll    | 89 ++++++++++++++++++++++++++++++++
>  8 files changed, 145 insertions(+), 2 deletions(-)
>
> diff --git a/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
> index 0820898..6c5b1ed 100644
> --- a/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
> +++ b/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
> @@ -2782,6 +2782,7 @@ const char* AMDGPUTargetLowering::getTargetNodeName(unsigned Opcode) const {
>    NODE_NAME_CASE(INTERP_P2)
>    NODE_NAME_CASE(STORE_MSKOR)
>    NODE_NAME_CASE(TBUFFER_STORE_FORMAT)
> +  NODE_NAME_CASE(CMP_SWAP)
>    case AMDGPUISD::LAST_AMDGPU_ISD_NUMBER: break;
>    }
>    return nullptr;
> diff --git a/lib/Target/AMDGPU/AMDGPUISelLowering.h b/lib/Target/AMDGPU/AMDGPUISelLowering.h
> index 4627f34..fce8e76 100644
> --- a/lib/Target/AMDGPU/AMDGPUISelLowering.h
> +++ b/lib/Target/AMDGPU/AMDGPUISelLowering.h
> @@ -313,6 +313,7 @@ enum NodeType : unsigned {
>    STORE_MSKOR,
>    LOAD_CONSTANT,
>    TBUFFER_STORE_FORMAT,
> +  CMP_SWAP,
>    LAST_AMDGPU_ISD_NUMBER
>  };
>
> diff --git a/lib/Target/AMDGPU/CIInstructions.td b/lib/Target/AMDGPU/CIInstructions.td
> index 593300f..d99b013 100644
> --- a/lib/Target/AMDGPU/CIInstructions.td
> +++ b/lib/Target/AMDGPU/CIInstructions.td
> @@ -156,7 +156,7 @@ defm FLAT_ATOMIC_SWAP : FLAT_ATOMIC <
>    flat<0x30, 0x40>, "flat_atomic_swap", VGPR_32
>  >;
>  defm FLAT_ATOMIC_CMPSWAP : FLAT_ATOMIC <
> -  flat<0x31, 0x41>, "flat_atomic_cmpswap", VGPR_32, VReg_64
> +  flat<0x31, 0x41>, "flat_atomic_cmpswap", VReg_64
>  >;
>  defm FLAT_ATOMIC_ADD : FLAT_ATOMIC <
>    flat<0x32, 0x42>, "flat_atomic_add", VGPR_32
> @@ -322,6 +322,7 @@ def : FlatAtomicPat <FLAT_ATOMIC_SMIN_RTN, atomic_min_global, i32>;
>  def : FlatAtomicPat <FLAT_ATOMIC_UMIN_RTN, atomic_umin_global, i32>;
>  def : FlatAtomicPat <FLAT_ATOMIC_OR_RTN, atomic_or_global, i32>;
>  def : FlatAtomicPat <FLAT_ATOMIC_SWAP_RTN, atomic_swap_global, i32>;
> +def : FlatAtomicPat <FLAT_ATOMIC_CMPSWAP_RTN, SIcmp_swap, v2i32>;
>  def : FlatAtomicPat <FLAT_ATOMIC_XOR_RTN, atomic_xor_global, i32>;
>
>  } // End Predicates = [isCIVI]
> diff --git a/lib/Target/AMDGPU/SIISelLowering.cpp b/lib/Target/AMDGPU/SIISelLowering.cpp
> index a8f6ed5..4b6d118 100644
> --- a/lib/Target/AMDGPU/SIISelLowering.cpp
> +++ b/lib/Target/AMDGPU/SIISelLowering.cpp
> @@ -263,6 +263,12 @@ SITargetLowering::SITargetLowering(TargetMachine &TM,
>    setOperationAction(ISD::FDIV, MVT::f32, Custom);
>    setOperationAction(ISD::FDIV, MVT::f64, Custom);
>
> +  // GCN CMP_SWAP needs input marshalling, and output demarshalling
> +  setOperationAction(ISD::ATOMIC_CMP_SWAP, MVT::i32, Custom);
> +  // We can't return success/failure, only the old value,
> +  // let LLVM add the comparison
> +  setOperationAction(ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS, MVT::i32, Expand);
> +
>    setTargetDAGCombine(ISD::FADD);
>    setTargetDAGCombine(ISD::FSUB);
>    setTargetDAGCombine(ISD::FMINNUM);
> @@ -1168,6 +1174,7 @@ SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
>      SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
>      return LowerGlobalAddress(MFI, Op, DAG);
>    }
> +  case ISD::ATOMIC_CMP_SWAP: return LowerATOMIC_CMP_SWAP(Op, DAG);
>    case ISD::INTRINSIC_WO_CHAIN: return LowerINTRINSIC_WO_CHAIN(Op, DAG);
>    case ISD::INTRINSIC_VOID: return LowerINTRINSIC_VOID(Op, DAG);
>    }
> @@ -1680,6 +1687,40 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
>    }
>  }
>
> +SDValue SITargetLowering::LowerATOMIC_CMP_SWAP(SDValue Op, SelectionDAG &DAG) const
> +{
> +  SDLoc DL(Op);
> +
> +  AtomicSDNode *Swap = cast<AtomicSDNode>(Op);
> +  assert(Swap && Swap->isCompareAndSwap());
> +
> +  EVT MemVT = Swap->getMemoryVT();
> +  EVT ValVT = EVT::getVectorVT(*DAG.getContext(), MemVT, 2);
> +  SDValue Cmp = Op.getOperand(2);
> +  SDValue New = Op.getOperand(3);
> +
> +  // "src comes from the first data-vgpr, cmp from the second."
> +  SDValue Val = DAG.getNode(ISD::BUILD_VECTOR, DL, ValVT, New, Cmp);
> +
> +  SDValue Ops[] = {Swap->getChain(), Swap->getBasePtr(), Val};
> +  ArrayRef<EVT> VTs = { MVT::v2i32, MVT::Other };
> +  SDVTList VTList = { VTs.data(), VTs.size() };
> +
> +  SDValue CmpSwap = DAG.getMemIntrinsicNode(AMDGPUISD::CMP_SWAP, DL, VTList,
> +                                            Ops, MemVT, Swap->getMemOperand());
> +  // TODO: What about ordering? synchscope?
> +
> +  // Extract returned old value
> +  SDValue Zero = DAG.getConstant(0, DL, MVT::i32);
> +  SDValue Old = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i32,
> +                            CmpSwap.getValue(0), Zero);
> +
> +  // Merge return value and Chain
> +  SDValue Ret = DAG.getNode(ISD::MERGE_VALUES, DL, Op->getVTList(),
> +                            Old, CmpSwap.getValue(1));
> +  return Ret;
> +}
> +
>  SDValue SITargetLowering::LowerLOAD(SDValue Op, SelectionDAG &DAG) const {
>    SDLoc DL(Op);
>    LoadSDNode *Load = cast<LoadSDNode>(Op);
> diff --git a/lib/Target/AMDGPU/SIISelLowering.h b/lib/Target/AMDGPU/SIISelLowering.h
> index 842fbeb..2f87de1 100644
> --- a/lib/Target/AMDGPU/SIISelLowering.h
> +++ b/lib/Target/AMDGPU/SIISelLowering.h
> @@ -29,6 +29,7 @@ class SITargetLowering final : public AMDGPUTargetLowering {
>    SDValue lowerImplicitZextParam(SelectionDAG &DAG, SDValue Op,
>                                   MVT VT, unsigned Offset) const;
>
> +  SDValue LowerATOMIC_CMP_SWAP(SDValue Op, SelectionDAG &DAG) const;
>    SDValue LowerINTRINSIC_WO_CHAIN(SDValue Op, SelectionDAG &DAG) const;
>    SDValue LowerINTRINSIC_VOID(SDValue Op, SelectionDAG &DAG) const;
>    SDValue LowerFrameIndex(SDValue Op, SelectionDAG &DAG) const;
> diff --git a/lib/Target/AMDGPU/SIInstrInfo.td b/lib/Target/AMDGPU/SIInstrInfo.td
> index 37ee35c..eb0e6ce 100644
> --- a/lib/Target/AMDGPU/SIInstrInfo.td
> +++ b/lib/Target/AMDGPU/SIInstrInfo.td
> @@ -114,6 +114,15 @@ def SItbuffer_store : SDNode<"AMDGPUISD::TBUFFER_STORE_FORMAT",
>    [SDNPMayStore, SDNPMemOperand, SDNPHasChain]
>  >;
>
> +def SIcmp_swap : SDNode<"AMDGPUISD::CMP_SWAP",
> +  SDTypeProfile<1, 2,
> +    [SDTCisVT<0, v2i32>, // return value (should be i32)
> +     SDTCisVT<1, i64>,   // address
> +     SDTCisVT<2, v2i32>  // src followed by cmp
> +    ]>,
> +  [SDNPMayLoad, SDNPMayStore, SDNPMemOperand, SDNPHasChain]
> +>;
> +
>  def SIload_input : SDNode<"AMDGPUISD::LOAD_INPUT",
>    SDTypeProfile<1, 3, [SDTCisVT<0, v4f32>, SDTCisVT<1, v4i32>, SDTCisVT<2, i16>,
>                         SDTCisVT<3, i32>]>
> diff --git a/lib/Target/AMDGPU/SIInstructions.td b/lib/Target/AMDGPU/SIInstructions.td
> index 2f7f908..77bb8ca 100644
> --- a/lib/Target/AMDGPU/SIInstructions.td
> +++ b/lib/Target/AMDGPU/SIInstructions.td
> @@ -1012,7 +1012,7 @@ defm BUFFER_ATOMIC_SWAP : MUBUF_Atomic <
>    mubuf<0x30, 0x40>, "buffer_atomic_swap", VGPR_32, i32, atomic_swap_global
>  >;
>  defm BUFFER_ATOMIC_CMPSWAP : MUBUF_Atomic <
> -  mubuf<0x31, 0x41>, "buffer_atomic_cmpswap", VReg_64, v2i32, null_frag
> +  mubuf<0x31, 0x41>, "buffer_atomic_cmpswap", VReg_64, v2i32, SIcmp_swap
>  >;
>  defm BUFFER_ATOMIC_ADD : MUBUF_Atomic <
>    mubuf<0x32, 0x42>, "buffer_atomic_add", VGPR_32, i32, atomic_add_global
> diff --git a/test/CodeGen/AMDGPU/global_atomics.ll b/test/CodeGen/AMDGPU/global_atomics.ll
> index a92ee89..340e808 100644
> --- a/test/CodeGen/AMDGPU/global_atomics.ll
> +++ b/test/CodeGen/AMDGPU/global_atomics.ll
> @@ -758,6 +758,95 @@ entry:
>    ret void
>  }
>
> +; CMP_SWAP
> +
> +; FUNC-LABEL: {{^}}atomic_cmpxchg_i32_offset:
> +; GCN: buffer_atomic_cmpswap v[{{[0-9]+}}:{{[0-9]+}}], s[{{[0-9]+}}:{{[0-9]+}}], 0 offset:16{{$}}
> +define void @atomic_cmpxchg_i32_offset(i32 addrspace(1)* %out, i32 %in, i32 %old) {
> +entry:
> +  %gep = getelementptr i32, i32 addrspace(1)* %out, i32 4
> +  %0 = cmpxchg volatile i32 addrspace(1)* %gep, i32 %old, i32 %in seq_cst seq_cst
> +  ret void
> +}
> +
> +; FUNC-LABEL: {{^}}atomic_cmpxchg_i32_ret_offset:
> +; GCN: buffer_atomic_cmpswap v{{\[}}[[RET:[0-9]+]]{{:[0-9]+}}], s[{{[0-9]+}}:{{[0-9]+}}], 0 offset:16 glc{{$}}
> +; GCN: buffer_store_dword v[[RET]]
> +define void @atomic_cmpxchg_i32_ret_offset(i32 addrspace(1)* %out, i32 addrspace(1)* %out2, i32 %in, i32 %old) {
> +entry:
> +  %gep = getelementptr i32, i32 addrspace(1)* %out, i32 4
> +  %0 = cmpxchg volatile i32 addrspace(1)* %gep, i32 %old, i32 %in seq_cst seq_cst
> +  %1 = extractvalue { i32, i1 } %0, 0
> +  store i32 %1, i32 addrspace(1)* %out2
> +  ret void
> +}
> +
> +; FUNC-LABEL: {{^}}atomic_cmpxchg_i32_addr64_offset:
> +; SI: buffer_atomic_cmpswap v[{{[0-9]+\:[0-9]+}}], v[{{[0-9]+}}:{{[0-9]+}}], s[{{[0-9]+}}:{{[0-9]+}}], 0 addr64 offset:16{{$}}
> +define void @atomic_cmpxchg_i32_addr64_offset(i32 addrspace(1)* %out, i32 %in, i64 %index, i32 %old) {
> +entry:
> +  %ptr = getelementptr i32, i32 addrspace(1)* %out, i64 %index
> +  %gep = getelementptr i32, i32 addrspace(1)* %ptr, i32 4
> +  %0 = cmpxchg volatile i32 addrspace(1)* %gep, i32 %old, i32 %in seq_cst seq_cst
> +  ret void
> +}
> +
> +; FUNC-LABEL: {{^}}atomic_cmpxchg_i32_ret_addr64_offset:
> +; SI: buffer_atomic_cmpswap v{{\[}}[[RET:[0-9]+]]:{{[0-9]+}}], v[{{[0-9]+}}:{{[0-9]+}}], s[{{[0-9]+}}:{{[0-9]+}}], 0 addr64 offset:16 glc{{$}}
> +; VI: flat_atomic_cmpswap v{{\[}}[[RET:[0-9]+]]:{{[0-9]+}}], v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}] glc{{$}}
> +; GCN: buffer_store_dword v[[RET]]
> +define void @atomic_cmpxchg_i32_ret_addr64_offset(i32 addrspace(1)* %out, i32 addrspace(1)* %out2, i32 %in, i64 %index, i32 %old) {
> +entry:
> +  %ptr = getelementptr i32, i32 addrspace(1)* %out, i64 %index
> +  %gep = getelementptr i32, i32 addrspace(1)* %ptr, i32 4
> +  %0 = cmpxchg volatile i32 addrspace(1)* %gep, i32 %old, i32 %in seq_cst seq_cst
> +  %1 = extractvalue { i32, i1 } %0, 0
> +  store i32 %1, i32 addrspace(1)* %out2
> +  ret void
> +}
> +
> +; FUNC-LABEL: {{^}}atomic_cmpxchg_i32:
> +; GCN: buffer_atomic_cmpswap v[{{[0-9]+:[0-9]+}}], s[{{[0-9]+}}:{{[0-9]+}}], 0{{$}}
> +define void @atomic_cmpxchg_i32(i32 addrspace(1)* %out, i32 %in, i32 %old) {
> +entry:
> +  %0 = cmpxchg volatile i32 addrspace(1)* %out, i32 %old, i32 %in seq_cst seq_cst
> +  ret void
> +}
> +
> +; FUNC-LABEL: {{^}}atomic_cmpxchg_i32_ret:
> +; GCN: buffer_atomic_cmpswap v{{\[}}[[RET:[0-9]+]]:{{[0-9]+}}], s[{{[0-9]+}}:{{[0-9]+}}], 0 glc
> +; GCN: buffer_store_dword v[[RET]]
> +define void @atomic_cmpxchg_i32_ret(i32 addrspace(1)* %out, i32 addrspace(1)* %out2, i32 %in, i32 %old) {
> +entry:
> +  %0 = cmpxchg volatile i32 addrspace(1)* %out, i32 %old, i32 %in seq_cst seq_cst
> +  %1 = extractvalue { i32, i1 } %0, 0
> +  store i32 %1, i32 addrspace(1)* %out2
> +  ret void
> +}
> +
> +; FUNC-LABEL: {{^}}atomic_cmpxchg_i32_addr64:
> +; SI: buffer_atomic_cmpswap v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+}}:{{[0-9]+}}], s[{{[0-9]+}}:{{[0-9]+}}], 0 addr64{{$}}
> +; VI: flat_atomic_cmpswap v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}]{{$}}
> +define void @atomic_cmpxchg_i32_addr64(i32 addrspace(1)* %out, i32 %in, i64 %index, i32 %old) {
> +entry:
> +  %ptr = getelementptr i32, i32 addrspace(1)* %out, i64 %index
> +  %0 = cmpxchg volatile i32 addrspace(1)* %ptr, i32 %old, i32 %in seq_cst seq_cst
> +  ret void
> +}
> +
> +; FUNC-LABEL: {{^}}atomic_cmpxchg_i32_ret_addr64:
> +; SI: buffer_atomic_cmpswap v{{\[}}[[RET:[0-9]+]]:{{[0-9]+}}], v[{{[0-9]+}}:{{[0-9]+}}], s[{{[0-9]+}}:{{[0-9]+}}], 0 addr64 glc{{$}}
> +; VI: flat_atomic_cmpswap v{{\[}}[[RET:[0-9]+]]:{{[0-9]+}}], v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}] glc{{$}}
> +; GCN: buffer_store_dword v[[RET]]
> +define void @atomic_cmpxchg_i32_ret_addr64(i32 addrspace(1)* %out, i32 addrspace(1)* %out2, i32 %in, i64 %index, i32 %old) {
> +entry:
> +  %ptr = getelementptr i32, i32 addrspace(1)* %out, i64 %index
> +  %0 = cmpxchg volatile i32 addrspace(1)* %ptr, i32 %old, i32 %in seq_cst seq_cst
> +  %1 = extractvalue { i32, i1 } %0, 0
> +  store i32 %1, i32 addrspace(1)* %out2
> +  ret void
> +}
> +
>  ; FUNC-LABEL: {{^}}atomic_xor_i32_offset:
>  ; GCN: buffer_atomic_xor v{{[0-9]+}}, s[{{[0-9]+}}:{{[0-9]+}}], 0 offset:16{{$}}
>  define void @atomic_xor_i32_offset(i32 addrspace(1)* %out, i32 %in) {
> --
> 2.5.5
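[One shape the first option from question 2 could take inside the
patch's LowerATOMIC_CMP_SWAP: append SLC as a trailing target constant
so the existing inputs keep their order. This is an untested sketch,
not part of the patch; it assumes the MUBUF/FLAT patterns can be
taught to bind the extra operand to their slc field, which is exactly
the open question, and isAmdHsaOS() may not be the right subtarget
predicate.]

  // Untested sketch: SLC as an explicit trailing operand on the node,
  // chosen during lowering. Chain, address and data keep their current
  // positions, so only the patterns need to learn about the new operand.
  bool NeedSLC = Subtarget->isAmdHsaOS();  // assumed predicate
  SDValue SLC = DAG.getTargetConstant(NeedSLC ? 1 : 0, DL, MVT::i1);
  SDValue Ops[] = {Swap->getChain(), Swap->getBasePtr(), Val, SLC};
  SDValue CmpSwap = DAG.getMemIntrinsicNode(AMDGPUISD::CMP_SWAP, DL,
                                            VTList, Ops, MemVT,
                                            Swap->getMemOperand());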
On Mon, 2016-03-28 at 06:54 -0700, Tom Stellard wrote:
> On Fri, Mar 25, 2016 at 02:22:11PM -0400, Jan Vesely wrote:
> > Hi Tom, Matt,
> >
> > I'm working on a project that needs a few coherent atomic
> > operations (HSA mode: load, store, compare-and-swap) for
> > std::atomic_uint in HCC.
> >
> > [questions 1-3 trimmed; quoted in full above]
>
> Have you seen this patch: http://reviews.llvm.org/D17280

Thanks, I subscribed to that one. It answers the first question (it
can't be done), but it does not address the SLC problem. Anyway, I
moved my questions there.

thanks,
Jan

> -Tom
>
> [quoted patch trimmed; reproduced in full above]

--
Jan Vesely <jan.vesely at rutgers.edu>
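[On the DAG-combine part of question 2, whether an atomic op can be
recognized during selection: a MemSDNode carries its atomic ordering,
so a check along these lines might avoid splitting the SelectMUBUF
functions. Untested sketch; isAtomicMemNode is an invented helper, not
an existing API.]

  // Untested sketch: anything ordered more strongly than NotAtomic came
  // from an atomic operation, so a shared SelectMUBUF path could force
  // SLC for exactly those nodes instead of growing per-atomic variants.
  static bool isAtomicMemNode(const SDNode *N) {
    if (const MemSDNode *M = dyn_cast<MemSDNode>(N))
      return M->getOrdering() != AtomicOrdering::NotAtomic;
    return false;
  }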