Frank Winter via llvm-dev
2020-Apr-22 20:57 UTC
[llvm-dev] ROCm module from LLVM AMDGPU backend
Hi,

I'm trying to launch a GPU kernel that was compiled with the LLVM AMDGPU backend. Currently I'm having no success with it, and I was hoping someone tuned in here might have an idea. TensorFlow seems to do a similar thing, so I was reading the TensorFlow code on GitHub, and I believe the following setup is pretty close in the vital parts:

1) Compile an LLVM IR module (see below) with the AMDGPU backend to a 'module.o' file, using this triple/CPU:

   llvm::Triple TheTriple;
   TheTriple.setArch(llvm::Triple::ArchType::amdgcn);
   TheTriple.setVendor(llvm::Triple::VendorType::AMD);
   TheTriple.setOS(llvm::Triple::OSType::AMDHSA);
   std::string CPUStr("gfx906");

   LLVM IR passes that I use:
   TargetLibraryInfoWrapperPass
   TargetMachine->addPassesToEmitFile with CGFT_ObjectFile

2) Generate a shared lib with the LLVM linker via a 'system()' call:

   ld.lld -shared module.o -o module.so

3) Read this shared module back into a 'std::vector<uint8_t> shared'.

4) Load the module with HIP:

   hipModule_t module;
   ret = hipModuleLoadData(&module, shared.data());

   (this returns hipSuccess)

5) Try to get a HIP function:

   hipFunction_t kernel;
   ret = hipModuleGetFunction(&kernel, module, "kernel");

   ... and this fails with HIP error code 500!?

I believe the vital steps here concerning ROCm are similar (identical?) to what's in TensorFlow, but I can't get it to work. I have to admit that I did not build TensorFlow to see whether its AMD GPU bits actually work. I read the comments, and some say it comes with some performance overhead. Performance isn't the point at the moment - I'm working on a proof of concept. My test machine has an AMD gfx906 card installed.

Digging deeper: hipModule_t is a pointer to ihipModule_t, and printing out its values after loading the module gives

   ihip->fileName
   ihip->hash = 3943538976062281088
   ihip->kernargs.size() = 0
   ihip->executable.handle = 42041072

It's not telling me much. I'm not sure what to do with the handle for the executable.
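[Editorial note: step 3 above can be sketched as below. This is a minimal illustration, not Frank's actual code; the helper name readCodeObject is hypothetical. The one detail worth showing is that the file must be opened in binary mode, since a text-mode read can mangle the ELF bytes before they reach hipModuleLoadData().]

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>

// Sketch of step 3: read a linked code object back into memory so it can
// be handed to hipModuleLoadData(). std::ios::binary is essential here;
// without it, a text-mode read could corrupt the ELF image.
std::vector<uint8_t> readCodeObject(const char *path) {
    std::ifstream f(path, std::ios::binary);
    return std::vector<uint8_t>(std::istreambuf_iterator<char>(f),
                                std::istreambuf_iterator<char>());
}
```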
Any ideas what could be tried next?

Frank

--------------------------------------------------------------
LLVM IR module:

target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-ni:7"

define void @kernel(i1 %arg0, i32 %arg1, i32 %arg2, i32 %arg3, i1 %arg4, i32* %arg5, i1* %arg6, float* %arg7, float* %arg8, float* %arg9) {
entrypoint:
  %0 = sext i1 %arg4 to i32
  %1 = xor i32 -1, %0
  %2 = call i32 @llvm.amdgcn.workitem.id.x()
  %3 = icmp sge i32 %2, %arg1
  br i1 %3, label %L0, label %L1

L0:                                               ; preds = %entrypoint
  ret void

L1:                                               ; preds = %entrypoint
  %4 = trunc i32 %1 to i1
  br i1 %4, label %L3, label %L4

L2:                                               ; preds = %L6, %L5, %L4
  %5 = phi i32 [ %7, %L4 ], [ %8, %L5 ], [ %2, %L6 ]
  br i1 %arg0, label %L7, label %L8

L3:                                               ; preds = %L1
  br i1 %arg0, label %L5, label %L6

L4:                                               ; preds = %L1
  %6 = getelementptr i32, i32* %arg5, i32 %2
  %7 = load i32, i32* %6
  br label %L2

L5:                                               ; preds = %L3
  %8 = add nsw i32 %2, %arg2
  br label %L2

L6:                                               ; preds = %L3
  br label %L2

L7:                                               ; preds = %L2
  %9 = icmp sgt i32 %5, %arg3
  br i1 %9, label %L12, label %L13

L8:                                               ; preds = %L2
  %10 = getelementptr i1, i1* %arg6, i32 %5
  %11 = load i1, i1* %10
  %12 = sext i1 %11 to i32
  %13 = xor i32 -1, %12
  %14 = trunc i32 %13 to i1
  br i1 %14, label %L10, label %L11

L9:                                               ; preds = %L15, %L11
  %15 = add nsw i32 0, %5
  %16 = add nsw i32 0, %5
  %17 = getelementptr float, float* %arg8, i32 %16
  %18 = load float, float* %17
  %19 = add nsw i32 0, %5
  %20 = getelementptr float, float* %arg9, i32 %19
  %21 = load float, float* %20
  %22 = fmul float %18, %21
  %23 = getelementptr float, float* %arg7, i32 %15
  store float %22, float* %23
  ret void

L10:                                              ; preds = %L8
  ret void

L11:                                              ; preds = %L8
  br label %L9

L12:                                              ; preds = %L7
  ret void

L13:                                              ; preds = %L7
  %24 = icmp slt i32 %5, %arg2
  br i1 %24, label %L14, label %L15

L14:                                              ; preds = %L13
  ret void

L15:                                              ; preds = %L13
  br label %L9
}

; Function Attrs: nounwind readnone speculatable
declare i32 @llvm.amdgcn.workitem.id.x() #0

attributes #0 = { nounwind readnone speculatable }

------------------------------------------------------------------------------
The following is the assembly output the AMDGPU backend generates:

	.text
	.amdgcn_target "amdgcn-amd-amdhsa--gfx906"
	.globl	kernel
	.p2align	2
	.type	kernel,@function
kernel:
	s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	v_and_b32_e32 v4, 1, v4
	v_cmp_eq_u32_e64 s[4:5], 1, v4
	v_and_b32_e32 v0, 1, v0
	v_and_b32_e32 v4, 0x3ff, v15
	v_cmp_eq_u32_e32 vcc, 1, v0
	v_cmp_lt_i32_e64 s[6:7], v4, v1
	s_and_saveexec_b64 s[8:9], s[6:7]
	s_cbranch_execz BB0_16
BB0_1:
	s_and_saveexec_b64 s[6:7], s[4:5]
	s_xor_b64 s[6:7], exec, s[6:7]
	s_cbranch_execz BB0_3
BB0_2:
	v_lshlrev_b32_e32 v0, 2, v4
	v_add_co_u32_e64 v0, s[4:5], v5, v0
	v_addc_co_u32_e64 v1, s[4:5], 0, v6, s[4:5]
	flat_load_dword v0, v[0:1]
BB0_3:
	s_or_saveexec_b64 s[4:5], s[6:7]
	s_xor_b64 exec, exec, s[4:5]
	s_cbranch_execz BB0_7
BB0_4:
	s_xor_b64 s[6:7], vcc, -1
	s_waitcnt vmcnt(0) lgkmcnt(0)
	v_add_u32_e32 v0, v4, v2
	s_and_saveexec_b64 s[10:11], s[6:7]
	s_xor_b64 s[6:7], exec, s[10:11]
BB0_5:
	v_mov_b32_e32 v0, v4
BB0_6:
	s_or_b64 exec, exec, s[6:7]
BB0_7:
	s_or_b64 exec, exec, s[4:5]
	s_xor_b64 s[6:7], vcc, -1
	s_mov_b64 s[4:5], 0
	s_and_saveexec_b64 s[10:11], s[6:7]
	s_xor_b64 s[6:7], exec, s[10:11]
	s_cbranch_execz BB0_9
BB0_8:
	s_waitcnt vmcnt(0) lgkmcnt(0)
	v_ashrrev_i32_e32 v1, 31, v0
	v_add_co_u32_e32 v4, vcc, v7, v0
	v_addc_co_u32_e32 v5, vcc, v8, v1, vcc
	flat_load_ubyte v1, v[4:5]
	s_waitcnt vmcnt(0) lgkmcnt(0)
	v_and_b32_e32 v1, 1, v1
	v_cmp_eq_u32_e32 vcc, 1, v1
	s_and_b64 s[4:5], vcc, exec
BB0_9:
	s_or_saveexec_b64 s[6:7], s[6:7]
	s_xor_b64 exec, exec, s[6:7]
	s_cbranch_execz BB0_13
BB0_10:
	s_waitcnt vmcnt(0) lgkmcnt(0)
	v_cmp_le_i32_e32 vcc, v0, v3
	s_mov_b64 s[12:13], s[4:5]
	s_and_saveexec_b64 s[10:11], vcc
BB0_11:
	v_cmp_ge_i32_e32 vcc, v0, v2
	s_andn2_b64 s[12:13], s[4:5], exec
	s_and_b64 s[14:15], vcc, exec
	s_or_b64 s[12:13], s[12:13], s[14:15]
BB0_12:
	s_or_b64 exec, exec, s[10:11]
	s_andn2_b64 s[4:5], s[4:5], exec
	s_and_b64 s[10:11], s[12:13], exec
	s_or_b64 s[4:5], s[4:5], s[10:11]
BB0_13:
	s_or_b64 exec, exec, s[6:7]
	s_and_saveexec_b64 s[6:7], s[4:5]
	s_cbranch_execz BB0_15
BB0_14:
	s_waitcnt vmcnt(0) lgkmcnt(0)
	v_ashrrev_i32_e32 v1, 31, v0
	v_lshlrev_b64 v[0:1], 2, v[0:1]
	v_add_co_u32_e32 v2, vcc, v11, v0
	v_addc_co_u32_e32 v3, vcc, v12, v1, vcc
	flat_load_dword v4, v[2:3]
	v_add_co_u32_e32 v2, vcc, v13, v0
	v_addc_co_u32_e32 v3, vcc, v14, v1, vcc
	flat_load_dword v2, v[2:3]
	v_add_co_u32_e32 v0, vcc, v9, v0
	v_addc_co_u32_e32 v1, vcc, v10, v1, vcc
	s_waitcnt vmcnt(0) lgkmcnt(0)
	v_mul_f32_e32 v2, v4, v2
	flat_store_dword v[0:1], v2
BB0_15:
	s_or_b64 exec, exec, s[6:7]
BB0_16:
	s_or_b64 exec, exec, s[8:9]
	s_waitcnt vmcnt(0) lgkmcnt(0)
	s_setpc_b64 s[30:31]
.Lfunc_end0:
	.size	kernel, .Lfunc_end0-kernel
	.section	".note.GNU-stack"
	.amdgpu_metadata
---
amdhsa.kernels: []
amdhsa.version:
  - 1
  - 0
...
	.end_amdgpu_metadata

-----------------------------------------------------------------------
rocminfo output (Agents 1 and 2 are the host's Intel CPUs; Agents 3 - 6 look like this):

*******
Agent 3
*******
  Name:                    gfx906
  Marketing Name:          Vega 20
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          4096(0x1000)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
    L1:                    16(0x10) KB
  Chip ID:                 26273(0x66a1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1725
  BDFID:                   35328
  Internal Node ID:        2
  Compute Unit:            60
  SIMDs per CU:            4
  Shader Engines:          4
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges: 4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      FALSE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                      1024(0x400)
    y                      1024(0x400)
    z                      1024(0x400)
  Max Waves Per CU:        40(0x28)
  Max Work-item Per CU:    2560(0xa00)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                      4294967295(0xffffffff)
    y                      4294967295(0xffffffff)
    z                      4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:             GLOBAL; FLAGS: COARSE GRAINED
      Size:                33538048(0x1ffc000) KB
      Allocatable:         TRUE
      Alloc Granule:       4KB
      Alloc Alignment:     4KB
      Acessible by all:    FALSE
    Pool 2
      Segment:             GLOBAL; FLAGS: FINE GRAINED
      Size:                33538048(0x1ffc000) KB
      Allocatable:         TRUE
      Alloc Granule:       4KB
      Alloc Alignment:     4KB
      Acessible by all:    FALSE
    Pool 3
      Segment:             GROUP
      Size:                64(0x40) KB
      Allocatable:         FALSE
      Alloc Granule:       0KB
      Alloc Alignment:     0KB
      Acessible by all:    FALSE
  ISA Info:
    ISA 1
      Name:                amdgcn-amd-amdhsa--gfx906
      Machine Models:      HSA_MACHINE_MODEL_LARGE
      Profiles:            HSA_PROFILE_BASE
      Default Rounding Mode: NEAR
      Fast f16:            TRUE
      Workgroup Max Size:  1024(0x400)
      Workgroup Max Size per Dimension:
        x                  1024(0x400)
        y                  1024(0x400)
        z                  1024(0x400)
      Grid Max Size:       4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                  4294967295(0xffffffff)
        y                  4294967295(0xffffffff)
        z                  4294967295(0xffffffff)
      FBarrier Max Size:   32
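[Editorial note: the `amdhsa.kernels: []` line in the metadata above already shows that the backend emitted no kernel descriptor for this module. A way to check a built code object offline, before involving the HIP runtime, might be to dump its notes with LLVM's binutils. This is an illustrative fragment, assuming the tools are on PATH and that `module.so` is the file produced in step 2:]

```
# The AMDGPU metadata note of a loadable code object should list the
# kernel under amdhsa.kernels; an empty list means hipModuleGetFunction
# has nothing to find.
llvm-readelf --notes module.so

# Disassembly is another sanity check that the object targets gfx906:
llvm-objdump -d --mcpu=gfx906 module.so
```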
Arsenault, Matthew via llvm-dev
2020-Apr-22 21:12 UTC
[llvm-dev] ROCm module from LLVM AMDGPU backend
[AMD Official Use Only - Internal Distribution Only]

Your "@kernel" function isn't a kernel, it's the default C calling convention. You need to use the amdgpu_kernel calling convention.

-Matt
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
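[Editorial note: Matt's diagnosis is visible in Frank's own dumps: the function is lowered with a `s_setpc_b64 s[30:31]` return, i.e. as a callable device function, and the code object metadata records `amdhsa.kernels: []`, so the runtime has no kernel symbol to register. That also appears consistent with the error: HIP's numeric error codes largely mirror the CUDA driver API's, where 500 is "named symbol not found". A minimal sketch of the fix, in IR form (body elided, same signature as the module above):]

```llvm
; Declare the entry point with the amdgpu_kernel calling convention so the
; backend emits an HSA kernel descriptor and populates amdhsa.kernels in
; the code object metadata.
define amdgpu_kernel void @kernel(i1 %arg0, i32 %arg1, i32 %arg2, i32 %arg3,
                                  i1 %arg4, i32* %arg5, i1* %arg6,
                                  float* %arg7, float* %arg8, float* %arg9) {
  ; ... body unchanged ...
  ret void
}
```

[When building the module through the C++ API, the equivalent is a call to `F->setCallingConv(llvm::CallingConv::AMDGPU_KERNEL)` on the `llvm::Function` before running codegen.]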