search for: __shared__

Displaying 10 results from an estimated 10 matches for "__shared__".

2013 Mar 20
2
[LLVMdev] UNREACHABLE executed! error while trying to generate PTX
...y.kernel.cpp:29:9: warning: '__constant__' macro redefined
  #define __constant__ __attribute__((address_space(2)))
          ^
/opt/cuda/include/host_defines.h:183:9: note: previous definition is here
  #define __constant__ \
          ^
1 warning generated.

Another question: what about extern __shared__? I can see that the error goes away if I replace "extern __shared__ float4 sharedPos[]" with "__shared__ float4* sharedPos;". Do I have to dynamically allocate the shared memory by specifying its size in the kernel launch? If so, why doesn't the second use of the same statement i...
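For context, a minimal sketch of the CUDA pattern this question is about (kernel and variable names here are illustrative, not the poster's code): an unsized extern __shared__ array receives its size from the third <<<...>>> launch parameter rather than from the compiler, whereas "__shared__ float4* sharedPos;" declares only a pointer that itself lives in shared memory, not a buffer.

  // Sketch: one dynamically sized shared array per block. The unsized
  // extern __shared__ declaration is backed by the number of bytes
  // passed as the third kernel-launch parameter.
  extern __shared__ float4 sharedPos[];

  __global__ void integrateBodies(const float4 *pos, int n) {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      if (tid < n)
          sharedPos[threadIdx.x] = pos[tid];   // stage into shared memory
      __syncthreads();
      // ... compute with sharedPos ...
  }

  // Launch: reserve blockDim.x * sizeof(float4) bytes of dynamic shared memory.
  // integrateBodies<<<grid, block, block.x * sizeof(float4)>>>(dPos, n);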
2013 Mar 20
0
[LLVMdev] UNREACHABLE executed! error while trying to generate PTX
...__' macro redefined > #define __constant__ __attribute__((address_space(2))) > ^ > /opt/cuda/include/host_defines.h:183:9: note: previous definition is here > #define __constant__ \ > ^ > 1 warning generated. > > Another question is > What about extern __shared__ ? > > I can see that the error goes away if I replace "extern __shared__ float4 > sharedPos[]" with "__shared__ float4* sharedPos;". Do I have to dynamically > allocate the shared memory by specifying size in kernel Launch? If so, why > doesn't the second use...
2018 Jun 21
2
NVPTX - Reordering load instructions
...NVPTX (coming from Julia, not CUDA C) and I'm seeing a significant difference due to PTX instruction ordering. The relevant source code consists of two nested loops that get fully unrolled, doing some basic arithmetic with values loaded from shared memory:

> #define BLOCK_SIZE 16
>
> __shared__ float dia[BLOCK_SIZE][BLOCK_SIZE];
> __shared__ float peri_col[BLOCK_SIZE][BLOCK_SIZE];
>
> int idx = threadIdx.x - BLOCK_SIZE;
> for (int i = 0; i < BLOCK_SIZE; i++) {
>   for (int j = 0; j < i; j++)
>     peri_col[idx][i] -= peri_col[idx][j] * dia[j][i];
>   peri_col[idx...
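Reassembled as a self-contained kernel for reproduction purposes. The excerpt truncates mid-statement, so the __global__ wrapper, the thread-count assumption, and the end of the loop body below are assumptions, not the poster's code:

  #define BLOCK_SIZE 16

  // Hypothetical wrapper; the excerpt shows only the kernel body.
  __global__ void periColFragment() {
      __shared__ float dia[BLOCK_SIZE][BLOCK_SIZE];
      __shared__ float peri_col[BLOCK_SIZE][BLOCK_SIZE];

      int idx = threadIdx.x - BLOCK_SIZE;   // as quoted; only meaningful if
                                            // the block runs 2*BLOCK_SIZE
                                            // threads (an assumption here)
      for (int i = 0; i < BLOCK_SIZE; i++) {
          for (int j = 0; j < i; j++)
              peri_col[idx][i] -= peri_col[idx][j] * dia[j][i];
          // the search result cuts off here; the remainder of the
          // loop body is not shown
      }
  }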
2013 Mar 20
0
[LLVMdev] UNREACHABLE executed! error while trying to generate PTX
The problem you are seeing is because clang is putting the global variables in address space 0, which in NVPTX means the generic address space. PTX does not allow this, so the back-end *should* be printing an error for you. Are you using trunk or 3.2? Generally, clang won't be compatible with the CUDA Toolkit headers. If you want to use the __constant__ modifier from CUDA in Clang, define
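The reply is cut off, but the kind of definition it points at is presumably one that pins the variable to a concrete NVPTX address space instead of the generic space 0. A hedged sketch, assuming the numbering documented for the NVPTX backend (1 = global, 3 = shared, 4 = constant); verify against the NVPTX docs for the LLVM version in use:

  // Assumed replacements for the CUDA Toolkit macros when compiling
  // device code with plain Clang targeting NVPTX:
  #define __constant__ __attribute__((address_space(4)))
  #define __shared__   __attribute__((address_space(3)))
  #define __device__   __attribute__((address_space(1)))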
2013 Mar 18
2
[LLVMdev] UNREACHABLE executed! error while trying to generate PTX
Please find the .ll attached below. Yes, I am using the cuda_runtime.h from the toolkit. nbody.kernel.ll <http://llvm.1065342.n5.nabble.com/file/n56048/nbody.kernel.ll> - Uday -- View this message in context: http://llvm.1065342.n5.nabble.com/UNREACHABLE-executed-error-while-trying-to-generate-PTX-tp56026p56048.html Sent from the LLVM - Dev mailing list archive at Nabble.com.
2013 Apr 20
1
[LLVMdev] Adding custom pragma's for generating particular LLVM IR
...callback functions are implemented in a third-party library. I guess it is quite close to what was implemented for the OpenMP support, but I need something simpler. Basically, how can I

- Add a pragma that identifies a block (i.e. scope) of code?
- Add an action that flags some variables as __shared__ in the annotated block?
- Generate LLVM IR that adds specific callbacks at the start and end of the block and on shared variables?

I hope my explanation is sufficiently clear. Feel free to ask questions if it's not. Best regards, Julien.
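One low-machinery route to the second and third bullets (a sketch, not what the thread settled on): Clang's annotate attribute survives into LLVM IR as @llvm.var.annotation / @llvm.global.annotations markers, which a custom pass can then rewrite into calls to the third-party callbacks. The SHARED macro and the "my_shared" string below are illustrative names:

  // Hedged sketch: tag variables so a later LLVM pass can find them.
  #define SHARED __attribute__((annotate("my_shared")))

  void annotated_block(void) {
      SHARED int counter = 0;   // lowered with an @llvm.var.annotation call
      // ... a custom pass can bracket this scope with the library's
      // start/end callbacks and instrument accesses to `counter` ...
  }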
2018 Jun 21
2
NVPTX - Reordering load instructions
...> significant difference due to PTX instruction ordering. The relevant
> > source code consists of two nested loops that get fully unrolled, doing
> > some basic arithmetic with values loaded from shared memory:
> >
> >> #define BLOCK_SIZE 16
> >>
> >> __shared__ float dia[BLOCK_SIZE][BLOCK_SIZE];
> >> __shared__ float peri_col[BLOCK_SIZE][BLOCK_SIZE];
> >>
> >> int idx = threadIdx.x - BLOCK_SIZE;
> >> for (int i = 0; i < BLOCK_SIZE; i++) {
> >>   for (int j = 0; j < i; j++)
> >>     peri_col[idx][...
2014 Apr 19
4
[LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs
...ich convinces us to try to upstream it. Here's a brief description of why we wrote this optimization, what we did, and how we did it. Loops in CUDA programs are often extensively unrolled by programmers and compilers, leading to many similar GEPs for array accesses. e.g., a 2-level loop like

  __shared__ float a[32][32];
  unroll for (int i = 0; i < 2; ++i) {
    unroll for (int j = 0; j < 2; ++j) {
      ...
      ... = a[threadIdx.x + i][threadIdx.y + j];
      ...
    }
  }

will be unrolled to:

  gep a, 0, tid.x, tid.y; load
  gep a, 0, tid.x, tid.y + 1; load
  gep a, 0, tid.x + 1, tid.y; load
  gep a, 0, tid...
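At source level, the effect of the proposed transformation can be pictured as hoisting one base address and leaving only compile-time constant offsets, so common-subexpression elimination has something to share (a sketch using the excerpt's 32x32 array; variable names are illustrative):

  // Before: each unrolled access recomputes the full index expression.
  // After: one shared base plus constant element offsets.
  float *base = &a[threadIdx.x][threadIdx.y];   // common subexpression
  float v00 = base[0];    // a[tid.x    ][tid.y    ]
  float v01 = base[1];    // a[tid.x    ][tid.y + 1]
  float v10 = base[32];   // a[tid.x + 1][tid.y    ]  (row stride = 32 floats)
  float v11 = base[33];   // a[tid.x + 1][tid.y + 1]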
2014 Apr 21
2
[LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs
...hat we did, and how we did it.
> >
> > Loops in CUDA programs are often extensively unrolled by programmers
> > and compilers, leading to many similar
> > GEPs for array accesses.
> >
> > e.g., a 2-level loop like
> >
> > __shared__ float a[32][32];
> > unroll for (int i = 0; i < 2; ++i) {
> >   unroll for (int j = 0; j < 2; ++j) {
> >     ...
> >     ... = a[threadIdx.x + i][threadIdx.y + j];
> >     ...
> >   }
> > }
> >
> > will be unrolled to:
> > ...
2014 Apr 22
2
[LLVMdev] InstCombine strips the inBounds attribute in GetElementPtr ConstantExpr
I can't upload my program due to confidentiality, but the problem is obvious. At lib/Analysis/ConstantFolding.cpp:646

  Constant *C = ConstantExpr::getGetElementPtr(Ops[0], NewIdxs);
  if (ConstantExpr *CE = dyn_cast<ConstantExpr>(C)) {
    if (Constant *Folded = ConstantFoldConstantExpression(CE, TD, TLI))
      C = Folded;
  }

The generated ConstantExpr C doesn't inherit the
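A hedged sketch of the kind of fix implied here, using the getGetElementPtr overload of that era that takes an inbounds flag; whether the expression being folded is reachable as shown is an assumption about the surrounding code, and `InnerGEP` is an assumed local name:

  // Re-create the folded GEP without dropping inbounds: copy the flag
  // from the original GEP expression instead of defaulting it to false.
  bool InBounds = cast<GEPOperator>(InnerGEP)->isInBounds();
  Constant *C = ConstantExpr::getGetElementPtr(Ops[0], NewIdxs, InBounds);
  if (ConstantExpr *CE = dyn_cast<ConstantExpr>(C)) {
    if (Constant *Folded = ConstantFoldConstantExpression(CE, TD, TLI))
      C = Folded;
  }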