2013 Mar 20
2
[LLVMdev] UNREACHABLE executed! error while trying to generate PTX
...y.kernel.cpp:29:9: warning: '__constant__' macro redefined
#define __constant__ __attribute__((address_space(2)))
^
/opt/cuda/include/host_defines.h:183:9: note: previous definition is here
#define __constant__ \
^
1 warning generated.
Another question: what about extern __shared__?
I can see that the error goes away if I replace "extern __shared__ float4
sharedPos[]" with "__shared__ float4* sharedPos;". Do I have to dynamically
allocate the shared memory by specifying the size at kernel launch? If so, why
doesn't the second use of the same statement i...
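For reference, "__shared__ float4* sharedPos;" only declares a pointer that
itself lives in shared memory, without allocating any buffer, so it silences
the error without being equivalent. In standard CUDA, an unsized
extern __shared__ array is backed by the dynamic shared-memory pool, whose
byte size is the third kernel-launch parameter. A minimal sketch (the kernel
and variable names here are illustrative only):

// Unsized declaration: storage comes from the dynamic shared-memory pool.
extern __shared__ float4 sharedPos[];

__global__ void integrate(const float4 *pos, int n) {
    int i = threadIdx.x;
    if (i < n) sharedPos[i] = pos[i];   // stage positions in shared memory
    __syncthreads();
    // ... compute with sharedPos ...
}

// Host side: the third launch parameter sizes the dynamic pool, e.g.
// integrate<<<numBlocks, blockSize, blockSize * sizeof(float4)>>>(d_pos, n);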
2018 Jun 21
2
NVPTX - Reordering load instructions
...NVPTX (coming from Julia, not CUDA C) and I'm seeing a
significant difference due to PTX instruction ordering. The relevant
source code consists of two nested loops that get fully unrolled, doing
some basic arithmetic with values loaded from shared memory:
> #define BLOCK_SIZE 16
>
> __shared__ float dia[BLOCK_SIZE][BLOCK_SIZE];
> __shared__ float peri_col[BLOCK_SIZE][BLOCK_SIZE];
>
> int idx = threadIdx.x - BLOCK_SIZE;
> for (int i = 0; i < BLOCK_SIZE; i++) {
> for (int j = 0; j < i; j++)
> peri_col[idx][i] -= peri_col[idx][j] * dia[j][i];
> peri_col[idx...
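Schematically, the kind of ordering difference in question looks like the
following (illustrative pseudo-PTX, not output from this thread):

// Interleaved: each value is consumed right after its load, so every
// multiply waits on a fresh shared-memory access.
ld.shared.f32  %f1, [dia];
mul.f32        %f3, %f1, %f2;
ld.shared.f32  %f4, [dia+4];
mul.f32        %f6, %f4, %f5;

// Batched: the loads issue back to back and can overlap before the
// arithmetic consumes their results.
ld.shared.f32  %f1, [dia];
ld.shared.f32  %f4, [dia+4];
mul.f32        %f3, %f1, %f2;
mul.f32        %f6, %f4, %f5;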
2013 Mar 20
0
[LLVMdev] UNREACHABLE executed! error while trying to generate PTX
The problem you are seeing is that clang is putting the global variables
in address space 0, which in NVPTX means the generic address space. PTX
does not allow this, so the back-end *should* be printing an error for
you. Are you using trunk or 3.2?
Generally, clang won't be compatible with the CUDA Toolkit headers. If you
want to use the __constant__ modifier from CUDA in Clang, define
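Presumably that sentence continues with advice to define the qualifier
yourself. A minimal sketch, assuming the address-space numbering documented
for the NVPTX backend (1 = global, 3 = shared, 4 = constant); the numbering
is backend-specific, so check it against the LLVM version in use:

// Map the CUDA qualifiers straight onto NVPTX address spaces instead
// of pulling in the CUDA Toolkit headers.
#define __device__   __attribute__((address_space(1)))
#define __shared__   __attribute__((address_space(3)))
#define __constant__ __attribute__((address_space(4)))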
2013 Mar 18
2
[LLVMdev] UNREACHABLE executed! error while trying to generate PTX
Please find the .ll attached below. Yes, I am using the cuda_runtime.h from
the toolkit.
nbody.kernel.ll
<http://llvm.1065342.n5.nabble.com/file/n56048/nbody.kernel.ll>
- Uday
2013 Apr 20
1
[LLVMdev] Adding custom pragma's for generating particular LLVM IR
...callback functions are implemented in a third party library.
I guess it is quite close to what was implemented for the OpenMP
support, but I need something simpler.
Basically, how can I:
- Add a pragma that identifies a block (i.e., a scope) of code?
- Add an action that flags some variables as __shared__ in the annotated
block?
- Generate LLVM IR that adds specific callbacks at the start and end of
the block and on accesses to the shared variables?
I hope my explanation is sufficiently clear. Feel free to ask questions
if it's not.
Best regards,
Julien.
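To make the request concrete, here is a hypothetical sketch of what such a
pragma and its lowering could look like. The pragma spelling and the
callback names are invented placeholders, not an existing clang feature:

// Invented syntax: mark a block and flag variables inside it as shared.
#pragma mylib block shared(buf)
{
    buf[tid] += 1;
}

// Desired lowering, roughly: bracket the block with library callbacks
// and instrument each access to a flagged variable, e.g.
//   call void @mylib_block_enter()
//   call void @mylib_shared_store(i8* %buf.addr)
//   call void @mylib_block_exit()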
2014 Apr 19
4
[LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs
...ich convinces us to try to upstream it. Here's a brief
description of why we wrote this optimization, what we did, and how we did
it.
Loops in CUDA programs are often extensively unrolled by programmers and
compilers, leading to many similar
GEPs for array accesses.
e.g., a 2-level loop like
__shared__ float a[32][32];
unroll for (int i = 0; i < 2; ++i) {
unroll for (int j = 0; j < 2; ++j) {
...
... = a[threadIdx.x + i][threadIdx.y + j];
...
}
}
will be unrolled to:
gep a, 0, tid.x, tid.y; load
gep a, 0, tid.x, tid.y + 1; load
gep a, 0, tid.x + 1, tid.y; load
gep a, 0, tid...
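The snippet cuts off, but the rewrite such a pass aims for is presumably
along these lines (a sketch in the same shorthand, not the patch's actual
output):

gep0 = gep a, 0, tid.x, tid.y        ; common part computed once
load gep0
load gep0 + 1                        ; a[tid.x][tid.y + 1]
load gep0 + 32                       ; a[tid.x + 1][tid.y], row stride 32
load gep0 + 33                       ; a[tid.x + 1][tid.y + 1]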
2014 Apr 22
2
[LLVMdev] InstCombine strips the inBounds attribute in GetElementPtr ConstantExpr
I can't upload my program due to confidentiality, but the problem is
obvious.
At lib/Analysis/ConstantFolding.cpp:646
Constant *C = ConstantExpr::getGetElementPtr(Ops[0], NewIdxs);
if (ConstantExpr *CE = dyn_cast<ConstantExpr>(C)) {
if (Constant *Folded = ConstantFoldConstantExpression(CE, TD, TLI))
C = Folded;
}
The generated ConstantExpr C doesn't inherit the inBounds attribute.
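A sketch of the kind of fix this points toward: forward the inbounds flag
instead of dropping it. ConstantExpr::getGetElementPtr takes an optional
InBounds argument; "InnermostGEP" below is a placeholder for however the
enclosing code would track whether the original expression was inbounds:

// Preserve inbounds when re-creating the folded GEP.
Constant *C = ConstantExpr::getGetElementPtr(
    Ops[0], NewIdxs, /*InBounds=*/InnermostGEP->isInBounds());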