search for: threadidx

Displaying 20 results from an estimated 22 matches for "threadidx".

2015 Aug 21
3
[CUDA/NVPTX] is inlining __syncthreads allowed?
Hi Justin, Is a compiler allowed to inline a function that calls __syncthreads? I saw nvcc does that, but not sure it's valid though. For example, void foo() { __syncthreads(); } if (threadIdx.x % 2 == 0) { ... foo(); } else { ... foo(); } Before inlining, all threads meet at one __syncthreads(). After inlining if (threadIdx.x % 2 == 0) { ... __syncthreads(); } else { ... __syncthreads(); } The __syncthreads call is duplicated, and it's no longer guaranteed that al...
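
A minimal CUDA sketch of the situation this thread describes (hypothetical kernel, same shape as the quoted example): after inlining, the single barrier becomes two barriers, one on each side of a divergent branch.

    __device__ void foo() {
      __syncthreads();              // before inlining: one barrier every thread reaches
    }

    __global__ void kernel() {
      if (threadIdx.x % 2 == 0) {
        foo();                      // after inlining: __syncthreads() on the even path
      } else {
        foo();                      // after inlining: a second __syncthreads() on the odd path
      }
      // With the call duplicated, threads in the block no longer meet at a
      // single textual barrier, which is what the poster is asking about.
    }
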
2014 Jun 16
3
[LLVMdev] Attaching range metadata to IntrinsicInst
...so that the frontend can annotate the range of the return value of an intrinsic call. e.g., %a = call i32 @llvm.xxx(), !range !0 !0 = metadata !{ i32 0, i32 1024 } The motivation behind this extension is some optimizations we are working on for CUDA programs. Some special registers in CUDA (e.g., threadIdx.x) are bounded per CUDA programming guide, and knowing their ranges can improve the precision of ValueTracking and benefit optimizations such as InstCombine. To implement this idea, we need ValueTracking to be aware of the ranges of these special variables. These special registers are so far read-...
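
A small CUDA illustration (hypothetical, not taken from the thread) of why the bound matters: blockDim.x is capped at 1024 on current hardware, so threadIdx.x is always below 1024, and a compiler that knows the range can fold the branch away.

    __global__ void kernel(int *out) {
      // threadIdx.x <= blockDim.x - 1 <= 1023, so this test is always true;
      // with the range attached to the tid intrinsic, InstCombine can drop it.
      if (threadIdx.x < 1024)
        out[threadIdx.x] = 1;
    }
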
2015 Aug 21
2
[CUDA/NVPTX] is inlining __syncthreads allowed?
...rote: > > > > Hi Justin, > > > > Is a compiler allowed to inline a function that calls __syncthreads? I > saw nvcc does that, but not sure it's valid though. For example, > > > > void foo() { > > __syncthreads(); > > } > > > > if (threadIdx.x % 2 == 0) { > > ... > > foo(); > > } else { > > ... > > foo(); > > } > > > > Before inlining, all threads meet at one __syncthreads(). After inlining > > > > if (threadIdx.x % 2 == 0) { > > ... > > __syncthreads()...
2014 Jun 17
4
[LLVMdev] Attaching range metadata to IntrinsicInst
...>> %a = call i32 @llvm.xxx(), !range !0 >>> !0 = metadata !{ i32 0, i32 1024 } >>> >>> The motivation behind this extension is some optimizations we are >>> working on for CUDA programs. Some special registers in CUDA (e.g., >>> threadIdx.x) are bounded per CUDA programming guide, and knowing >>> their ranges can improve the precision of ValueTracking and benefit >>> optimizations such as InstCombine. >>> >>> To implement this idea, we need ValueTracking to be aware of the >>>...
2014 Jun 17
5
[LLVMdev] Attaching range metadata to IntrinsicInst
...value of an intrinsic > call. e.g., > %a = call i32 @llvm.xxx(), !range !0 > !0 = metadata !{ i32 0, i32 1024 } > > The motivation behind this extension is some optimizations we are > working on for CUDA programs. Some special registers in CUDA (e.g., > threadIdx.x) are bounded per CUDA programming guide, and knowing > their ranges can improve the precision of ValueTracking and benefit > optimizations such as InstCombine. > > To implement this idea, we need ValueTracking to be aware of the > ranges of these special variables....
2014 Jun 17
2
[LLVMdev] Attaching range metadata to IntrinsicInst
Eh? How do you envision this? -eric On Tue, Jun 17, 2014 at 2:09 PM, Jingyue Wu <jingyue at google.com> wrote: > Hi Nick, > > That makes sense. I think a main issue here is that the ranges of these PTX > special registers (e.g., threadIdx.x) depend on -target-cpu which is only > visible to clang and llc. Would you mind we specify "target cpu" in the IR > similar to what we did for "target triple"? > > Thanks, > Jingyue > > > On Tue, Jun 17, 2014 at 12:19 PM, Nick Lewycky <nlewycky at go...
2015 Jan 24
2
[LLVMdev] Proposal: pragma for branch divergence
.... Therefore, if the code contains divergent branches (i.e., threads in a warp do not agree on which path of the branch to take), the warp has to execute all the paths from that branch with different subsets of threads enabled until they converge at a post-dominating BB of the paths. For example,
// threadIdx.x returns the index of a thread in the warp
if (threadIdx.x == 0) { foo(); } else { bar(); }
The warp that contains thread 0-31 needs to execute foo() with only thread 0 enabled and then bar() with the other 31 threads enabled. Therefore, the run time of the above code will be the run time of foo() +...
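
A hypothetical contrast, in the spirit of the quoted example, between a divergent branch and a warp-uniform branch (foo and bar are placeholders):

    __device__ void foo() {}
    __device__ void bar() {}

    __global__ void kernel() {
      // Divergent: thread 0 takes one side, threads 1-31 the other, so the
      // warp runs foo() and then bar(), paying for both paths.
      if (threadIdx.x == 0) foo(); else bar();

      // Uniform: every thread in a block (and thus in a warp) sees the same
      // blockIdx.x, so each warp executes only one of the two calls.
      if (blockIdx.x == 0) foo(); else bar();
    }
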
2014 Jun 17
3
[LLVMdev] Attaching range metadata to IntrinsicInst
...;> >> -eric >> >> On Tue, Jun 17, 2014 at 2:09 PM, Jingyue Wu <jingyue at google.com> wrote: >> > Hi Nick, >> > >> > That makes sense. I think a main issue here is that the ranges of these >> > PTX >> > special registers (e.g., threadIdx.x) depend on -target-cpu which is >> > only >> > visible to clang and llc. Would you mind we specify "target cpu" in the >> > IR >> > similar to what we did for "target triple"? >> > >> > Thanks, >> > Jingyue >> >...
2014 Apr 19
4
[LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs
...did it. Loops in CUDA programs are often extensively unrolled by programmers and compilers, leading to many similar GEPs for array accesses. e.g., a 2-level loop like __shared__ float a[32][32]; unroll for (int i = 0; i < 2; ++i) { unroll for (int j = 0; j < 2; ++j) { ... ... = a[threadIdx.x + i][threadIdx.y + j]; ... } } will be unrolled to: gep a, 0, tid.x, tid.y; load gep a, 0, tid.x, tid.y + 1; load gep a, 0, tid.x + 1, tid.y; load gep a, 0, tid.x + 1, tid.y + 1; load The NVPTX backend currently doesn't handle many similar multi-dimensional GEPs well enough. It emits...
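
A hand-written sketch of the effect the proposed optimization is after (hypothetical, simplified from the quoted 2x2 example): compute the common base address once and express each unrolled access as that base plus a small constant offset.

    __global__ void kernel(float *out) {
      __shared__ float a[32][32];
      // Hoist the part shared by all four unrolled addresses...
      float *base = &a[threadIdx.x][threadIdx.y];
      // ...and add constant offsets (one row of a[32][32] is 32 floats).
      float v00 = base[0];        // a[tid.x    ][tid.y    ]
      float v01 = base[1];        // a[tid.x    ][tid.y + 1]
      float v10 = base[32];       // a[tid.x + 1][tid.y    ]
      float v11 = base[32 + 1];   // a[tid.x + 1][tid.y + 1]
      out[threadIdx.x] = v00 + v01 + v10 + v11;
    }
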
2015 Jan 24
2
[LLVMdev] [cfe-dev] Proposal: pragma for branch divergence
...code contains divergent branches (i.e., > threads in a warp do not agree on which path of the branch to take), the > warp has to execute all the paths from that branch with different subsets > of threads enabled until they converge at a post-dominating BB of the > paths. For example, // threadIdx.x returns the index of a thread in the > warp if (threadIdx.x == 0) { foo();} else { bar();} The warp that contains > thread 0-31 needs to execute foo() with only thread 0 enabled and then > bar() with the other 31 threads enabled. Therefore, the run time of the > above code will be the...
2020 Sep 23
2
Information about the number of indices in memory accesses
...EP operand, explicitly create a (trivial) GEP with index 0 So now the operand of every load and store is a GEP instruction. For simple stuff i am getting the right answer but when the index expression becomes more complex multiple GEPs are introduced. For instance: *(A+2*(blockDim.x*blockIdx.x+threadIdx.x+1)+2+3) = 5; produces:   %6 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()   %7 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()   %8 = mul i32 %6, %7,   %9 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()   %10 = add i32 %8, %9,   %11 = add i32 %10, 1,   %12 = mul i32 2, %11,   %13 = zext i32 %1...
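
A hypothetical reconstruction of the quoted statement with the index factored into named temporaries; this is roughly the computation the nvvm ntid/ctaid/tid intrinsics and the mul/add instructions in the dump perform before feeding a single address:

    __global__ void kernel(int *A) {
      int i   = blockDim.x * blockIdx.x + threadIdx.x;   // global thread index
      int idx = 2 * (i + 1) + 2 + 3;                     // the full index expression, = 2*i + 7
      A[idx] = 5;                                        // one store through one computed address
    }
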
2007 May 02
0
FLAC on GPGPU
...) d += data[i] * data[i-lag]; autoc[lag] = d; } instead of looping through the "while" block sequentially, we launch a 'lag' number of threads, each computing autoc[lag] for the corresponding index on its own processor, the code looking something like this: for (i = threadIdx, d = 0.0; i < data_len; i++) d += data[i] * data[i-lag]; autoc[threadIdx] = d; This only goes to a certain amount of threads per block. In addition to this kind of parallelism, grids of blocks (which do not share memory, unlike threads within the same block) can be used to process seve...
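
A hypothetical CUDA kernel form of the parallelization sketched above: one thread per lag value, each accumulating its own sum (names are placeholders).

    __global__ void autocorrelation(const float *data, int data_len, float *autoc) {
      int lag = threadIdx.x;            // one lag value per thread
      float d = 0.0f;
      for (int i = lag; i < data_len; i++)
        d += data[i] * data[i - lag];   // same recurrence as the serial loop
      autoc[lag] = d;
    }
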
2014 Apr 21
2
[LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs
...ng to many similar > > GEPs for array accesses. > > > > > > e.g., a 2-level loop like > > > > > > __shared__ float a[32][32]; > > unroll for (int i = 0; i < 2; ++i) { > > unroll for (int j = 0; j < 2; ++j) { > > ... > > ... = a[threadIdx.x + i][threadIdx.y + j]; > > ... > > } > > } > > > > > > will be unrolled to: > > > > > > gep a, 0, tid.x, tid.y; load > > gep a, 0, tid.x, tid.y + 1; load > > gep a, 0, tid.x + 1, tid.y; load > > gep a, 0, tid.x + 1, tid.y + 1...
2018 Jun 21
2
NVPTX - Reordering load instructions
...levant source code consists of two nested loops that get fully unrolled, doing some basic arithmetic with values loaded from shared memory: > #define BLOCK_SIZE 16 > > __shared__ float dia[BLOCK_SIZE][BLOCK_SIZE]; > __shared__ float peri_col[BLOCK_SIZE][BLOCK_SIZE]; > > int idx = threadIdx.x - BLOCK_SIZE; > for (int i = 0; i < BLOCK_SIZE; i++) { > for (int j = 0; j < i; j++) > peri_col[idx][i] -= peri_col[idx][j] * dia[j][i]; > peri_col[idx][i] /= dia[i][i]; > } NVCC emits PTX instructions where all loads from shared memory are packed together: > ......
2013 Mar 18
2
[LLVMdev] UNREACHABLE executed! error while trying to generate PTX
...0x000000000040d131 main + 465 13 libc.so.6 0x00007f3855a7d4bd __libc_start_main + 253 14 llc 0x0000000000406e59 Stack dump: 0. Program arguments: llc nbody.kernel.ll -o nbody.kernel.ptx make: *** [nbody.kernel.ptx] Aborted I replaced the global indexes like for e.g threadIdx.x with __builtin_ptx_read_tid_x() and others. There are no problems in generating LLVM IR (i.e .ll). The error pops up while trying to generate PTX from the IR using llc. Any pointers on what might be going on here ? Will appreciate any help in going forward I have attached my program and obser...
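
A sketch of the substitution the poster mentions (the builtin name is taken from the post; newer clang toolchains spell it __nvvm_read_ptx_sreg_tid_x() instead):

    __global__ void kernel(int *out) {
      // threadIdx.x replaced by the clang PTX builtin, as described above
      int tid = __builtin_ptx_read_tid_x();
      out[tid] = tid;
    }
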
2015 Jan 25
2
[LLVMdev] [cfe-dev] Proposal: pragma for branch divergence
...vergent branches (i.e., >> threads in a warp do not agree on which path of the branch to take), the >> warp has to execute all the paths from that branch with different subsets >> of threads enabled until they converge at a post-dominating BB of the >> paths. For example, // threadIdx.x returns the index of a thread in the >> warp if (threadIdx.x == 0) { foo();} else { bar();} The warp that contains >> thread 0-31 needs to execute foo() with only thread 0 enabled and then >> bar() with the other 31 threads enabled. Therefore, the run time of the >> above...
2020 Oct 03
2
Information about the number of indices in memory accesses
...gt; So now the operand of every load and store is a GEP instruction. > > > > For simple stuff i am getting the right answer but when the index > > expression becomes more complex multiple GEPs are introduced. For > > instance: > > > > *(A+2*(blockDim.x*blockIdx.x+threadIdx.x+1)+2+3) = 5; > > > > produces: > > > > %6 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() > > %7 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() > > %8 = mul i32 %6, %7, > > %9 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() > > %10 = add i32 %8, %...
2020 Oct 03
2
Information about the number of indices in memory accesses
...s a GEP instruction. >>> > >>> > For simple stuff i am getting the right answer but when the index >>> > expression becomes more complex multiple GEPs are introduced. For >>> > instance: >>> > >>> > *(A+2*(blockDim.x*blockIdx.x+threadIdx.x+1)+2+3) = 5; >>> > >>> > produces: >>> > >>> > %6 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() >>> > %7 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() >>> > %8 = mul i32 %6, %7, >>> > %9 = call i32 @llvm.nv...
2012 Feb 23
0
[LLVMdev] Clang support for CUDA
Hi, I am trying to convert a simple CUDA program to LLVM IR using clang 3.0. The program is as follows, #include<stdio.h> #nclude<clang/test/SemaCUDA/cuda.h> __global__ void kernfunc(int *a) { *a=threadIdx.x+blockIdx.x*blockDim.x; } int main() { int *h_a,*d_a,n; n=sizeof(int); h_a=(int*)malloc(n); *h_a=5; cudaMalloc((void*)&d_a,n); cudaMemcpy(d_a,h_a,n,cudaMemcpyHostToDevice); kernelfunc<<<1,1>>>(d_a); cudaMemcpy(h_a,d_a,n,cudaMemcpyDeviceToHost); printf("%d",*h_a)...
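
A cleaned-up sketch of what the quoted program was presumably meant to be (assumptions: the "#nclude" slip and the kernfunc/kernelfunc mismatch are typos, and the standard CUDA runtime header stands in for the clang test header used in the post):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void kernelfunc(int *a) {
      *a = threadIdx.x + blockIdx.x * blockDim.x;
    }

    int main() {
      int *h_a, *d_a, n = sizeof(int);
      h_a = (int *)malloc(n);
      *h_a = 5;
      cudaMalloc((void **)&d_a, n);
      cudaMemcpy(d_a, h_a, n, cudaMemcpyHostToDevice);
      kernelfunc<<<1, 1>>>(d_a);
      cudaMemcpy(h_a, d_a, n, cudaMemcpyDeviceToHost);
      printf("%d", *h_a);
      return 0;
    }
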
2018 Jun 21
2
NVPTX - Reordering load instructions
...doing > > some basic arithmetic with values loaded from shared memory: > > > >> #define BLOCK_SIZE 16 > >> > >> __shared__ float dia[BLOCK_SIZE][BLOCK_SIZE]; > >> __shared__ float peri_col[BLOCK_SIZE][BLOCK_SIZE]; > >> > >> int idx = threadIdx.x - BLOCK_SIZE; > >> for (int i = 0; i < BLOCK_SIZE; i++) { > >> for (int j = 0; j < i; j++) > >> peri_col[idx][i] -= peri_col[idx][j] * dia[j][i]; > >> peri_col[idx][i] /= dia[i][i]; > >> } > > NVCC emits PTX instructions where all lo...