thr3ads.net - similar to: "[CUDA/NVPTX] is inlining _

Displaying 20 results from an estimated 1000 matches similar to: "[CUDA/NVPTX] is inlining __syncthreads allowed?"

[CUDA/NVPTX] is inlining __syncthreads allowed?

2015 Aug 21

[CUDA/NVPTX] is inlining __syncthreads allowed?

I'm using 7.0. I am attaching the reduced example. nvcc sync.cu -arch=sm_35 -ptx gives // .globl _Z3foov .visible .entry _Z3foov( ) { .reg .pred %p<2>; .reg .s32 %r<3>; mov.u32 %r1, %tid.x; and.b32 %r2, %r1, 1; setp.eq.b32 %p1, %r2, 1; @!%p1 bra BB7_2; bra.uni

[LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

2014 Apr 19

[LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

Hi, We wrote an optimization that eliminates common sub-expressions in a group of similar GEPs for the NVPTX backend. It speeds up some of our benchmarks by up to 20%, which convinces us to try to upstream it. Here's a brief description of why we wrote this optimization, what we did, and how we did it. Loops in CUDA programs are often extensively unrolled by programmers and compilers,

[LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

2014 Apr 21

[LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

Hi Hal, Thanks for your comments! I'm inlining my responses below. Jingyue On Sat, Apr 19, 2014 at 6:38 AM, Hal Finkel <hfinkel at anl.gov> wrote: > Jingyue, > > I can't speak for the NVPTX backend, but I think this looks useful as an > (optional) target-independent pass. A few thoughts: > > - Running GVN tends to be pretty expensive; have you tried EarlyCSE

NVPTX - Reordering load instructions

2018 Jun 21

NVPTX - Reordering load instructions

Hi all, I'm looking into the performance difference of a benchmark compiled with NVCC vs NVPTX (coming from Julia, not CUDA C) and I'm seeing a significant difference due to PTX instruction ordering. The relevant source code consists of two nested loops that get fully unrolled, doing some basic arithmetic with values loaded from shared memory: > #define BLOCK_SIZE 16 > >

[LLVMdev] CUDA front-end (CUDA to LLVM IR)

2015 Apr 08

[LLVMdev] CUDA front-end (CUDA to LLVM IR)

Hi, I wanted to ask whether there is ongoing effort (or an already established tool) that enables to convert CUDA kernels (that uses CUDA specific intrinsics, e.g., threadId.x, __syncthreads(), ...) to LLVM IR. I am aware that I can do this for OpenCL with the help of libclc but I can not find something similar for CUDA. Thanks -------------- next part -------------- An HTML attachment was

[LLVMdev] Attaching range metadata to IntrinsicInst

2014 Jun 16

[LLVMdev] Attaching range metadata to IntrinsicInst

Hi, The range metadata can only be attached to LoadInst for now. I am considering extending its usage to IntrinsicInst so that the frontend can annotate the range of the return value of an intrinsic call. e.g., %a = call i32 @llvm.xxx(), !range !0 !0 = metadata !{ i32 0, i23 1024 } The motivation behind this extension is some optimizations we are working on for CUDA programs. Some special

[LLVMdev] CUDA front-end (CUDA to LLVM IR)

2015 Apr 08

[LLVMdev] CUDA front-end (CUDA to LLVM IR)

On Wed, Apr 8, 2015 at 10:12 AM, Dmitry Mikushin <dmitry at kernelgen.org> wrote: > A tool of this kind here: https://github.com/apc-llc/nvcc-llvm-ir > > 2015-04-08 19:01 GMT+02:00 Ahmed ElTantawy <ahmede at ece.ubc.ca>: > >> Hi, >> >> I wanted to ask whether there is ongoing effort (or an already >> established tool) that enables to convert CUDA

[LLVMdev] Attaching range metadata to IntrinsicInst

2014 Jun 17

[LLVMdev] Attaching range metadata to IntrinsicInst

On Tue, Jun 17, 2014 at 2:33 PM, Jingyue Wu <jingyue at google.com> wrote: > Hi Eric, > > In the IR, besides "target datalayout" and "target triple", we have a > special "target cpu" string which is set by the Clang front-end according to > its -target-cpu flag. We also write a Module::getTargetCPU() method to > retrieve this string from the

[LLVMdev] Attaching range metadata to IntrinsicInst

2014 Jun 17

[LLVMdev] Attaching range metadata to IntrinsicInst

On 17 June 2014 06:41, Eli Bendersky <eliben at google.com> wrote: > On Tue, Jun 17, 2014 at 1:38 AM, Nick Lewycky <nicholas at mxc.ca> wrote: > >> Chandler Carruth wrote: >> >>> This seems fine to me, but I'd like to make sure it looks OK to Nick as >>> well. >>> >> >> I strongly prefer baking in knowledge about the

[LLVMdev] Attaching range metadata to IntrinsicInst

2014 Jun 17

[LLVMdev] Attaching range metadata to IntrinsicInst

Chandler Carruth wrote: > This seems fine to me, but I'd like to make sure it looks OK to Nick as > well. I strongly prefer baking in knowledge about the intrinsics themselves into the passes if possible. Metadata will always be secondary. Separately, should value tracking look use range metadata when it's available? Absolutely. I think it should apply to all CallInst not just

[LLVMdev] Attaching range metadata to IntrinsicInst

2014 Jun 17

[LLVMdev] Attaching range metadata to IntrinsicInst

Eh? How do you envision this? -eric On Tue, Jun 17, 2014 at 2:09 PM, Jingyue Wu <jingyue at google.com> wrote: > Hi Nick, > > That makes sense. I think a main issue here is that the ranges of these PTX > special registers (e.g., threadIdx.x) depend on -target-cpu which is only > visible to clang and llc. Would you mind we specify "target cpu" in the IR > similar

NVPTX - Reordering load instructions

2018 Jun 21

NVPTX - Reordering load instructions

We already have a pass that vectorizes loads and stores in nvptx and amdgpu. Not at my laptop, I forget the exact filename, but it's called load-store vectorizer. I think the question is, why is LSV not vectorizing this code? I think the answer is, llvm can't tell that the loads are aligned. Ptxas can, but only because it's (apparently) doing vectorization *after* it reesolves the

[LLVMdev] Proposal: pragma for branch divergence

2015 Jan 24

[LLVMdev] Proposal: pragma for branch divergence

*Hi, I am considering a language extension to Clang for optimizing GPU programs. This extension will allow the compiler to use different optimization strategies for divergent and non-divergent branches (to be explained below). We have observed significant performance gain by leveraging this proposed extension, so I want to discuss it here to see how the community likes/dislikes the idea. I will

[LLVMdev] [cfe-dev] Proposal: pragma for branch divergence

2015 Jan 24

[LLVMdev] [cfe-dev] Proposal: pragma for branch divergence

In our experience, as Owen also suggests, a pragma or a language extension can be avoided by a combination of static and dynamic analysis. We prefer this approach in our compiler ;) Regards, Vinod On Sat, Jan 24, 2015 at 12:09 AM, Owen Anderson <resistor at mac.com> wrote: > Hi Jingyue, > > Have you considered using dynamic uniformity checks? In my experience you > can

Separate compilation of CUDA code?

2017 Jun 14

Separate compilation of CUDA code?

Hi, I wonder whether the current version of LLVM supports separate compilation and linking of device code, i.e., is there a flag analogous to nvcc's --relocatable-device-code flag? If not, is there any plan to support this? Thanks! Yuanfeng Peng -------------- next part -------------- An HTML attachment was scrubbed... URL:

problem on compiling cuda program with clang++

2016 Oct 27

problem on compiling cuda program with clang++

Hi all, I compiled the *llvm3.9* source code on the *Nvidia TX1* board. And now I am following the document in the docs/CompileCudaWithLLVM.rst to compile cuda program with clang++. However, when I compile `axpy.cu` using `nvcc`, *nvcc* can generate the correct the binary; while compiling `axpy.cu` using clang++, the detailed command is `clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_53

Help needed using 3rd party C library/functions from within R (Nvidia CUDA)

2008 Nov 04

Help needed using 3rd party C library/functions from within R (Nvidia CUDA)

Hello, I'm trying to combine the parallel computing power available through NVIDIA CUDA (www.nvidia.com/cuda) from within R. CUDA is an extension to the C language, so I thought it would be possible to do this. If I have a C file with an empty function which includes a needed CUDA library (cutil.h) and compile this to an .so file using a NVIDIA compiler (nvcc), called 'myFunc.so' I

[LLVMdev] [NVPTX] Backend cannot handle array-of-arrays constant

2012 Sep 03

[LLVMdev] [NVPTX] Backend cannot handle array-of-arrays constant

Dear all, Looks like the NVPTX backend cannot handle array-of-arrays contant (please see the reporocase below). Is it supposed to work? Any ideas how to get it working? Important for our target applications. Thanks, - Dima. $ cat test.ll ; ModuleID = '__kernelgen_main_module' target datalayout =

[LLVMdev] [llvm-commits] [PATCH][RFC] NVPTX Backend

2012 May 01

[LLVMdev] [llvm-commits] [PATCH][RFC] NVPTX Backend

> -----Original Message----- > From: Dan Bailey [mailto:dan at dneg.com] > Sent: Sunday, April 29, 2012 8:46 AM > To: Justin Holewinski > Cc: Jim Grosbach; llvm-commits at cs.uiuc.edu; Vinod Grover; > llvmdev at cs.uiuc.edu > Subject: Re: [llvm-commits] [PATCH][RFC] NVPTX Backend > > Justin, > > Firstly, this is great! It seems to be so much further forward in

[LLVMdev] RFC: Convergent attribute

2015 Aug 14

[LLVMdev] RFC: Convergent attribute

Hi Jingyue, Convergent is not intended to prevent inlining. It’s tricky to formalize this inter-procedurally, but the intended interpretation is that a convergent operation cannot be move either into or out of a conditionally executed region. Normal inlining would not violate that. I would imagine that it would make sense to use a combination of convergent and noduplicate for barrier-like

similar to: [CUDA/NVPTX] is inlining __syncthreads allowed?