thr3ads.net - similar to: "[LLVMdev] Attaching range metadata to IntrinsicInst"

[LLVMdev] Attaching range metadata to IntrinsicInst

2014 Jun 17

Chandler Carruth wrote:
> This seems fine to me, but I'd like to make sure it looks OK to Nick as
> well.

I strongly prefer baking in knowledge about the intrinsics themselves into the passes if possible. Metadata will always be secondary.

Separately, should value tracking use range metadata when it's available? Absolutely. I think it should apply to all CallInst not just

[LLVMdev] Attaching range metadata to IntrinsicInst

2014 Jun 17

On 17 June 2014 06:41, Eli Bendersky <eliben at google.com> wrote:
> On Tue, Jun 17, 2014 at 1:38 AM, Nick Lewycky <nicholas at mxc.ca> wrote:
>
>> Chandler Carruth wrote:
>>
>>> This seems fine to me, but I'd like to make sure it looks OK to Nick as
>>> well.
>>>
>>
>> I strongly prefer baking in knowledge about the

[LLVMdev] Attaching range metadata to IntrinsicInst

2014 Jun 17

Eh? How do you envision this?

-eric

On Tue, Jun 17, 2014 at 2:09 PM, Jingyue Wu <jingyue at google.com> wrote:
> Hi Nick,
>
> That makes sense. I think a main issue here is that the ranges of these PTX
> special registers (e.g., threadIdx.x) depend on -target-cpu which is only
> visible to clang and llc. Would you mind if we specify "target cpu" in the IR
> similar

[LLVMdev] Attaching range metadata to IntrinsicInst

2014 Jun 17

On Tue, Jun 17, 2014 at 2:33 PM, Jingyue Wu <jingyue at google.com> wrote:
> Hi Eric,
>
> In the IR, besides "target datalayout" and "target triple", we have a
> special "target cpu" string which is set by the Clang front-end according to
> its -target-cpu flag. We also write a Module::getTargetCPU() method to
> retrieve this string from the

[LLVMdev] Proposal: pragma for branch divergence

2015 Jan 24

Hi, I am considering a language extension to Clang for optimizing GPU programs. This extension will allow the compiler to use different optimization strategies for divergent and non-divergent branches (to be explained below). We have observed significant performance gain by leveraging this proposed extension, so I want to discuss it here to see how the community likes/dislikes the idea. I will

[LLVMdev] [cfe-dev] Proposal: pragma for branch divergence

2015 Jan 24

In our experience, as Owen also suggests, a pragma or a language extension can be avoided by a combination of static and dynamic analysis. We prefer this approach in our compiler ;)

Regards,
Vinod

On Sat, Jan 24, 2015 at 12:09 AM, Owen Anderson <resistor at mac.com> wrote:
> Hi Jingyue,
>
> Have you considered using dynamic uniformity checks? In my experience you
> can

[CUDA/NVPTX] is inlining __syncthreads allowed?

2015 Aug 21

I'm using 7.0. I am attaching the reduced example.

nvcc sync.cu -arch=sm_35 -ptx

gives

// .globl _Z3foov
.visible .entry _Z3foov(
)
{
        .reg .pred %p<2>;
        .reg .s32 %r<3>;

        mov.u32 %r1, %tid.x;
        and.b32 %r2, %r1, 1;
        setp.eq.b32 %p1, %r2, 1;
        @!%p1 bra BB7_2;
        bra.uni

Information about the number of indices in memory accesses

2020 Sep 23

Hi all,

For loads and stores I want to extract information about the number of indices accessed. For instance:

struct S {int X, int *Y};

__global__ void kernel(int *A, int **B, struct S) {
  int x = A[..][..]; // -> L: A[..][..]
  int y = *B[2];     // -> L: B[0][2]
  int z = S.y[..];   // -> L: S.1[..]
  // etc..
}

I am performing some preprocessing on IR to: 1. Move constant

Information about the number of indices in memory accesses

2020 Oct 03

Michael makes a great point about aliasing here and different indexing that accesses the same element! Another note: x = A[0][2] is fundamentally different depending on the type of `A`. If e.g. A was declared: int A[10][20], there's only _one_ load. A is (and is treated as) a linear buffer, and GEPs only pinpoint the specific position of A[0][2] in this buffer (i.e. 0*20 + 2). But if A was
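To make the linear-buffer point concrete, here is a minimal C++ sketch of the row-major offset for the int A[10][20] case. The names flat_index, matches_builtin_layout and self_check are hypothetical helpers for illustration and do not come from the thread. The row stride is the inner dimension (20), so A[0][2] lands at offset 0*20 + 2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: row-major offset (in elements) of [i][j]
// in a 2-D array whose inner dimension is `cols`.
constexpr std::size_t flat_index(std::size_t i, std::size_t j, std::size_t cols) {
    return i * cols + j;
}

// For int A[10][20] the row stride is 20, so A[0][2] is element 2
// of the underlying buffer and A[1][0] is element 20.
static_assert(flat_index(0, 2, 20) == 2, "A[0][2] is element 2");
static_assert(flat_index(1, 0, 20) == 20, "each row spans 20 elements");

// Compares the formula with the address the compiler computes for a
// real int[10][20]; byte distances are taken via uintptr_t.
inline bool matches_builtin_layout(std::size_t i, std::size_t j) {
    static int A[10][20] = {};
    const auto base = reinterpret_cast<std::uintptr_t>(&A[0][0]);
    const auto elem = reinterpret_cast<std::uintptr_t>(&A[i][j]);
    return elem - base == flat_index(i, j, 20) * sizeof(int);
}

// Small runtime check over the first and last interesting elements.
inline void self_check() {
    assert(matches_builtin_layout(0, 2));
    assert(matches_builtin_layout(9, 19));
}
```

Since the whole array is one contiguous object, a single address computation followed by a single load reaches the element, which is the "only one load" case described in the post.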

[CUDA/NVPTX] is inlining __syncthreads allowed?

2015 Aug 21

Hi Justin,

Is a compiler allowed to inline a function that calls __syncthreads? I saw nvcc does that, but not sure it's valid though. For example,

void foo() {
  __syncthreads();
}

if (threadIdx.x % 2 == 0) {
  ...
  foo();
} else {
  ...
  foo();
}

Before inlining, all threads meet at one __syncthreads(). After inlining

if (threadIdx.x % 2 == 0) {
  ...
  __syncthreads();
} else {
  ...

[LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

2014 Apr 19

Hi, We wrote an optimization that eliminates common sub-expressions in a group of similar GEPs for the NVPTX backend. It speeds up some of our benchmarks by up to 20%, which convinces us to try to upstream it. Here's a brief description of why we wrote this optimization, what we did, and how we did it. Loops in CUDA programs are often extensively unrolled by programmers and compilers,

instrumenting device code with gpucc

2016 Mar 15

Hi Jingyue,

Sorry to ask again, but how exactly could I glue the fatbin with the instrumented host code? Or does it mean we actually cannot instrument both the host & device code at the same time?

Thanks!
yuanfeng

On Tue, Mar 15, 2016 at 10:09 AM, Jingyue Wu <jingyue at google.com> wrote:
> Including fatbin into host code should be done in frontend.
>
> On Mon, Mar 14, 2016

instrumenting device code with gpucc

2016 Mar 13

Hey Jingyue, Thanks for being so responsive! I finally figured out a way to resolve the issue: all I have to do is to use `-only-needed` when merging the device bitcodes with llvm-link. However, since we actually need to instrument the host code as well, I encountered another issue when I tried to glue the instrumented host code and fatbin together. When I only instrumented the device code, I

instrumenting device code with gpucc

2016 Mar 12

Hey Jingyue,

Though I tried `opt -nvvm-reflect` on both bc files, the nvvm reflect anchor didn't go away; ptxas is still complaining about the duplicate definition of function '_ZL21__nvvm_reflect_anchorv'. Did I misuse the nvvm-reflect pass?

Thanks!
yuanfeng

On Fri, Mar 11, 2016 at 10:10 AM, Jingyue Wu <jingyue at google.com> wrote:
> According to the examples you

instrumenting device code with gpucc

2016 Mar 10

It's hard to tell what is wrong without a concrete example. E.g., what is the program you are instrumenting? What is the definition of the hook function? How did you link that definition with the binary? One thing suspicious to me is that you may have linked the definition of _Cool_MemRead_Hook as a host function instead of a device function. AFAIK, PTX assembly cannot be linked. So, if you

Information about the number of indices in memory accesses

2020 Oct 03

Hi Ees, SCEV Delinearization is the closest I know. But it has its problems. Well for one your expression should be SCEVable. But more importantly, SCEV Delinearization is trying to deduce something that is high-level (actually source-level) from a low-level IR in which a lot of this info has been lost. So, since there's not a 1-1 mapping from high-level code to LLVM IR, going backwards will

[LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

2014 Apr 21

Hi Hal,

Thanks for your comments! I'm inlining my responses below.

Jingyue

On Sat, Apr 19, 2014 at 6:38 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> Jingyue,
>
> I can't speak for the NVPTX backend, but I think this looks useful as an
> (optional) target-independent pass. A few thoughts:
>
> - Running GVN tends to be pretty expensive; have you tried EarlyCSE

[LLVMdev] [cfe-dev] Proposal: pragma for branch divergence

2015 Jan 25

Hi Owen and Vinod, Thanks for sharing the paper! I like the idea a lot. Regarding the paper itself, Vinod, are the consensual branches (e.g., cbranch.ifnone) you mentioned in the paper publicly available in PTX ISA? Owen, could you explain more on the approach of using branch-if-none instructions in your mind? I believe you have lots of great insights, but I don't see how cbranch.ifnone

instrumenting device code with gpucc

2016 Mar 05

On Fri, Mar 4, 2016 at 5:50 PM, Yuanfeng Peng <yuanfeng.jack.peng at gmail.com> wrote:
> Hi Jingyue,
>
> My name is Yuanfeng Peng, I'm a PhD student at UPenn. I'm sorry to bother
> you, but I'm having trouble with gpucc in my project, and I would be really
> grateful for your help!
>
> Currently we're trying to instrument CUDA code using LLVM 3.9, and

[LLVMdev] should InstCombine preserve @llvm.assume?

2015 Jun 10

Hi,

I have some WIP that leverages @llvm.assume in some optimization passes other than InstCombine. However, it doesn't work yet because InstCombine removes @llvm.assume calls that are useful for later optimizations. For example, given

define i32 @foo(i32 %a, i32 %b) {
  %sum = add i32 %a, %b
  %1 = icmp sge i32 %sum, 0
  call void @llvm.assume(i1 %1)
  ret i32 %sum
}

"opt
