similar to: [LLVMdev] loop vectorizer and storing to uniform addresses

Displaying 20 results from an estimated 1000 matches similar to: "[LLVMdev] loop vectorizer and storing to uniform addresses"

2013 Nov 08
0
[LLVMdev] loop vectorizer and storing to uniform addresses
On 7 November 2013 17:18, Frank Winter <fwinter at jlab.org> wrote: > LV: We don't allow storing to uniform addresses > This is triggering because the vectorizer didn't recognize sum[q] as a reduction variable during canVectorizeInstrs(), but did recognize that sum[q] is loop invariant in canVectorizeMemory(). I'm guessing the nested loop was unrolled because of the low trip-count, and
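For context, a minimal sketch of the pattern under discussion (reconstructed from the thread, not the poster's exact source): the inner loop accumulates into sum[q]; once that low-trip-count loop is unrolled, each store to sum[q] targets an address that is invariant in the remaining i-loop, and if it is not recognized as a reduction the vectorizer bails out with the message quoted above.

    float foo(long start, long end, float *A) {
      /* sum[q] is the accumulator; its address does not depend on i */
      float sum[4] = {0.0f, 0.0f, 0.0f, 0.0f};
      for (long i = start; i < end; ++i)
        for (int q = 0; q < 4; ++q)        /* low trip count, likely unrolled */
          sum[q] += A[i * 4 + q];
      return sum[0] + sum[1] + sum[2] + sum[3];
    }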
2013 Nov 08
1
[LLVMdev] loop vectorizer and storing to uniform addresses
I changed the input C to use a 64-bit type for the loop index (this eliminates 'sext' instructions in the IR). Here is the IR produced with clang -O0:
  define float @foo(i64 %start, i64 %end, float* %A) #0 {
  entry:
    %start.addr = alloca i64, align 8
    %end.addr = alloca i64, align 8
    %A.addr = alloca float*, align 8
    %sum = alloca [4 x float], align 16
    %i = alloca i64, align 8
2012 Apr 23
0
[LLVMdev] SIV tests in LoopDependence Analysis, Sanjoy's patch
Hi, When I write various test cases and explore how they're handled by the code in LoopDependenceAnalysis::analysePair, I'm surprised. This loop collects pairs of subscripts from the source and destination refs:
  // Collect GEP operand pairs (FIXME: use GetGEPOperands from BasicAA), adding
  // trailing zeroes to the smaller GEP, if needed.
  GEPOpdsTy destOpds, srcOpds;
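For readers unfamiliar with the terminology, a hypothetical example (function name invented) of the kind of subscript pair an SIV (single-index-variable) test analyzes:

    /* Both references subscript A with an affine function of the same
     * induction variable, so a strong SIV test can compute the dependence
     * distance (here: a flow dependence of distance 1). */
    void siv_example(int *A, int n) {
      for (int i = 1; i < n; ++i)
        A[i] = A[i - 1] + 1;
    }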
2012 Apr 12
6
[LLVMdev] SIV tests in LoopDependence Analysis, Sanjoy's patch
Hi, Here is a preliminary (monolithic) version you can comment on. This is still buggy, however, and I'll be testing for and fixing bugs over the next few days. I've used your version of the strong siv test. Thanks! -- Sanjoy Das. http://playingwithpointers.com
2014 Sep 18
2
[LLVMdev] [Vectorization] Mis match in code generated
Hi Nadav, Thanks for the quick reply! OK, so as of now we lack the capability to handle flat large reductions. I did go through the function vectorizeChainsInBlock() (line number 2862). In this function, we try to vectorize if we have phi nodes in the IR (several ifs check for phi nodes), i.e. we try to construct a tree that starts at chains. Any pointers on how to join multiple trees? I
2014 Sep 19
3
[LLVMdev] [Vectorization] Mis match in code generated
Hi Arnold, Thanks for your reply. I tried the test case as suggested by you:
  void foo(int *a, int *sum) { *sum = a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7]+a[8]+a[9]+a[10]+a[11]+a[12]+a[13]+a[14]+a[15]; }
so that it has a 'store' in its IR. IR before vectorization:
  target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
  target triple =
2014 Sep 18
2
[LLVMdev] [Vectorization] Mis match in code generated
Hi, I am trying to understand the LLVM vectorization implementation and was looking into both loop and SLP vectorization. Test case 1:
  int foo(int *a) {
    int sum = 0, i;
    for (i = 0; i < 16; i++)
      sum += a[i];
    return sum;
  }
This code is vectorized by the loop vectorizer, where we calculate the scalar loop cost as 4 and the vector loop cost as 2. Since the vector loop cost is lower and the above reduction is legal to
2014 Nov 10
2
[LLVMdev] [Vectorization] Mis match in code generated
Hi Suyog, Thanks for looking at this. This has recently got itself onto my TODO list too. > I am not sure how much all this will improve the code quality for horizontal reduction > (donno how frequently such pattern of horizontal reduction from same array occurs in real world/SPECS). Actually the main loop of 470.lbm can be SLP vectorized like this. We have three parts to it: A fully
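To illustrate what a "horizontal reduction from the same array" means here, a hypothetical sketch (names invented) contrasting the flat chain being discussed with a pairwise, tree-shaped form that maps naturally onto vector lanes:

    int flat_reduction(const int *a) {
      /* one long left-to-right chain: ((((a[0]+a[1])+a[2])+a[3])+ ... ) */
      return a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + a[6] + a[7];
    }

    int tree_reduction(const int *a) {
      /* independent partial sums, easier to match as vector lanes */
      int s01 = a[0] + a[1], s23 = a[2] + a[3];
      int s45 = a[4] + a[5], s67 = a[6] + a[7];
      return (s01 + s23) + (s45 + s67);
    }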
2013 Aug 16
2
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
Hi Sebpop, Thanks for your explanation. I noticed that Polly eventually runs the SROA pass to transform these load/store instructions into scalar operations. Is it possible to run such a pass before the polly-dependence analysis? Star Tan At 2013-08-15 21:12:53, "Sebastian Pop" <sebpop at gmail.com> wrote: >Codeprepare and independent blocks are introducing these loads and
2013 Aug 15
0
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
Codeprepare and independent blocks are introducing these loads and stores. These are prepasses that polly runs prior to building the dependence graph to transform scalar dependences into data dependences. Ether was working on eliminating the rewrite of scalar dependences. On Thu, Aug 15, 2013 at 5:32 AM, Star Tan <tanmx_star at yeah.net> wrote: > Hi all, > > I have investigated the
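A hypothetical illustration of the kind of scalar dependence being described (not taken from the benchmark in question): the scalar t carries a value between statements in the loop body, and demoting it to a stack slot turns that scalar dependence into loads and stores that the polyhedral dependence analysis can model as ordinary data dependences.

    void kernel(double *A, double *B, int n) {
      for (int i = 0; i < n; ++i) {
        double t = A[i] * 2.0;   /* scalar definition */
        B[i] = t + 1.0;          /* scalar use: a scalar dependence on t */
      }
    }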
2013 Aug 15
4
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
Hi all, I have investigated the 6X extra compile-time overhead when Polly compiles the simple nestedloop benchmark in LLVM-testsuite (http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28). Preliminary results show that this compile-time overhead results from the complicated polly-dependence analysis. However, the key seems to be the polly-prepare pass, which introduces
2013 Aug 16
0
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
I do not think that running SROA before polly is a good idea: it would defeat the purpose of the code preparation passes that polly intentionally schedules for the data dependence analysis. If you remove the data references before polly runs, you would miss them in the dependence graph: that could lead to incorrect transforms. On Thu, Aug 15, 2013 at 7:28 PM, Star Tan <tanmx_star at
2015 Apr 24
5
[LLVMdev] Loss of precision with very large branch weights
In PR 22718, we are looking at issues with long-running applications producing non-representative frequencies. For example, in these two loops:
  int g = 0;
  __attribute__((noinline)) void bar() { g++; }
  extern int printf(const char*, ...);
  int main() {
    int i, j, k;
    for (i = 0; i < 1000000; i++) bar();
    printf ("g = %d\n", g);
    g = 0;
    for (i = 0; i < 500000; i++)
2013 Oct 28
2
[LLVMdev] loop vectorizer says Bad stride
Verifying function
running passes ...
LV: Checking a loop in "bar"
LV: Found a loop: L0
LV: Found an induction variable.
LV: We need to do 0 pointer comparisons.
LV: Checking memory dependencies
LV: Bad stride - Not an AddRecExpr pointer %13 = getelementptr float* %arg2, i32 %1 SCEV: ((4 * (sext i32 {(256 + %arg0),+,1}<nw><%L0> to i64)) + %arg2)
LV: Src Scev: {((4 * (sext
2014 Dec 11
5
[LLVMdev] dynamic data dependence extraction using llvm
Hi LLVM-ers, I am trying to develop my custom dynamic data dependence tool (focusing on nested loops); currently I can successfully get the trace, including load/store addresses, loop information, etc. However, when I try to analyze dynamic data dependence based on the pairwise method described in [1], the loads/stores for iteration variables may interfere with my analysis (I only care about the load/store for
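A hypothetical example of the interference described (function name invented): when the induction variable i itself lives in a stack slot (e.g. in unoptimized IR), the memory trace records loads and stores of &i alongside the A[i]/A[i-1] accesses that the pairwise dependence test actually cares about.

    void loop(int *A, int n) {
      for (int i = 1; i < n; ++i)   /* i may be kept in memory, polluting the trace */
        A[i] = A[i - 1] + 1;        /* the data dependence of interest */
    }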
2013 Oct 28
0
[LLVMdev] loop vectorizer says Bad stride
Frank, It looks like the loop vectorizer is unable to tell that the two stores in your code never overlap. This is probably because of the sign-extend in your code. Can you extend the indices to 64 bits? Thanks, Nadav On Oct 28, 2013, at 1:38 PM, Frank Winter <fwinter at jlab.org> wrote: > Verifying function > running passes ... > LV: Checking a loop in "bar" > LV:
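A hypothetical before/after sketch of the suggestion (not the poster's code): a 32-bit index that is sign-extended inside the loop can keep SCEV from expressing the pointer as an AddRecExpr, whereas a 64-bit induction variable avoids the sext and leaves a simple strided access.

    void bar32(float *out, const float *in, int base, int n) {
      for (int i = 0; i < n; ++i)
        out[base + i] = in[base + i] * 2.0f;  /* 32-bit index, sext on 64-bit targets */
    }

    void bar64(float *out, const float *in, long base, long n) {
      for (long i = 0; i < n; ++i)
        out[base + i] = in[base + i] * 2.0f;  /* 64-bit index, no sext needed */
    }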
2013 Nov 15
3
[LLVMdev] Limit loop vectorizer to SSE
On 15 November 2013 20:05, Frank Winter <fwinter at jlab.org> wrote: > Good catch! That was the problem in my case too. I totally > overlooked the alignment requirement for AVX. I wonder if the validation mechanism shouldn't have caught it earlier... Do you guys run validate on the modules before JIT-ing? --renato
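A minimal sketch of the alignment point being made, assuming the usual AVX requirements (function name invented): aligned 256-bit vector loads/stores expect 32-byte alignment, so buffers handed to JIT-compiled vectorized code need matching alignment.

    #include <stdlib.h>

    float *alloc_avx_buffer(size_t n) {
      void *p = NULL;
      /* 32-byte alignment covers 256-bit (AVX) vector accesses */
      if (posix_memalign(&p, 32, n * sizeof(float)) != 0)
        return NULL;
      return (float *)p;
    }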
2013 Nov 06
3
[LLVMdev] loop vectorizer
Good that you bring this up. I still have no solution to this vectorization problem. However, I can rewrite the code and insert a second loop which eliminates the 'urem' and 'div' instructions in the index calculations. In this case, the inner loop's trip count would be equal to the SIMD length and the loop vectorizer ignores the loop. Unrolling the loop and SLP is not an
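A hypothetical sketch of the restructuring described (names invented, assuming n is a multiple of the SIMD width): the first version computes lane indices with '%' and '/', which show up as urem/udiv in the IR; the second replaces them with an explicit inner loop whose trip count equals the SIMD length.

    void with_rem(float *out, const float *in, unsigned long n) {
      for (unsigned long i = 0; i < n; ++i) {
        unsigned long ir = i % 4;          /* urem in the IR */
        unsigned long ib = i / 4;          /* udiv in the IR */
        out[ib * 4 + ir] = in[i] * 2.0f;
      }
    }

    void restructured(float *out, const float *in, unsigned long n) {
      for (unsigned long ib = 0; ib < n / 4; ++ib)
        for (unsigned long ir = 0; ir < 4; ++ir)   /* trip count == SIMD length */
          out[ib * 4 + ir] = in[ib * 4 + ir] * 2.0f;
    }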
2013 Oct 31
5
[LLVMdev] loop vectorizer
On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org> wrote: > const std::uint64_t ir0 = (i+0)%4; // not working > I thought this would be the case when I saw the original expression. Maybe we need to teach modulo arithmetic to SCEV? --renato
2015 Jul 01
3
[LLVMdev] SLP vectorizer on AVX feature
On 1 July 2015 at 21:22, Frank Winter <fwinter at jlab.org> wrote: > there were two follow-up emails. I only got one... weird... > The issue is solved. The SLP vectorizer has > a magic number built into the code which determines the max. vector length > to search for. That was set to 128 bits. Increasing it to 256 bits solved > the issue. That looks like a simple fix. Is
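A hypothetical example of a block that benefits from the wider search (names invented): as a single 8 x float bundle it needs a 256-bit vector, so with the search width capped at 128 bits it would at best be split into two 4-wide groups.

    void axpy8(float *restrict y, const float *restrict x, float a) {
      y[0] += a * x[0]; y[1] += a * x[1]; y[2] += a * x[2]; y[3] += a * x[3];
      y[4] += a * x[4]; y[5] += a * x[5]; y[6] += a * x[6]; y[7] += a * x[7];
    }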