thr3ads.net - similar to: "[LLVMdev] Packed instructions generaetd by LoopVectorize?"

[LLVMdev] Packed instructions generaetd by LoopVectorize?

2013 Apr 03

Hi Tyler,

Try adding -ffast-math. We can only vectorize reduction variables if it is safe to reorder floating point operations.

Thanks,
Nadav

On Apr 3, 2013, at 10:29 AM, "Nowicki, Tyler" <tyler.nowicki at intel.com> wrote:
> Hi,
>
> I have a question about LoopVectorize. I wrote a simple test case, a dot product loop and found that packed instructions are
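The reordering concern in the reply above can be illustrated with a small C sketch. It is not LLVM's implementation: the function names `sum_sequential` and `sum_four_lanes` are hypothetical, and the four-accumulator loop is only an assumed stand-in for the order in which a 4-wide vectorized reduction would add the elements.

```c
#include <stddef.h>

/* Strict left-to-right sum: the evaluation order the C source specifies. */
float sum_sequential(const float *a, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Four interleaved partial sums combined at the end. This is an assumed
 * model of a 4-wide vectorized reduction, not actual compiler output. */
float sum_four_lanes(const float *a, size_t n) {
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        lane[0] += a[i];
        lane[1] += a[i + 1];
        lane[2] += a[i + 2];
        lane[3] += a[i + 3];
    }
    /* Store the pairwise sums in float variables so each is rounded to float. */
    float lo = lane[0] + lane[1];
    float hi = lane[2] + lane[3];
    float total = lo + hi;
    /* Scalar tail for the elements that do not fill a group of four. */
    for (; i < n; i++)
        total += a[i];
    return total;
}
```

Under IEEE 754 single precision, the input {1e8f, 1.0f, -1e8f, 1.0f} gives 1.0f from `sum_sequential` and 0.0f from `sum_four_lanes`, because 1e8f + 1.0f rounds back to 1e8f. A compiler may switch from the first order to the second only when it is told that such differences are acceptable, which is what -ffast-math permits.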

[LLVMdev] Packed instructions generaetd by LoopVectorize?

2013 Apr 04

Thanks, that did it! Are there any plans to enable the loop vectorizer by default?

From: Nadav Rotem [mailto:nrotem at apple.com]
Sent: Wednesday, April 03, 2013 13:33 PM
To: Nowicki, Tyler
Cc: LLVM Developers Mailing List
Subject: Re: Packed instructions generaetd by LoopVectorize?

Hi Tyler,

Try adding -ffast-math. We can only vectorize reduction variables if it is safe to reorder floating

[LLVMdev] X86 - Help on fixing a poor code generation bug

2013 Dec 05

Hi all, I noticed that the x86 backend tends to emit unnecessary vector insert instructions immediately after SSE scalar FP instructions like addss/mulss. For example:

    /////////////////////////////////
    __m128 foo(__m128 A, __m128 B) {
      return _mm_add_ss(A, B);
    }
    /////////////////////////////////

produces the sequence:

    addss %xmm0, %xmm1
    movss %xmm1, %xmm0

which could be easily optimized into

[LLVMdev] X86 - Help on fixing a poor code generation bug

2013 Dec 05

Hi Andrea, Thanks for working on this. I can see two approaches to solving this problem. The first one (that you suggested) is to catch this pattern after register allocation. The second approach is to eliminate this redundancy during instruction selection. Can you please look into catching this pattern during iSel? The idea is that ADDSS does an ADD plus BLEND operations, and you can easily

dot products

2012 Mar 07

Hello, I need to take a dot product of each row of a dataframe and a vector. The number of columns will be dynamic. The way I've been doing it so far is contorted. Is there a better way?

    dotproduct <- function(dataf, v2) {
      apply(t(t(as.matrix(dataf)) * v2), 1, sum) #contorted!
    }
    df = data.frame(a=c(1,2,3), b=c(4,5,6))
    vec = c(4,5)
    dotproduct(df, vec)

thanks,

[LLVMdev] better code for IV

2014 Feb 19

Hi Andrew, The issue below refers to LSR, so I'll appreciate your feedback. It also refers to instruction combining and might impact backends other than X86, so if you know of others that might be interested you are more than welcome to add them.

Thanks,
Anat

_____________________________________________
From: Shemer, Anat
Sent: Tuesday, February 18, 2014 15:07
To: 'llvmdev at

[LLVMdev] State of Loop Unrolling and Vectorization in LLVM

2013 Apr 15

Hi, I have a test case (and a micro benchmark made out of the test case) to check whether loop unrolling and loop vectorization are done efficiently in LLVM. Here is the test case (credits: Tyler Nowicki)

{code}
extern float * array;
extern int array_size;

float g() {
  int i;
  float total = 0;
  for(i = 0; i < array_size; i++) {
    total += array[i];
  }
  return total;
}
{code}

When

[LLVMdev] Poor floating point optimizations?

2010 Nov 20

I wanted to use LLVM for my math parser but it seems that floating point optimizations are poor. For example consider such C code:

    float foo(float x) { return x+x+x; }

and here is the code generated with "optimized" live demo:

    define float @foo(float %x) nounwind readnone {
    entry:
      %0 = fmul float %x, 2.000000e+00 ; <float> [#uses=1]
      %1 = fadd float %0, %x

[LLVMdev] Poor floating point optimizations?

2010 Nov 20

And also the resulting assembly code is very poor:

    00460013 movss xmm0,dword ptr [esp+8]
    00460019 movaps xmm1,xmm0
    0046001C addss xmm1,xmm1
    00460020 pxor xmm2,xmm2
    00460024 addss xmm2,xmm1
    00460028 addss xmm2,xmm0
    0046002C movss dword ptr [esp],xmm2
    00460031 fld dword ptr [esp]

Especially pxor&and instead of movss (which is

[PATCH] Make SSE Run Time option. Add Win32 SSE code

2004 Aug 06

All, Attached is a patch that does two things. First it makes the use of the current SSE code a run time option through the use of speex_decoder_ctl() and speex_encoder_ctl(). It does this in two parts. First there is a modification to the configure.in script which introduces a check based upon platform. It will compile in the SSE assembly if you are on an i?86 based platform by making a

[LLVMdev] LLVM x86 Code Generator discards Instruction-level Parallelism

2010 Nov 03

Dear LLVMdev, I've noticed an unusual behavior of the LLVM x86 code generator (with default options) that results in nearly a 4x slow-down in floating-point throughput for my microbenchmark. I've written a compute-intensive microbenchmark to approach theoretical peak throughput of the target processor by issuing a large number of independent floating-point multiplies. The distance

Trouble when suppressing a portion of fast-math-transformations

2017 Sep 29

Hi all, In a mailing-list post last November: http://lists.llvm.org/pipermail/llvm-dev/2016-November/107104.html I raised some concerns that having the IR-level fast-math-flag 'fast' act as an "umbrella" to implicitly turn on all the lower-level fast-math-flags, causes some fundamental problems. Those fundamental problems are related to situations where a user wants to

LoopVectorize fails to vectorize code with condition on reduction

2018 Jun 11

Hello. I'm not able to vectorize this simple C loop doing basically what could be called predicated sum-reduction:

    #define NMAX 1000
    int colOccupied[NMAX];

    int Func(int N) {
      int numSol = 0;
      for (int c = 0; c < N; c++) {
        if (colOccupied[c] == 0)
          numSol++;
      }
      return numSol;
    }

The compiler

[LLVMdev] Missuse of xmm register on X86-64

2010 May 07

All, I've been working on a new scheduler and have somehow affected register selection. My problem is that an xmm register is being used as an index expression. Specifically, addss (%xmm1,%rax,4), %xmm0 I like the idea of a floating-point index, but, like the assembler, I don't know what that means. Any suggestions on where I should look for a solution to my problem?

[LLVMdev] Generate scalar SSE instructions instead of packed instructions

2013 Feb 21

On Thu, Feb 21, 2013 at 12:14 PM, Nadav Rotem <nrotem at apple.com> wrote:
> You can change the input LLVM-IR.
>
> On Feb 21, 2013, at 7:16 AM, "Nowicki, Tyler" <tyler.nowicki at intel.com> wrote:
>
> Hi,
>
> I am interested in evaluating the performance of packed vs scalar double-precision floating point instructions on

[LLVMdev] 8-bit DIV IR irregularities

2012 Jun 27

Hi, I noticed that when dividing with signed 8-bit values the IR uses a 32-bit signed divide; however, when unsigned 8-bit values are used the IR uses an 8-bit unsigned divide. Why not use an 8-bit signed divide when using 8-bit signed values? Here is the C code and IR:

    char idiv8(char a, char b) {
      char c = a / b;
      return c;
    }

    define signext i8 @idiv8(i8 signext %a, i8 signext %b) nounwind
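One plausible reading of the signed case, offered as an assumption and not as the thread's conclusion: C promotes both `char` operands to `int` before dividing, and the quotient of two signed 8-bit values does not always fit back into 8 bits. The helper name `div8_promoted` below is hypothetical, and `signed char` is used because the signedness of plain `char` is implementation-defined.

```c
/* Divide two signed 8-bit values as C specifies: both operands are
 * promoted to int, so the division itself is carried out at int width. */
int div8_promoted(signed char a, signed char b) {
    return a / b;
}
```

`div8_promoted(-128, -1)` returns 128, which is outside the signed 8-bit range of -128 to 127. The quotient of two unsigned 8-bit values never exceeds 255, so the unsigned divide has no corresponding case.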

[LLVMdev] [PROPOSAL] Improve uses of LEA on Atom

2012 Sep 28

Hi, Here is an update on our proposal to improve the uses of LEA on Atom processors.

1. Disable current generation of LEAs

Due to a 3 cycle stall between the ALU and the AGU any address generation done using math instruction will cause a stall on loads and stores which are within 3 cycles of the address generation. Consequently, the heuristics for using LEAs efficiently must know how many

[LLVMdev] RFC: Tail call optimization X86

2007 Sep 24

On 24 Sep 2007, at 09:18, Evan Cheng wrote:
> +; RUN: llvm-as < %s | llc -march=x86 -mattr=+sse2 -stats -info-output-file - | grep asm-printer | grep 9
> +; change preceeding line form ... | grep 8 to ..| grep 9 since
> +; with new fastcc has std call semantics causing a stack adjustment
> +; after the function call
>
> Not sure if I understand this. Can you illustrate

2017 Jul 17

A bug related with undef value when bootstrap MemorySSA.cpp

Hello, some of the patches had conflicts with LLVM head, so I updated them. If you experienced patch failure before then you can try it again. I compiled your code (1.c) with LLVM r308173 with the 5 patches applied, and it generated assembly like this. Now it contains store to c(%rip). It tries to store a(%rip) + b(%rip) to c(%rip). I wish this translation is now correct. ``` 73 .globl hoo

[LLVMdev] 8-bit DIV IR irregularities

2012 Jun 28

I understand, but this sounds like legalization. Does every architecture trigger an overflow exception, as opposed to setting a bit? Perhaps it makes more sense to do this in the backends that trigger an overflow exception? I'm working on a modification for DIV right now in the x86 backend for Intel Atom that will improve performance, however because the *actual* operation has been replaced
