thr3ads.net - llvm dev - [LLVMdev] Is there pass to break down <4 x float> to scalars [Oct 2013]

Liu Xin

2013-Oct-25 13:49 UTC

[LLVMdev] Is there pass to break down <4 x float> to scalars

Hi, Richard,

I think we are solving the same problem. I am working on a shader language
too.  I am not satisfied with the current binaries because vector operations
are left intact by llvm opt.

The GLSL shader language has an operation called "swizzle". It can select
sub-components of a vector. If a shader only takes components "xy" of a
vec4, it's certainly wasteful to generate 4 operations on a scalar
processor.
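
To make the waste concrete, here is a minimal Python sketch. It is a toy
model, not LLVM code, and the helper names scalarize_mul and
drop_dead_lanes are made up for illustration. It expands one vec4 multiply
into four scalar operations and then keeps only the lanes that an "xy"
swizzle reads:

```python
# Toy model, not LLVM code: one vec4 multiply expanded into a scalar
# multiply per lane, then pruned to the lanes the swizzle actually reads.
LANES = "xyzw"

def scalarize_mul(dst, a, b):
    """Expand dst = a * b on vec4 values into four scalar operations."""
    return [(f"{dst}.{lane}", "fmul", f"{a}.{lane}", f"{b}.{lane}")
            for lane in LANES]

def drop_dead_lanes(ops, swizzle):
    """Keep only the scalar ops whose result lane is read by the swizzle."""
    live = set(swizzle)
    return [op for op in ops if op[0].rsplit(".", 1)[1] in live]

ops = scalarize_mul("%r", "%a", "%b")
live_ops = drop_dead_lanes(ops, "xy")
print(len(ops), "scalar ops before,", len(live_ops), "after")
```

In this model two of the four scalar operations remain, which is roughly
the effect one would hope to get from decomposing the vector and then
removing the dead scalar code.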

I think a good solution for llvm is in codegen. Many compilers have a codegen
optimizer. A DSE would be good enough.

Which posted TBAA patch do you mean? Do you have yet another solution besides
decompose-vectors?


thanks,
--lx



On Fri, Oct 25, 2013 at 6:06 PM, Richard Sandiford <
rsandifo at linux.vnet.ibm.com> wrote:
> Liu Xin <navy.xliu at gmail.com> writes:
> > Hi, LLVM community,
> >
> > I wrote some code by hand using LLVM IR. For simplicity, I wrote it in
> > <4 x float>. Now I have found that some stores for elements are useless.
> >
> > For example, if I store {0.0, 1.0, 2.0, 3.0} to a <4 x float> %a, maybe
> > only %a.xy is alive in my program.  Our target doesn't feature SIMD
> > instructions, which means we have to lower vectors to many scalar
> > instructions. I found llvm doesn't have DSE in codegen, right?
> >
> > Is there a pass which can break down vector operations to scalars?
>
> I wanted the same thing for SystemZ, which doesn't have vectors,
> in order to improve the llvmpipe code.  FWIW, here's what I have locally.
>
> It is able to decompose loads and stores, but I found in the llvmpipe case
> that this made things worse with TBAA, because DAGCombiner::GatherAllAliases
> has some fairly strict limits.  So I disabled that by default; use
> -decompose-vector-load-store to reenable.
>
> The main motivation for z was instead to get InstCombine to rewrite
> things like scalarised selects.
>
> I haven't submitted it yet because it's less of a win than the TBAA
> DAGCombiner patch I posted, so I didn't want to distract from that.
> It would also need some TargetTransformInfo hooks to decide which
> vectors should be decomposed.
>
> Thanks,
> Richard
>

Richard Sandiford

2013-Oct-25 14:19 UTC

Liu Xin <navy.xliu at gmail.com> writes:
> I think we are solving the same problem. I am working on a shader language
> too.  I am not satisfied with the current binaries because vector operations
> are left intact by llvm opt.
>
> The GLSL shader language has an operation called "swizzle". It can select
> sub-components of a vector. If a shader only takes components "xy" of a
> vec4, it's certainly wasteful to generate 4 operations on a scalar
> processor.
>
> I think a good solution for llvm is in codegen. Many compilers have a codegen
> optimizer. A DSE would be good enough.
>
> Which posted TBAA patch do you mean? Do you have yet another solution besides
> decompose-vectors?

Ah, no, the TBAA thing is separate really.  llvmpipe generally operates
on 4 rows at a time, so some functions end up with patterns like:

   load <16 x i8> row0 ...
   load <16 x i8> row1 ...
   load <16 x i8> row2 ...
   load <16 x i8> row3 ...
   ... do stuff ...
   store <16 x i8> row0 ...
   store <16 x i8> row1 ...
   store <16 x i8> row2 ...
   store <16 x i8> row3 ...

Since the row stride is variable, llvm doesn't have enough information
to tell that these rows don't alias.  So it has to keep the loads and
stores in order.  And z only has 16 general registers, so a naively-
scalarised 16 x i8 operation rapidly runs out.  With unmodified llvmpipe
IR we get lots of spills.

Since z also has x86-like register-memory operations, a few spills are
usually OK.  But in this case we have to load i8s and immediately
spill them.

So the idea was to add TBAA information to the llvmpipe IR to say that
the rows don't alias.  (At the moment I'm only doing that by hand on
saved IR, I've not done it in llvmpipe itself yet.)  Combined with
-combiner-alias-analysis -combiner-global-alias-analysis, this allows
the loads and stores to be reordered, which gives much better code.
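
The aliasing constraint can be illustrated with a small Python model. This
is an illustration only: the function names and the "+1" row update are
assumptions, not llvmpipe code. It compares the original order (load all
rows, then store all rows) with a reordered one-row-at-a-time order, once
for a stride where the rows are disjoint and once where they overlap:

```python
def process_batched(mem, base, stride, width, rows=4):
    """Original order: load every row first, then store every row."""
    loaded = [bytes(mem[base + r * stride: base + r * stride + width])
              for r in range(rows)]
    for r, row in enumerate(loaded):
        off = base + r * stride
        mem[off:off + width] = bytes((b + 1) & 0xFF for b in row)

def process_interleaved(mem, base, stride, width, rows=4):
    """Reordered: load, update and store one row at a time."""
    for r in range(rows):
        off = base + r * stride
        row = bytes(mem[off:off + width])
        mem[off:off + width] = bytes((b + 1) & 0xFF for b in row)

def same_result(stride, width=16):
    """True if both orders leave memory in the same state."""
    a = bytearray(range(128))
    b = bytearray(range(128))
    process_batched(a, 0, stride, width)
    process_interleaved(b, 0, stride, width)
    return a == b

print("stride 16 (disjoint rows):", same_result(16))
print("stride 8 (overlapping rows):", same_result(8))
```

With a stride of 16 the two orders agree, but with a stride of 8 a store
to one row changes what the next row's load sees. Without extra
information the compiler has to assume the second case is possible.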

However, the problem at the moment is that there are other scalar loads
that get rewritten by DAGCombiner and the legalisation code, and in the
process lose their TBAA info.  This then interferes with the optimisation
above.  So I wanted to make sure that the TBAA information is kept around:

  http://llvm-reviews.chandlerc.com/D1894

It was just that if I had a choice of only getting one of the two patches in,
it'd definitely be the D1894 one.  It sounds like there's more interest in
the DecomposeVectors patch than I'd expected though, so I'll get back to it.

Maybe as a first cut we can have a TargetTransformInfo hook to enable or
disable the pass wholesale, with a command-line option to override it.

Thanks to you and Renato for the feedback.

Richard

Pekka Jääskeläinen

2013-Oct-25 15:19 UTC

On 10/25/2013 05:19 PM, Richard Sandiford wrote:
> Since the row stride is variable, llvm doesn't have enough information
> to tell that these rows don't alias.

This sounds like a use case for parallelism metadata for
unrolled parallel code (which a "parallel program AA"
would exploit). I proposed a format for it some time ago here,
but haven't had time to go on with it.

http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-March/060270.html

-- 
Pekka

Liu Xin

2013-Oct-30 09:04 UTC

Hi, Richard,

Your decompose-vectors patch works perfectly on my side.  Unfortunately, I
still get stupid code because llvm's '-dse' fails to remove the stores when
run after '-decompose-vectors'.
I read the DSE code and it is definitely capable of eliminating unused
memory stores if its AA works.  I don't think basic AA works for me. I
found that my program has complex memory accesses, such as two-dimensional
arrays.

Sorry, I am not good at AA. As I understand it, TBAA is just for C++.  Do you
mean that you can make use of TBAA to help DSE?
Why does TBAA give nothing at all for my program? basicaa is even better
than -tbaa.

liuxin at rd58:~/testbed$ opt  -tbaa -aa-eval -decompose-vectors -mem2reg -dse test.bc -debug-pass=Structure -o test.opt.bc  -stats
Pass Arguments:  -targetlibinfo -no-aa -tbaa -aa-eval -decompose-vectors -domtree -mem2reg -memdep -dse -preverify -verify
Target Library Information
No Alias Analysis (always returns 'may' alias)
Type-Based Alias Analysis
  ModulePass Manager
    FunctionPass Manager
      Exhaustive Alias Analysis Precision Evaluator
      Decompose vector operations into smaller pieces
      Dominator Tree Construction
      Promote Memory to Register
      Memory Dependence Analysis
      Dead Store Elimination
      Preliminary module verification
      Module Verifier
    Bitcode Writer
===== Alias Analysis Evaluator Report ====
  1176 Total Alias Queries Performed
  0 no alias responses (0.0%)
  1176 may alias responses (100.0%)
  0 partial alias responses (0.0%)
  0 must alias responses (0.0%)
  Alias Analysis Evaluator Pointer Alias Summary: 0%/100%/0%/0%
  49 Total ModRef Queries Performed
  0 no mod/ref responses (0.0%)
  0 mod responses (0.0%)
  0 ref responses (0.0%)
  49 mod & ref responses (100.0%)
  Alias Analysis Evaluator Mod/Ref Summary: 0%/0%/0%/100%
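
Why a report with 100% "may alias" answers leaves DSE with nothing to do
can be sketched in Python. This is a toy model, not the LLVM pass, and
dead_store_elim, always_may_alias and exact are hypothetical names. A
store is removed only if a later store overwrites the same location and
no load in between may read it:

```python
def dead_store_elim(insts, may_alias):
    """Drop a store when a later store to the same pointer overwrites it
    and no load in between may read the stored location."""
    kept = []
    for i, inst in enumerate(insts):
        if inst[0] == "store":
            dead = False
            for later in insts[i + 1:]:
                if later[0] == "load" and may_alias(later[1], inst[1]):
                    break        # the value might be read: keep the store
                if later[0] == "store" and later[1] == inst[1]:
                    dead = True  # overwritten before any possible read
                    break
            if dead:
                continue
        kept.append(inst)
    return kept

def always_may_alias(p, q):
    """An analysis with no information: every pair may alias."""
    return True

def exact(p, q):
    """An idealised oracle: only identical names alias."""
    return p == q

prog = [("store", "a.x", 0.0), ("load", "b.y"),
        ("store", "a.x", 1.0), ("load", "a.x")]

print(len(dead_store_elim(prog, always_may_alias)), "instructions kept")
print(len(dead_store_elim(prog, exact)), "instructions kept")
```

With the always-may-alias answer the unrelated load of b.y blocks the
removal and all four instructions stay. With the exact oracle the first
store to a.x is recognised as dead.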

Our C/C++ compiler uses Steensgaard's points-to algorithm, so I went looking
for -steens-aa. It seems that llvm's poolalloc implements steens-aa,
right? Is it still maintained?
I found I cannot build rDSA using the latest llvm headers.

thanks,
--lx

