thr3ads.net - llvm dev - [llvm-dev] [RFC] A New Divergence Analysis for LLVM [May 2018]

If this information is useful, please help other people find it:
Share via:
Simon Moll via llvm-dev
2018-May-28 17:38 UTC
[llvm-dev] [RFC] A New Divergence Analysis for LLVM

TL;DR This RFC is a joint effort by Intel and Saarland University to 
bring the divergence analysis of the Region Vectorizer [1,2,3,4,5] 
(dubbed the vectorization analysis of RV) to LLVM. The implementation is 
available on github for feedback [0]. The existing divergence analysis 
infrastructure in LLVM has conceptual limitations (structured control, 
SCEV based). The new analysis resolves bugs for the GPU backends from 
the first patch on and will be used by VPlan [6] in the mid term. The 
implementation is based on RV, the Region Vectorizer.


-- RV --

The Region Vectorizer (github [1]) is an analysis and transformation 
framework for outer-loop and whole-function vectorization. RV vectorizes 
arbitrary, reducible control flow including nested divergent loops 
through partial control-flow linearization [2]. RV is being used by the 
Impala [7] and the PACXX [3,8] high performance programming frameworks 
and implements OpenMP #pragma omp simd and #pragma omp declare simd.


-- The Divergence Analysis --

The Divergence Analysis determines how instructions will behave if 
executed in lockstep for multiple threads or vector lanes. The loop 
vectorizer (VPlan) uses divergence information to generate correct and 
efficient vector code. The GPU backends rely on it to re-structure the CFG.
An instruction is uniform if it evaluates to the same result for all 
(active) threads. Otherwise, it is is called divergent. Divergent branch 
conditions cause control to diverge at branches and can turn loops into 
divergent loops. Divergent branches induce divergence in phi nodes, if 
the branch divergence implies that a phi node could be reached by 
different threads from two distinct predecessors at once or from two 
distinct loop iterations at once (temporal divergence).

Example A: Divergence analysis result with regard to the outer loop.
for (int thread_id = 0; thread_id < n; ++thread_id) {// thread_id divergent
  V = 0; // V uniform
  if (A[thread_id] > 0) { // divergent branch
    V = 1; // V uniform
  }
  // divergent phi for V (control induced)
  int j = 0; // uniform
  for (; j < m; ++j) { // uniform loop exit
    if (B[j] > 50) { // uniform branch
      C[j][thread_id] = 42;
    }
    if (D[j] < V) break; // divergent loop exit
  }
  x = j; // divergent (temporal; due to divergent loop)
}


-- Motivation --

The LoopVectorizer and VPlan currently use SCEV to detect uniform 
values. The StructurizeCFG pass and the AMDGPU backends rely on LLVM’s 
existing DivergenceAnalysis pass for SPMD kernels. Both approaches have 
shortcomings that are resolved by the new analysis.


- Why the current DivergenceAnalysis is broken -

LLVM already includes a DivergenceAnalysis for GPU kernels. It is used 
in the GPU backends and to re-structure CFGs (StructurizeCFG pass). 
However, the existing DivergenceAnalysis of LLVM is broken for 
unstructured CFGs (bug report [9]):

Example B: undetected divergence in unstructured control.
entry:
  divergent branch to B and C
B:
  uniform branch to C and D
C:
  %phi.var.hidden = phi .. <- undetected DIVERGENT phi
  br D
D:
  %phi.var = phi .. <- detected DIVERGENT phi in IPostDom of entry
  Ret

The existing DivergenceAnalysis expects control to re-converge only at 
the immediate post dominator of a branch, which is not the case for 
unstructured control as shown in the example. The new analysis supports 
unstructured control flow.


-- Why SCEV is not enough --

SCEV is restricted to induction variables and (piecewise) polynomial 
functions. The divergence analysis on the other hand is able to detect 
uniform instructions and branches in arbitrary programs. The key 
takeaway here is that “there are uniform values that are not invariant.” 
(http://lists.llvm.org/pipermail/llvm-dev/2017-December/119522.html).

Example A: non-affine loop (Test NonAffineUniLoop.ll)
for (int i = 0; i < n; ++i) {
  for (int len = 1; len * 2 < n; len *= 2) {
    for (int k = 0; k < n k += 2 * len) {
      A[k * n + i] += A[(k + len) * n + i];
    }
  }
}

The variable “len” is uniform with regards to the outer i-loop. SCEV 
does not detect this:
%mul44 U: [0,-1) S: [-2147483648,2147483647)
Exits: <<Unknown>>
LoopDispositions: { %for.body8.lr.ph: Variant, %for.body: Variant, 
%for.body8: Invariant }


-- The new DivergenceAnalysis --

The proposed DivergenceAnalysis detects divergence in unstructured, 
reducible control flow and nested divergent loops. The implementation 
provides light-weight frontends for use in VPlan 
(LoopDivergenceAnalysis) and the analysis of GPU kernels 
(GPUDivergenceAnalysis). Both frontends build on a single, shared 
implementation of the actual divergence analysis algorithm.

Apart from solving current issues with divergence detection in LLVM, the 
new analysis will enable the adoption of advanced vectorization 
techniques in VPlan that are already implemented in RV, for example 
partial linearization [2], automatic conversion of divergent loops, etc).


--- Planned patches ---

-- Patch set 1 (reference impl. [0]) --
o New DivergenceAnalysis class, which implements a basic divergence 
analysis (uniform/divergent) with support for unstructured, reducible 
control and nested divergent loops. The analysis operates on LLVM 
IR.GPUDivergenceAnalysis: analysis frontend for SPMD kernels (fixes the 
unstructured control bug [9]). Acts as a drop-in replacement for the 
existing DivergenceAnalysis in LLVM (opt-in flag -use-rv-da).
o LoopDivergenceAnalysis: frontend for loop vectorization in preparation 
of VPlan [6].
o The existing DivergenceAnalysis gets renamed to KernelDivergenceAnalysis.
o LIT tests for both frontends.

-- Patch set 2 --
o Integration with VPlan to detect uniform values in outer loop 
vectorization (opt-in flag -vectorizer-use-da).
o More expressive analysis lattice (uniform/consecutive/strided…  + 
alignment). This is the full sa lattice (stride+alignment lattice) used 
in RV [3].

-- Patch set 3 --
o Templatize the divergence analysis to operate directly on VPlans also 
(block/instruction type parametric).


We provide the implementation of this RFC as a fork of LLVM on github 
[0]. We welcome your feedback. This implementation will be the source of 
upcoming patches.


Thanks,

Simon




[0] VPlan+RV: https://github.com/cdl-saarland/vplan-rv
[1] RV on github: https://github.com/cdl-saarland/rv
[2] Partial Control-Flow Linearization, Moll, Hack, PLDI ‘18 (to 
appear): http://compilers.cs.uni-saarland.de/papers/moll_parlin_pldi18.pdf
[3] PACXXv2 + RV: An LLVM-based Portable High-Performance Programming 
Model: https://dl.acm.org/citation.cfm?id=3148185
[4] VPlan+RV SRC talk at US DevMtg ‘17: 
https://llvm.org/devmtg/2017-10/#src1
[5] VPlan+RV Lightning talk at EuroLLVM ‘18: 
http://llvm.org/devmtg/2018-04/talks.html#Lightning_16
[6] VPlan proposal: https://llvm.org/docs/Proposals/VectorizationPlan.html
[7] Impala: https://anydsl.github.io/Impala
[8] PACXX: https://github.com/pacxx
[9] DivergenceAnalysis bug https://bugs.llvm.org/show_bug.cgi?id=37185

-- 

Simon Moll
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 :moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  :http://compilers.cs.uni-saarland.de/people/moll

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180528/8f1d7d48/attachment.html>
Possibly Parallel Threads

Search for more seemingly similar threads
llvm dev - May 2018 - [RFC] A New Divergence Analysis for LLVM

[llvm-dev] [RFC] A New Divergence Analysis for LLVM

Possibly Parallel Threads

Wisdom of the Ancients