Frank Winter
2015-Jun-03 17:37 UTC
[LLVMdev] Replacing a repetitive sequence of code with a loop
Hey guys, in an HPC project I am working on I am given an LLVM program consisting of a linear sequence of repetitive junks of code with an uniform memory access pattern. Each code junk does the following: 1) loads some memory, 2) performs some arithmetic operations, 3) stores the result back to memory. The memory stride between consecutive junks is constant over the whole program, thus the whole program could be replaced with a single loop with it's loop body containing a generic version of said code junk.Here's an example (a short one, the real world program would be much longer): define void @vec_plus_vec(float* noalias %arg0, float* noalias %arg1, float* noalias %arg2) { entrypoint: %0 = bitcast float* %arg1 to <4 x float>* %1 = bitcast float* %arg2 to <4 x float>* %2 = load <4 x float>* %0, align 16 %3 = load <4 x float>* %1, align 16 %4 = fadd <4 x float> %3, %2 %5 = bitcast float* %arg0 to <4 x float>* store <4 x float> %4, <4 x float>* %5, align 16 %6 = getelementptr float* %arg1, i64 4 %7 = getelementptr float* %arg2, i64 4 %8 = getelementptr float* %arg0, i64 4 %9 = bitcast float* %6 to <4 x float>* %10 = bitcast float* %7 to <4 x float>* %11 = load <4 x float>* %9, align 16 %12 = load <4 x float>* %10, align 16 %13 = fadd <4 x float> %12, %11 %14 = bitcast float* %8 to <4 x float>* store <4 x float> %13, <4 x float>* %14, align 16 %15 = getelementptr float* %arg1, i64 8 %16 = getelementptr float* %arg2, i64 8 %17 = getelementptr float* %arg0, i64 8 %18 = bitcast float* %15 to <4 x float>* %19 = bitcast float* %16 to <4 x float>* %20 = load <4 x float>* %18, align 16 %21 = load <4 x float>* %19, align 16 %22 = fadd <4 x float> %21, %20 %23 = bitcast float* %17 to <4 x float>* store <4 x float> %22, <4 x float>* %23, align 16 %24 = getelementptr float* %arg1, i64 12 %25 = getelementptr float* %arg2, i64 12 %26 = getelementptr float* %arg0, i64 12 %27 = bitcast float* %24 to <4 x float>* %28 = bitcast float* %25 to <4 x float>* %29 = load <4 x float>* %27, align 16 %30 = load <4 x float>* %28, align 16 %31 = fadd <4 x float> %30, %29 %32 = bitcast float* %26 to <4 x float>* store <4 x float> %31, <4 x float>* %32, align 16 ret void } The stride between two consecutive junks is 4*sizeof(float), thus could be condensed into a single loop. (The scenario with the constant stride between repetitive code junks is the first, most simple version. Later I will have to deal with arbitrary strides between junks. Requires an index array stored in the code as static data I guess)... For this simple scenario what already existent LLVM pass would be closest to what I am trying to achieve? Thanks, Frank
Benjamin Kramer
2015-Jun-03 18:57 UTC
[LLVMdev] Replacing a repetitive sequence of code with a loop
> On 03.06.2015, at 19:37, Frank Winter <fwinter at jlab.org> wrote: > > Hey guys, in an HPC project I am working on I am given an LLVM program consisting of a linear sequence of repetitive junks of code with an uniform memory access pattern. Each code junk does the following: 1) loads some memory, 2) performs some arithmetic operations, 3) stores the result back to memory. The memory stride between consecutive junks is constant over the whole program, thus the whole program could be replaced with a single loop with it's loop body containing a generic version of said code junk.Here's an example (a short one, the real world program would be much longer): > > define void @vec_plus_vec(float* noalias %arg0, float* noalias %arg1, float* noalias %arg2) { > entrypoint: > %0 = bitcast float* %arg1 to <4 x float>* > %1 = bitcast float* %arg2 to <4 x float>* > %2 = load <4 x float>* %0, align 16 > %3 = load <4 x float>* %1, align 16 > %4 = fadd <4 x float> %3, %2 > %5 = bitcast float* %arg0 to <4 x float>* > store <4 x float> %4, <4 x float>* %5, align 16 > %6 = getelementptr float* %arg1, i64 4 > %7 = getelementptr float* %arg2, i64 4 > %8 = getelementptr float* %arg0, i64 4 > %9 = bitcast float* %6 to <4 x float>* > %10 = bitcast float* %7 to <4 x float>* > %11 = load <4 x float>* %9, align 16 > %12 = load <4 x float>* %10, align 16 > %13 = fadd <4 x float> %12, %11 > %14 = bitcast float* %8 to <4 x float>* > store <4 x float> %13, <4 x float>* %14, align 16 > %15 = getelementptr float* %arg1, i64 8 > %16 = getelementptr float* %arg2, i64 8 > %17 = getelementptr float* %arg0, i64 8 > %18 = bitcast float* %15 to <4 x float>* > %19 = bitcast float* %16 to <4 x float>* > %20 = load <4 x float>* %18, align 16 > %21 = load <4 x float>* %19, align 16 > %22 = fadd <4 x float> %21, %20 > %23 = bitcast float* %17 to <4 x float>* > store <4 x float> %22, <4 x float>* %23, align 16 > %24 = getelementptr float* %arg1, i64 12 > %25 = getelementptr float* %arg2, i64 12 > %26 = getelementptr float* %arg0, i64 12 > %27 = bitcast float* %24 to <4 x float>* > %28 = bitcast float* %25 to <4 x float>* > %29 = load <4 x float>* %27, align 16 > %30 = load <4 x float>* %28, align 16 > %31 = fadd <4 x float> %30, %29 > %32 = bitcast float* %26 to <4 x float>* > store <4 x float> %31, <4 x float>* %32, align 16 > ret void > } > > The stride between two consecutive junks is 4*sizeof(float), thus could be condensed into a single loop. (The scenario with the constant stride between repetitive code junks is the first, most simple version. Later I will have to deal with arbitrary strides between junks. Requires an index array stored in the code as static data I guess)... > > For this simple scenario what already existent LLVM pass would be closest to what I am trying to achieve?There's a loop reroll pass in LLVM trunk that should do exactly this transformation. - Ben
Renato Golin
2015-Jun-03 20:13 UTC
[LLVMdev] Replacing a repetitive sequence of code with a loop
On 3 June 2015 at 19:57, Benjamin Kramer <benny.kra at gmail.com> wrote:> There's a loop reroll pass in LLVM trunk that should do exactly this transformation.Though that's a loop pass (runOnLoop). What you could do is add a previous pass that would recognize the pattern and create a loop of 1 iteration around the code and then run the reroll pass. If your pattern recognition is too good, it'll be redundant with the reroll pass, so I'm not sure how to do that without duplicating effort or destroying the IR (in case you're wrong). Adding 1-iteration loops to all functions won't work either. You'll have to balance the heuristics and make the pass optional, maybe with a pragma to help the compiler. cheers, --renato