thr3ads.net - llvm dev - [LLVMdev] MCJIT generates MOVAPS on unaligned address [Aug 2014]

Frank Winter

2014-Aug-07 19:42 UTC

[LLVMdev] MCJIT generates MOVAPS on unaligned address

When lowering to x86-64, MCJIT generates a MOVAPS (Move Aligned Packed
Single-Precision Floating-Point Values) on a memory address that is not
guaranteed to be aligned:

     movaps    88(%rdx), %xmm0

where %rdx comes in as a function argument with only natural alignment
(float*). This x86 instruction requires the memory address to be 16-byte
aligned, which 88 plus a pointer that is only 4-byte aligned is not
guaranteed to be.
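
To make the arithmetic concrete, here is a minimal C++ sketch. It is not
LLVM code; the helper names isAligned and safeBaseResidues are made up for
illustration. It checks which 4-byte-aligned base pointers would make
base + 88 a legal MOVAPS operand.

```cpp
#include <cassert>
#include <cstdint>

// True when addr is a multiple of align; align must be a power of two.
inline bool isAligned(std::uint64_t addr, std::uint64_t align) {
    assert(align != 0 && (align & (align - 1)) == 0 &&
           "align must be a power of two");
    return (addr & (align - 1)) == 0;
}

// Counts how many of the four possible residues of a 4-byte-aligned base
// pointer (0, 4, 8, 12 modulo 16) make base + offset 16-byte aligned.
inline int safeBaseResidues(std::uint64_t offset) {
    int safe = 0;
    for (std::uint64_t base = 0; base < 16; base += 4)
        if (isAligned(base + offset, 16))
            ++safe;
    return safe;
}
```

safeBaseResidues(88) is 1: only a base pointer congruent to 8 modulo 16
makes base + 88 a multiple of 16 (8 + 88 = 96). For the other three
residues the access is misaligned and MOVAPS faults.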

Here is the corresponding IR, which was produced by the SLP vectorizer:

define void @func(float* noalias %arg0, float* noalias %arg1, float* noalias %arg2) {
entrypoint:
...
   %104 = getelementptr float* %arg0, i32 22
...
   %204 = bitcast float* %104 to <4 x float>*
   store <4 x float> %198, <4 x float>* %204

This in itself is not wrong. However, shouldn't the lowering pass recognize
the wrong alignment?

I am using LLVM 3.4.2, built from the source release available on llvm.org.

Thanks,
Frank

Arnold Schwaighofer

2014-Aug-07 20:29 UTC

> On Aug 7, 2014, at 12:42 PM, Frank Winter <fwinter at jlab.org> wrote:
>
> MCJIT when lowering to x86-64 generates a MOVAPS (Move Aligned Packed
> Single-Precision Floating-Point Values) on a non-aligned memory address:
>
>    movaps    88(%rdx), %xmm0
>
> where %rdx comes in as a function argument with only natural alignment
> (float*). This x86 instruction requires the memory address to be 16 byte
> aligned which 88 plus something aligned to 4 byte isn't.
>
> Here the according IR code which was produced from the SLP vectorizer:
>
> define void @func(float* noalias %arg0, float* noalias %arg1, float* noalias %arg2) {
> entrypoint:
> ...
>  %104 = getelementptr float* %arg0, i32 22
> ...
>  %204 = bitcast float* %104 to <4 x float>*
>  store <4 x float> %198, <4 x float>* %204
>
> This in itself not wrong. However, shouldn't the lowering pass
> recognize the wrong alignment?

The LLVM IR is wrong. Omitting the align directive on the store means the
ABI alignment of the stored type for the target, which for <4 x float> on
x86-64 is 16 bytes. The backend is “right” with respect to LLVM IR
semantics to produce the movaps.

The error is in the producer of said vector store (it looks like the SLP
vectorizer). Could you provide a full test case where running the SLP
vectorizer (opt -slp-vectorizer < t.ll) produces such an output?

The following code in the SLP vectorizer should have ensured that the
vector store gets an alignment of 4 bytes, given a data layout
(http://llvm.org/docs/LangRef.html#data-layout) that specifies f32:32:32.

    case Instruction::Store: {
      StoreInst *SI = cast<StoreInst>(VL0);
      unsigned Alignment = SI->getAlignment();
      ...
      StoreInst *S = Builder.CreateStore(VecValue, VecPtr);
      // Get the 4-byte alignment for the scalar float store from the
      // data layout string.
      if (!Alignment)
        Alignment = DL->getABITypeAlignment(SI->getPointerOperand()->getType());
      S->setAlignment(Alignment);
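
The two rules described in this reply can be modelled in a few lines of
standalone C++. This is only a sketch of the intent, not LLVM code: the
Ty enum and the functions abiAlignment, assumedStoreAlignment,
vectorStoreAlignment and mayUseMovaps are hypothetical, and the alignment
values are assumed x86-64 defaults (4 bytes for float, 16 bytes for
<4 x float>).

```cpp
#include <cassert>

// Toy type universe: just the two types that appear in the thread.
enum class Ty { Float, V4Float };

// ABI alignment in bytes. ASSUMPTION: x86-64 values (f32:32:32 and 128-bit
// vectors aligned to 128 bits); a real compiler reads these from the
// module's data layout.
inline unsigned abiAlignment(Ty t) {
    return t == Ty::Float ? 4u : 16u;
}

// Alignment the backend may assume for a store: the explicit align value,
// or the ABI alignment of the stored type when none is given (0).
inline unsigned assumedStoreAlignment(Ty stored, unsigned explicitAlign) {
    return explicitAlign != 0 ? explicitAlign : abiAlignment(stored);
}

// Alignment the vectorizer is meant to attach to the vector store it
// builds from scalar stores: the scalar store's own alignment, falling
// back to the scalar type's ABI alignment.
inline unsigned vectorStoreAlignment(Ty scalar, unsigned scalarAlign) {
    return scalarAlign != 0 ? scalarAlign : abiAlignment(scalar);
}

// An aligned 16-byte move is only legal when at least 16-byte alignment
// may be assumed.
inline bool mayUseMovaps(unsigned assumedAlign) {
    assert(assumedAlign != 0 && "alignment must be resolved first");
    return assumedAlign >= 16;
}
```

With these definitions, assumedStoreAlignment(Ty::V4Float, 0) is 16, so
mayUseMovaps returns true; that is the situation in the IR above. Passing
vectorStoreAlignment(Ty::Float, 0), which is 4, as the explicit alignment
makes mayUseMovaps return false, which is what a store carrying "align 4"
would tell the backend.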

Frank Winter

2014-Aug-07 20:36 UTC

Makes sense. I overlooked the missing alignment tag in the IR.

Attached is the full test case.

Frank


-------------- next part --------------
;; ModuleID = 'module'
target triple = "x86_64-unknown-linux-gnu"

declare float @sinf(float)

declare float @acosf(float)

declare float @asinf(float)

declare float @atanf(float)

declare float @ceilf(float)

declare float @floorf(float)

declare float @cosf(float)

declare float @coshf(float)

declare float @expf(float)

declare float @logf(float)

declare float @log10f(float)

declare float @sinhf(float)

declare float @tanf(float)

declare float @tanhf(float)

declare float @fabsf(float)

declare float @sqrtf(float)

declare float @powf(float, float)

declare float @atan2f(float, float)

declare double @sin(double)

declare double @acos(double)

declare double @asin(double)

declare double @atan(double)

declare double @ceil(double)

declare double @floor(double)

declare double @cos(double)

declare double @cosh(double)

declare double @exp(double)

declare double @log(double)

declare double @log10(double)

declare double @sinh(double)

declare double @tan(double)

declare double @tanh(double)

declare double @fabs(double)

declare double @sqrt(double)

declare double @pow(double, double)

declare double @atan2(double, double)

define void @main(float* noalias %arg0, float* noalias %arg1, float* noalias %arg2) {
entrypoint:
  %0 = getelementptr float* %arg1, i32 0
  %1 = load float* %0
  %2 = getelementptr float* %arg1, i32 4
  %3 = load float* %2
  %4 = getelementptr float* %arg1, i32 8
  %5 = load float* %4
  %6 = getelementptr float* %arg1, i32 12
  %7 = load float* %6
  %8 = getelementptr float* %arg1, i32 16
  %9 = load float* %8
  %10 = getelementptr float* %arg1, i32 20
  %11 = load float* %10
  %12 = getelementptr float* %arg2, i32 0
  %13 = load float* %12
  %14 = getelementptr float* %arg2, i32 4
  %15 = load float* %14
  %16 = getelementptr float* %arg2, i32 8
  %17 = load float* %16
  %18 = getelementptr float* %arg2, i32 12
  %19 = load float* %18
  %20 = getelementptr float* %arg2, i32 16
  %21 = load float* %20
  %22 = getelementptr float* %arg2, i32 20
  %23 = load float* %22
  %24 = fadd float %15, %3
  %25 = fadd float %13, %1
  %26 = fadd float %19, %7
  %27 = fadd float %17, %5
  %28 = fadd float %23, %11
  %29 = fadd float %21, %9
  %30 = getelementptr float* %arg0, i32 0
  store float %25, float* %30
  %31 = getelementptr float* %arg0, i32 4
  store float %24, float* %31
  %32 = getelementptr float* %arg0, i32 8
  store float %27, float* %32
  %33 = getelementptr float* %arg0, i32 12
  store float %26, float* %33
  %34 = getelementptr float* %arg0, i32 16
  store float %29, float* %34
  %35 = getelementptr float* %arg0, i32 20
  store float %28, float* %35
  %36 = getelementptr float* %arg1, i32 1
  %37 = load float* %36
  %38 = getelementptr float* %arg1, i32 5
  %39 = load float* %38
  %40 = getelementptr float* %arg1, i32 9
  %41 = load float* %40
  %42 = getelementptr float* %arg1, i32 13
  %43 = load float* %42
  %44 = getelementptr float* %arg1, i32 17
  %45 = load float* %44
  %46 = getelementptr float* %arg1, i32 21
  %47 = load float* %46
  %48 = getelementptr float* %arg2, i32 1
  %49 = load float* %48
  %50 = getelementptr float* %arg2, i32 5
  %51 = load float* %50
  %52 = getelementptr float* %arg2, i32 9
  %53 = load float* %52
  %54 = getelementptr float* %arg2, i32 13
  %55 = load float* %54
  %56 = getelementptr float* %arg2, i32 17
  %57 = load float* %56
  %58 = getelementptr float* %arg2, i32 21
  %59 = load float* %58
  %60 = fadd float %51, %39
  %61 = fadd float %49, %37
  %62 = fadd float %55, %43
  %63 = fadd float %53, %41
  %64 = fadd float %59, %47
  %65 = fadd float %57, %45
  %66 = getelementptr float* %arg0, i32 1
  store float %61, float* %66
  %67 = getelementptr float* %arg0, i32 5
  store float %60, float* %67
  %68 = getelementptr float* %arg0, i32 9
  store float %63, float* %68
  %69 = getelementptr float* %arg0, i32 13
  store float %62, float* %69
  %70 = getelementptr float* %arg0, i32 17
  store float %65, float* %70
  %71 = getelementptr float* %arg0, i32 21
  store float %64, float* %71
  %72 = getelementptr float* %arg1, i32 2
  %73 = load float* %72
  %74 = getelementptr float* %arg1, i32 6
  %75 = load float* %74
  %76 = getelementptr float* %arg1, i32 10
  %77 = load float* %76
  %78 = getelementptr float* %arg1, i32 14
  %79 = load float* %78
  %80 = getelementptr float* %arg1, i32 18
  %81 = load float* %80
  %82 = getelementptr float* %arg1, i32 22
  %83 = getelementptr float* %arg2, i32 2
  %84 = load float* %83
  %85 = getelementptr float* %arg2, i32 6
  %86 = load float* %85
  %87 = getelementptr float* %arg2, i32 10
  %88 = load float* %87
  %89 = getelementptr float* %arg2, i32 14
  %90 = load float* %89
  %91 = getelementptr float* %arg2, i32 18
  %92 = load float* %91
  %93 = getelementptr float* %arg2, i32 22
  %94 = fadd float %86, %75
  %95 = fadd float %84, %73
  %96 = fadd float %90, %79
  %97 = fadd float %88, %77
  %98 = fadd float %92, %81
  %99 = getelementptr float* %arg0, i32 2
  store float %95, float* %99
  %100 = getelementptr float* %arg0, i32 6
  store float %94, float* %100
  %101 = getelementptr float* %arg0, i32 10
  store float %97, float* %101
  %102 = getelementptr float* %arg0, i32 14
  store float %96, float* %102
  %103 = getelementptr float* %arg0, i32 18
  store float %98, float* %103
  %104 = getelementptr float* %arg0, i32 22
  %105 = getelementptr float* %arg1, i32 3
  %106 = load float* %105
  %107 = getelementptr float* %arg1, i32 7
  %108 = load float* %107
  %109 = getelementptr float* %arg1, i32 11
  %110 = load float* %109
  %111 = getelementptr float* %arg1, i32 15
  %112 = load float* %111
  %113 = getelementptr float* %arg1, i32 19
  %114 = load float* %113
  %115 = getelementptr float* %arg1, i32 23
  %116 = getelementptr float* %arg2, i32 3
  %117 = load float* %116
  %118 = getelementptr float* %arg2, i32 7
  %119 = load float* %118
  %120 = getelementptr float* %arg2, i32 11
  %121 = load float* %120
  %122 = getelementptr float* %arg2, i32 15
  %123 = load float* %122
  %124 = getelementptr float* %arg2, i32 19
  %125 = load float* %124
  %126 = getelementptr float* %arg2, i32 23
  %127 = fadd float %119, %108
  %128 = fadd float %117, %106
  %129 = fadd float %123, %112
  %130 = fadd float %121, %110
  %131 = fadd float %125, %114
  %132 = getelementptr float* %arg0, i32 3
  store float %128, float* %132
  %133 = getelementptr float* %arg0, i32 7
  store float %127, float* %133
  %134 = getelementptr float* %arg0, i32 11
  store float %130, float* %134
  %135 = getelementptr float* %arg0, i32 15
  store float %129, float* %135
  %136 = getelementptr float* %arg0, i32 19
  store float %131, float* %136
  %137 = getelementptr float* %arg0, i32 23
  %138 = getelementptr float* %arg1, i32 24
  %139 = getelementptr float* %arg1, i32 28
  %140 = load float* %139
  %141 = getelementptr float* %arg1, i32 32
  %142 = load float* %141
  %143 = getelementptr float* %arg1, i32 36
  %144 = load float* %143
  %145 = getelementptr float* %arg1, i32 40
  %146 = load float* %145
  %147 = getelementptr float* %arg1, i32 44
  %148 = load float* %147
  %149 = getelementptr float* %arg2, i32 24
  %150 = getelementptr float* %arg2, i32 28
  %151 = load float* %150
  %152 = getelementptr float* %arg2, i32 32
  %153 = load float* %152
  %154 = getelementptr float* %arg2, i32 36
  %155 = load float* %154
  %156 = getelementptr float* %arg2, i32 40
  %157 = load float* %156
  %158 = getelementptr float* %arg2, i32 44
  %159 = load float* %158
  %160 = fadd float %151, %140
  %161 = fadd float %155, %144
  %162 = fadd float %153, %142
  %163 = fadd float %159, %148
  %164 = fadd float %157, %146
  %165 = getelementptr float* %arg0, i32 24
  %166 = getelementptr float* %arg0, i32 28
  store float %160, float* %166
  %167 = getelementptr float* %arg0, i32 32
  store float %162, float* %167
  %168 = getelementptr float* %arg0, i32 36
  store float %161, float* %168
  %169 = getelementptr float* %arg0, i32 40
  store float %164, float* %169
  %170 = getelementptr float* %arg0, i32 44
  store float %163, float* %170
  %171 = getelementptr float* %arg1, i32 25
  %172 = bitcast float* %82 to <4 x float>*
  %173 = load <4 x float>* %172
  %174 = getelementptr float* %arg1, i32 29
  %175 = load float* %174
  %176 = getelementptr float* %arg1, i32 33
  %177 = load float* %176
  %178 = getelementptr float* %arg1, i32 37
  %179 = load float* %178
  %180 = getelementptr float* %arg1, i32 41
  %181 = load float* %180
  %182 = getelementptr float* %arg1, i32 45
  %183 = load float* %182
  %184 = getelementptr float* %arg2, i32 25
  %185 = bitcast float* %93 to <4 x float>*
  %186 = load <4 x float>* %185
  %187 = getelementptr float* %arg2, i32 29
  %188 = load float* %187
  %189 = getelementptr float* %arg2, i32 33
  %190 = load float* %189
  %191 = getelementptr float* %arg2, i32 37
  %192 = load float* %191
  %193 = getelementptr float* %arg2, i32 41
  %194 = load float* %193
  %195 = getelementptr float* %arg2, i32 45
  %196 = load float* %195
  %197 = fadd float %188, %175
  %198 = fadd <4 x float> %186, %173
  %199 = fadd float %192, %179
  %200 = fadd float %190, %177
  %201 = fadd float %196, %183
  %202 = fadd float %194, %181
  %203 = getelementptr float* %arg0, i32 25
  %204 = bitcast float* %104 to <4 x float>*
  store <4 x float> %198, <4 x float>* %204
  %205 = getelementptr float* %arg0, i32 29
  store float %197, float* %205
  %206 = getelementptr float* %arg0, i32 33
  store float %200, float* %206
  %207 = getelementptr float* %arg0, i32 37
  store float %199, float* %207
  %208 = getelementptr float* %arg0, i32 41
  store float %202, float* %208
  %209 = getelementptr float* %arg0, i32 45
  store float %201, float* %209
  ret void
}

define void @main_extern([8 x i8]* %arg_ptr) {
entrypoint:
  %0 = getelementptr [8 x i8]* %arg_ptr, i32 0
  %1 = bitcast [8 x i8]* %0 to float**
  %2 = load float** %1
  %3 = getelementptr [8 x i8]* %arg_ptr, i32 1
  %4 = bitcast [8 x i8]* %3 to float**
  %5 = load float** %4
  %6 = getelementptr [8 x i8]* %arg_ptr, i32 2
  %7 = bitcast [8 x i8]* %6 to float**
  %8 = load float** %7
  call void @main(float* %2, float* %5, float* %8)
  ret void
}

Frank Winter

2014-Aug-07 21:18 UTC

It's not reproducible with 'opt'. The wrong IR is generated only when I
call the SLP pass from my application.

On the attached module I run the following via the function pass manager:

1) TargetLibraryInfo with the target triple
2) Set the data layout
3) Basic Alias Analysis
4) SLP vectorizer

This produces the wrong IR. On the other hand, running the attached
module through 'opt -slp-vectorizer' results in no code changes.

What could I be missing here?

Frank


-------------- next part --------------
A non-text attachment was scrubbed...
Name: module_H7ktW0.ll.gz
Type: application/x-gzip
Size: 65781 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140807/a42fe088/attachment.bin>
