Displaying 6 results from an estimated 6 matches for "vblendvp".
2011 Jun 01
4
[LLVMdev] AVX Status?
...%a,
<8 x float> %b, i8 1) nounwind readnone
%res = tail call <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float>
%a, <8 x float> %b, <8 x float> %cmp) nounwind readnone
ret <8 x float> %res
}
This works fine and produces the expected assembly (VCMPLTPS + VBLENDVPS).
On the other hand, this does not work:
define <8 x float> @test2(<8 x float> %a, <8 x float> %b, <8 x i32> %m)
nounwind readnone {
entry:
%cmp = tail call <8 x float> @llvm.x86.avx.cmp.ps.256(<8 x float> %a,
<8 x float> %b, i8 1) nounwind readnone...
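The VCMPLTPS + VBLENDVPS pairing mentioned in the excerpt above can be read lane by lane: the compare writes an all-ones mask where a < b, and the blend picks the second source wherever the mask's sign bit is set. The following is only a scalar C model of that behaviour, not the LLVM intrinsics or the original poster's code; the helper names (`cmp_lt_lane`, `blendv_lane`, `cmp_blend8`) are hypothetical and exist purely for illustration.

```c
#include <stdint.h>

/* Scalar model of one VCMPLTPS lane (predicate 1, "less than"):
   all ones if a < b, otherwise all zeros. Hypothetical helper. */
static uint32_t cmp_lt_lane(float a, float b) {
    return (a < b) ? 0xFFFFFFFFu : 0u;
}

/* Scalar model of one VBLENDVPS lane: only the mask's sign bit matters.
   Sign bit set -> take b, sign bit clear -> take a. Hypothetical helper. */
static float blendv_lane(float a, float b, uint32_t mask) {
    return (mask & 0x80000000u) ? b : a;
}

/* Models blendv(a, b, cmp_lt(a, b)) over eight float lanes,
   mirroring the argument order of the IR in the excerpt. */
void cmp_blend8(const float a[8], const float b[8], float out[8]) {
    for (int i = 0; i < 8; ++i) {
        out[i] = blendv_lane(a[i], b[i], cmp_lt_lane(a[i], b[i]));
    }
}
```

Under this model each output lane holds b where a < b and a otherwise, which is why the compare result can feed the blend directly as its mask operand.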
2011 Jun 02
0
[LLVMdev] AVX Status?
...8 1) nounwind readnone
> %res = tail call <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float>
> %a, <8 x float> %b, <8 x float> %cmp) nounwind readnone
> ret <8 x float> %res
> }
>
> This works fine and produces the expected assembly (VCMPLTPS + VBLENDVPS).
>
> On the other hand, this does not work:
>
> define <8 x float> @test2(<8 x float> %a, <8 x float> %b, <8 x i32> %m)
> nounwind readnone {
> entry:
> %cmp = tail call <8 x float> @llvm.x86.avx.cmp.ps.256(<8 x float> %a,
> <8 x f...
2011 Jun 03
1
[LLVMdev] AVX Status?
...>> %res = tail call <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float>
>> %a, <8 x float> %b, <8 x float> %cmp) nounwind readnone
>> ret <8 x float> %res
>> }
>>
>> This works fine and produces the expected assembly (VCMPLTPS + VBLENDVPS).
>>
>> On the other hand, this does not work:
>>
>> define <8 x float> @test2(<8 x float> %a, <8 x float> %b, <8 x i32> %m)
>> nounwind readnone {
>> entry:
>> %cmp = tail call <8 x float> @llvm.x86.avx.cmp.ps.256(<8...
2012 Jul 27
3
[LLVMdev] X86 FMA4
...o be entered, allowing you to mix/match with normal SSE and take advantage of the lack of implicit arguments and the nice non-destructive encoding if you choose to (in case you can't tell, I like the non-destructive encoding a lot).
* This is important since SSE instructions with implicit operands (e.g. blendvps, which becomes vblendvps under AVX) now have an explicit operand when instantiated as a 128-bit AVX instruction.
** NOTE The dirty state is not synonymous with the upper bits of all of the ymm registers being zero. See the Intel AVX optimization guide.
> As I am sure you are aware, we cannot use SSE (movaps) instructions in an...
2012 Jul 27
0
[LLVMdev] X86 FMA4
Hey Michael,
Thanks for the legwork!
It appears that the stats you listed are for movaps [SSE], not vmovaps
[AVX]. I would *assume* that vmovaps(m128) is closer to vmovaps(m256),
since they are both AVX instructions. Although, yes, I agree that this is
not clear from Agner's report. Please correct me if I am misunderstanding.
As I am sure you are aware, we cannot use SSE (movaps)
2012 Jul 27
2
[LLVMdev] X86 FMA4
Just looked up the numbers from Agner Fog for Sandy Bridge for vmovaps/etc for loading/storing from memory.
vmovaps - load takes 1 load µop, 3 latency, with a reciprocal throughput of 0.5.
vmovaps - store takes 1 store µop, 1 load µop for address calculation, 3 latency, with a reciprocal throughput of 1.
He does not list vmovsd, but movsd has the same stats as vmovaps, so I feel it is a