thr3ads.net - llvm dev - [llvm-dev] X86 TRUNCATE cost for AVX & AVX2 mode [Apr 2016]

Nema, Ashutosh via llvm-dev

2016-Apr-12 09:48 UTC

[llvm-dev] X86 TRUNCATE cost for AVX & AVX2 mode

<Copied Cong>

Thanks Elena.

Mostly I was interested in why such a high cost (30) is kept for TRUNCATE
v16i32 to v16i8 in SSE4.1.
Looking at the code, TRUNCATE v16i32 to v16i8 appears to be far more expensive
in SSE4.1 than in SSE2. I feel this number should be the same as, or close to,
the cost listed for the same operation in SSE2ConversionTbl.

The patch below from Cong Hou reduced the cost for the same operation in SSE2
mode:
http://reviews.llvm.org/rL256194

It looks like the cost for TRUNCATE v16i32 to v16i8 in SSE4.1 should have been
reduced as part of the same patch.

Regards,
Ashutosh

From: Demikhovsky, Elena [mailto:elena.demikhovsky at intel.com]
Sent: Monday, April 11, 2016 9:05 PM
To: Nema, Ashutosh <Ashutosh.Nema at amd.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>; Zuckerman, Michael
<michael.zuckerman at intel.com>
Subject: RE: X86 TRUNCATE cost for AVX & AVX2 mode

Hi,

I once worked hard on refactoring the cost calculation for all X86 targets:
http://reviews.llvm.org/D15604
But this revision was not accepted.

I fixed the conversions, and I assume that truncation suffers from the same
problem. I used "SplitFactor" in order to process wide types.

I'd be happy if you tried to revive this work, or part of it, because the huge
numbers cause a non-optimal vectorization factor to be chosen.
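
The thread does not spell out how "SplitFactor" works, and the sketch below is
not taken from D15604. It only illustrates one plausible reading: count how
many legal-width registers a wide vector occupies, then scale the cost of a
single legal-width operation by that count. The function names and the
per-piece cost are assumptions made up for illustration.

```cpp
#include <cassert>

// Hypothetical helper: number of legal-width pieces a vector is split into.
// numElts * eltBits is the total vector width; regBits is the widest legal
// vector register (128 for SSE, 256 for AVX/AVX2).
unsigned splitFactor(unsigned numElts, unsigned eltBits, unsigned regBits) {
  assert(regBits > 0 && "register width must be non-zero");
  unsigned totalBits = numElts * eltBits;
  return (totalBits + regBits - 1) / regBits;  // ceiling division
}

// Hypothetical cost model: pieces * cost of one legal-width piece.
// This ignores any extra work needed to recombine the pieces.
unsigned splitCost(unsigned numElts, unsigned eltBits, unsigned regBits,
                   unsigned legalPieceCost) {
  return splitFactor(numElts, eltBits, regBits) * legalPieceCost;
}
```

Under these assumptions a v16i32 source (512 bits) splits into four pieces on
128-bit SSE and two pieces on 256-bit AVX, so the cost grows linearly with the
vector width instead of jumping to a large hand-written table entry.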

-           Elena

From: Nema, Ashutosh [mailto:Ashutosh.Nema at amd.com]
Sent: Monday, April 11, 2016 16:51
To: Demikhovsky, Elena <elena.demikhovsky at intel.com>; Zuckerman, Michael
<michael.zuckerman at intel.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Subject: X86 TRUNCATE cost for AVX & AVX2 mode

Hi,

I was going through X86TTIImpl::getCastInstrCost and have a question about the
cost calculation for the TRUNCATE instruction in AVX mode.

Neither AVX2ConversionTbl nor AVXConversionTbl defines a cost for TRUNCATE
v16i32 to v16i8, so the lookup falls back to SSE41ConversionTbl, where it finds
a cost of 30 for this operation. A cost of 30 looks very high.

I am wondering why such a high cost is kept for this; any pointers to help
understand it would be appreciated.
In a few cases this restricts better vectorization opportunities.

Other observations:
The cost for TRUNCATE v16i32 to v16i8 in SSE2ConversionTbl is 7.
The cost for TRUNCATE v8i32 to v8i8 is 2 in AVX2 mode and 4 in AVX mode.
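
The fallback described above can be sketched as a small standalone C++17
program. This is a simplified model, not the actual X86TTIImpl code: the enum
names, struct names, and the castCost function are hypothetical, and each table
holds only the entries quoted in this mail.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical, simplified stand-ins for LLVM's opcodes and vector types.
enum class Op { Truncate };
enum class VT { v8i8, v16i8, v8i32, v16i32 };

struct ConvEntry {
  Op op;
  VT dst;
  VT src;
  unsigned cost;
};
using ConvTable = std::vector<ConvEntry>;

// Only the entries mentioned in this thread; the real tables are much larger.
const ConvTable AVX2Tbl = {{Op::Truncate, VT::v8i8, VT::v8i32, 2}};
const ConvTable AVXTbl = {{Op::Truncate, VT::v8i8, VT::v8i32, 4}};
const ConvTable SSE41Tbl = {{Op::Truncate, VT::v16i8, VT::v16i32, 30}};
const ConvTable SSE2Tbl = {{Op::Truncate, VT::v16i8, VT::v16i32, 7}};

struct Features { bool avx2, avx, sse41, sse2; };
struct Hit { const char *table; unsigned cost; };

// Linear search of one table for an exact (op, dst, src) match.
std::optional<unsigned> lookup(const ConvTable &tbl, Op op, VT dst, VT src) {
  for (const ConvEntry &e : tbl)
    if (e.op == op && e.dst == dst && e.src == src)
      return e.cost;
  return std::nullopt;
}

// Walk the tables from the newest enabled ISA level to the oldest and return
// the first match. A match in SSE4.1 hides any cheaper entry in SSE2.
std::optional<Hit> castCost(const Features &f, Op op, VT dst, VT src) {
  assert(f.sse2 && "this sketch assumes at least SSE2");
  struct Tier { bool enabled; const char *name; const ConvTable *tbl; };
  const Tier tiers[] = {{f.avx2, "AVX2", &AVX2Tbl},
                        {f.avx, "AVX", &AVXTbl},
                        {f.sse41, "SSE41", &SSE41Tbl},
                        {f.sse2, "SSE2", &SSE2Tbl}};
  for (const Tier &t : tiers) {
    if (!t.enabled)
      continue;
    if (std::optional<unsigned> c = lookup(*t.tbl, op, dst, src))
      return Hit{t.name, *c};
  }
  return std::nullopt;
}
```

In this model a target with every feature enabled resolves TRUNCATE v16i32 to
v16i8 through the SSE41 table (cost 30), while a target with only SSE2 gets
cost 7, so the newer target is modelled as the more expensive one.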

Thanks,
Ashutosh



---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Demikhovsky, Elena via llvm-dev

2016-Apr-12 11:05 UTC

Where is the problem? In non-optimal code generated for TRUNCATE, or in the
cost calculation in the conversion tables?
In the revision you mention (rL256194), Cong optimized the code and put the new
numbers in the cost model.

-           Elena

Nema, Ashutosh via llvm-dev

2016-Apr-12 12:46 UTC

The problem is in the cost table for SSE4.1, where the cost for TRUNCATE
(v16i32 to v16i8) is set very high.
Because of this, in a few cases the compiler finds VF16 too costly and selects
a VF below 16, which results in less optimal code generation.

The patch I mentioned optimizes TRUNCATE (v16i32 to v16i8) for SSE2 and
SSE4.1.
But the code generator may never see this instruction, because the vectorizer
might not generate it given the high cost.

That patch already changed the TRUNCATE (v16i32 to v16i8) cost for SSE2,
but it looks like we missed changing the cost for SSE4.1.
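
To illustrate how a single table entry can change the chosen VF, here is a
minimal sketch that picks the width with the lowest estimated cost per lane.
It is a simplified model of the idea, not the loop vectorizer's code. The names
pickVF and loopCosts are hypothetical, and every number is made up except the
two truncate costs discussed in this thread (30 and 7).

```cpp
#include <cassert>
#include <vector>

struct VFCost {
  unsigned vf;    // vectorization factor (lanes per vector iteration)
  unsigned cost;  // estimated cost of one loop iteration at that width
};

// Return the VF with the lowest cost per lane; ties keep the narrower VF.
unsigned pickVF(const std::vector<VFCost> &candidates) {
  assert(!candidates.empty() && "need at least the scalar candidate");
  unsigned bestVF = candidates.front().vf;
  double bestPerLane =
      static_cast<double>(candidates.front().cost) / candidates.front().vf;
  for (const VFCost &c : candidates) {
    assert(c.vf > 0 && "vectorization factor must be positive");
    double perLane = static_cast<double>(c.cost) / c.vf;
    if (perLane < bestPerLane) {
      bestPerLane = perLane;
      bestVF = c.vf;
    }
  }
  return bestVF;
}

// Hypothetical loop body. The costs at VF 1, 4 and 8 and the cost of the
// other operations at VF16 are invented; only the truncate cost at VF16 is
// supplied by the caller.
std::vector<VFCost> loopCosts(unsigned truncCostAtVF16) {
  const unsigned otherOpsAtVF16 = 8;
  return {{1, 3}, {4, 6}, {8, 10}, {16, otherOpsAtVF16 + truncCostAtVF16}};
}
```

With a truncate cost of 30, VF16 costs 38 / 16 = 2.375 per lane and loses to
VF8 at 10 / 8 = 1.25. With a truncate cost of 7, VF16 costs 15 / 16 = 0.9375
per lane and wins.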

Thanks,
Ashutosh
