thr3ads.net - llvm dev - [LLVMdev] Loop Vectorization and Store-Load Forwarding issue [Jun 2015]

Das, Dibyendu

2015-Jun-12 05:11 UTC

[LLVMdev] Loop Vectorization and Store-Load Forwarding issue

I have been looking into this small test case (Part A) where loop vectorization
is disabled due to a possible store-load forwarding conflict (Part B). As you can
see, because of the dependence distance of 2, the loop is vectorizable only
for a width of 2. However, the dependence distance of 15 (due to
y[j-15]) causes a store-load forwarding issue: the store packet y[16:17]
(iteration j=16) partially overlaps the load packets y[15:16] (iteration
j=30) and y[17:18] (iteration j=32). Because such conflicts introduce additional
delays in the store->load forwarding pipes, they are modeled in the method
MemoryDepChecker::couldPreventStoreLoadForward() in LoopAccessAnalysis.cpp. The
function may turn off vectorization in the presence of such conflicts. Looking
through the code, I get the feeling that it may be more conservative than
desired: if the dependence distance is large, the conflicting store may have
flushed out of the store pipe by the time the load is issued, and
vectorization may become beneficial.

I am seeing some performance improvements when I disable the method above. This
is for x86. Hence I am seeking some advice on how to improve the following
logic. Can we model NumCyclesForStoreLoadThroughMemory better? Is it perhaps way
too high? Or are there other ways to circumvent the basic problem?

-TIA
Dibyendu

Part A:
  // 64 for the test case shown (TypeByteSize is 8)
  const unsigned NumCyclesForStoreLoadThroughMemory = 8*TypeByteSize;
  // Maximum vector factor.
  unsigned MaxVFWithoutSLForwardIssues =
    VectorizerParams::MaxVectorWidth * TypeByteSize;
  if(MaxSafeDepDistBytes < MaxVFWithoutSLForwardIssues)
    MaxVFWithoutSLForwardIssues = MaxSafeDepDistBytes;

  for (unsigned vf = 2*TypeByteSize; vf <= MaxVFWithoutSLForwardIssues;
       vf *= 2) {
    if (Distance % vf && Distance / vf < NumCyclesForStoreLoadThroughMemory) {
      MaxVFWithoutSLForwardIssues = (vf >>=1);
      break;
    }
  }

  if (MaxVFWithoutSLForwardIssues< 2*TypeByteSize) {
    DEBUG(dbgs() << "LAA: Distance " << Distance <<
          " that could cause a store-load forwarding conflict\n");
    return true;
  }
----------------------------
Part B:
typedef unsigned long long uint64;

void foo(const unsigned char *m, unsigned int block, uint64 y[80])
{
    const unsigned char *sblock;
    int i, j;

    for (i = 0; i < (int) block; i++) {
        sblock = m + (i << 7);

        for (j = 16; j < 80; j++) {
           y[j] = y[j - 2] + y[j - 15] ;
        }
    }
}
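
The packet overlap described in the message can be checked against this loop
with a small model. This is only a sketch under assumptions: a vector width of
2 elements, packets treated as half-open element ranges, and the names Packet,
partiallyOverlaps, and countPartialOverlaps invented for the illustration (none
of them is LLVM code).

```cpp
// Illustrative model only; none of these names come from LLVM.
// A "packet" is the half-open element range [begin, end) touched by one
// vector access.
struct Packet {
  int begin;
  int end;
};

// True if the two packets share some elements but are not identical, i.e.
// the load could not be satisfied by forwarding from that single store.
constexpr bool partiallyOverlaps(Packet a, Packet b) {
  bool intersects = a.begin < b.end && b.begin < a.end;
  bool identical = a.begin == b.begin && a.end == b.end;
  return intersects && !identical;
}

// For the loop  y[j] = y[j - 2] + y[j - 15],  j = 16..79, vectorized by vf:
// count the y[j - loadOffset] load packets that partially overlap the store
// packet written by the vector iteration starting at storeIter.
constexpr int countPartialOverlaps(int storeIter, int loadOffset, int vf) {
  Packet store{storeIter, storeIter + vf};
  int count = 0;
  for (int j = 16; j < 80; j += vf) {
    Packet load{j - loadOffset, j - loadOffset + vf};
    if (partiallyOverlaps(store, load))
      ++count;
  }
  return count;
}
```

For the store issued at j=16, countPartialOverlaps(16, 15, 2) finds the two
partially overlapping loads (j=30 and j=32), while countPartialOverlaps(16, 2, 2)
finds none, since the y[j-2] load at j=18 reads exactly the packet that was
stored.
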
Part C:
<snip> from the debug dump during the LoopAccessAnalysis phase:

LAA: Checking memory dependencies
LAA: Src Scev: {(8 + %y),+,8}<%for.body3>Sink Scev: {(128 + %y),+,8}<nsw><%for.body3>(Induction step: 1)
LAA: Distance for   %3 = load i64, i64* %arrayidx6, align 8 to   store i64 %add, i64* %arrayidx8, align 8: 120
LAA: Distance 120 that could cause a store-load forwarding conflict
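
To see how the threshold drives this result, the quoted heuristic can be
restated as a standalone function. A sketch only: the name mayBlockForwarding
and its parameter list are hypothetical, and the maximum vector width (64) and
maximum safe dependence distance (16 bytes) used in the note below are assumed
values, not ones shown in the debug dump.

```cpp
// Standalone restatement of the quoted heuristic for experimentation.
// The function name and all parameters are hypothetical, not LLVM's interface.
// distance, typeByteSize and maxSafeDepDistBytes are in bytes.
constexpr bool mayBlockForwarding(unsigned distance, unsigned typeByteSize,
                                  unsigned maxSafeDepDistBytes,
                                  unsigned maxVectorWidth,
                                  unsigned cyclesThroughMemory) {
  // Largest vector size (in bytes) still considered free of forwarding issues.
  unsigned maxVF = maxVectorWidth * typeByteSize;
  if (maxSafeDepDistBytes < maxVF)
    maxVF = maxSafeDepDistBytes;

  // Try vector sizes of 2, 4, 8, ... elements.
  for (unsigned vf = 2 * typeByteSize; vf <= maxVF; vf *= 2) {
    // Misaligned with the vector size and close enough to matter.
    if (distance % vf != 0 && distance / vf < cyclesThroughMemory) {
      maxVF = vf >> 1;
      break;
    }
  }

  // Not even a vector of 2 elements is free of conflicts.
  return maxVF < 2 * typeByteSize;
}
```

With distance 120 and 8-byte elements, mayBlockForwarding(120, 8, 16, 64, 64)
returns true, matching the dump: 120 % 16 is nonzero and 120 / 16 = 7 is below
8 * 8 = 64. Lowering the last argument to 7 makes it return false, which is one
way to explore how sensitive the decision is to the constant.
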





Das, Dibyendu

2015-Jun-12 05:16 UTC

Typo. The first sentence should read:
... test case (Part B) where loop vectorization is disabled due to possible
store-load forwarding conflict (Part C).


Gerolf Hoflehner

2015-Jun-12 22:06 UTC

+Adam

I’m seeing more cases where the compiler makes guesses about the processor
rather than querying a machine model. Instead of a sophisticated model, there
could be a basic, lightweight machine description file that is queried when
it is available. In this specific example, a formula like 'dependence
distance / width > store2load_fwd_delay' would help conflict modeling.
Does that sound like a promising path forward?
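
The formula quoted above could be sketched as follows. Everything here is an
assumption made for illustration: the MachineHints struct, the function name,
the choice of elements and cycles as units, and the conservative fallback when
no delay is available. None of it is an existing LLVM interface.

```cpp
// Hypothetical lightweight machine description; not an LLVM type.
struct MachineHints {
  bool hasStoreToLoadFwdDelay;   // false when the target provides no value
  unsigned storeToLoadFwdDelay;  // in cycles; meaningful only if the flag is set
};

// Applies  distance / width > store2load_fwd_delay.
// Returns true when the conflict is far enough away to be ignored.
// Without a delay value the answer is conservatively false.
constexpr bool conflictCanBeIgnored(unsigned distanceInElements,
                                    unsigned widthInElements,
                                    MachineHints hints) {
  return hints.hasStoreToLoadFwdDelay &&
         distanceInElements / widthInElements > hints.storeToLoadFwdDelay;
}
```

For the test case in this thread (distance 15, width 2) the quotient is 7, so
the conflict would be ignored for any delay below 7 cycles under these
assumptions.
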

Cheers
Gerolf




Das, Dibyendu

2015-Jun-13 15:04 UTC

Thx Gerolf. Let me investigate your suggestion.


Arnold

2015-Jun-13 19:25 UTC

I think this should be a call on TargetTransformInfo (TTI), similar to the
instruction costs (TTI is a machine model). Targets can override it with the
right value.

If we add a(nother) machine model, we also have to implement the APIs to query
it. I don't think we would save any complexity here, and we would add another
model to maintain.

If we want to have this in a description file, I think we should express it in
terms of the existing machine sched model. We could, for example, express it in
terms of the existing load and store latencies and then have TTI query it through
TargetLoweringInfo. Though I am not convinced that is necessary.
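
The idea of a target-overridable query could look roughly like this. It is a
sketch with made-up names (DefaultTargetSketch, TunedTargetSketch,
storeLoadForwardDelay, distanceTooShort) and a made-up delay of 5; nothing in
this thread shows such a member on the real TargetTransformInfo. Static
polymorphism is used here only to keep the sketch checkable at compile time.

```cpp
// Hypothetical stand-in for a target hook; not part of LLVM.
struct DefaultTargetSketch {
  // Conservative default, mirroring the 8 * TypeByteSize constant.
  constexpr unsigned storeLoadForwardDelay(unsigned typeByteSize) const {
    return 8 * typeByteSize;
  }
};

// A target that knows its pipeline supplies its own value.
struct TunedTargetSketch : DefaultTargetSketch {
  constexpr unsigned storeLoadForwardDelay(unsigned) const {
    return 5; // made-up value for illustration
  }
};

// The dependence check asks the target instead of using a fixed constant.
// distance and vf are in bytes, as in the quoted heuristic.
template <typename Target>
constexpr bool distanceTooShort(const Target &target, unsigned distance,
                                unsigned vf, unsigned typeByteSize) {
  return distance % vf != 0 &&
         distance / vf < target.storeLoadForwardDelay(typeByteSize);
}
```

With distance 120 and a 16-byte vector, the default target still reports the
distance as too short (7 < 64), while the tuned target does not (7 is not
below 5).
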

