thr3ads.net - llvm dev - [llvm-dev] Lowering llvm.memset for ARM target [Sep 2017]

If this information is useful, please help other people find it:
Share via:

Bharathi Seshadri via llvm-dev

2017-Sep-05 19:23 UTC

[llvm-dev] Lowering llvm.memset for ARM target

As reported in an earlier thread
(http://clang-developers.42468.n3.nabble.com/Disable-memset-synthesis-tp4057810.html),
we noticed in some cases that the llvm.memset intrinsic, if lowered to
stores, could help with performance.

Here's a test case: If LIMIT is > 8, I see that a call to memset is
emitted for arm & aarch64, but not for x86 target.

typedef struct {
    int v0[100];
} test;
#define LIMIT 9
void init(test *t)
{
    int i;
    for (i = 0; i < LIMIT ; i++)
      t->v0[i] = 0;
}
int main() {
test t;
init(&t);
return 0;
}

Looking at the llvm sources, I see that there are two key target
specific variables, MaxStoresPerMemset and MaxStoresPerMemsetOptSize,
that determine if the intrinsic llvm.memset can be lowered into store
operations. For ARM, these variables are set to 8 and 4 respectively.

I do not know as to how the default values for these two variables are
arrived at, but doubling these values (similar to that for the x86
target) seems to help our case and we observe a 7% increase in
performance of our networking application. We use -O3 and -flto and
32-bit arm.

I can prepare a patch and post for review if such a change, say under
CodeGenOpt::Aggressive would be acceptable.

Thanks,
Bharathi

Evgeny Astigeevich via llvm-dev

2017-Sep-07 15:24 UTC

head link

[llvm-dev] Lowering llvm.memset for ARM target

Hi Bharathi,

MaxStoresPerMemset  was changed from 16 to 8 in r 169791. The commit comment:

"Some enhancements for memcpy / memset inline expansion.
1. Teach it to use overlapping unaligned load / store to copy / set the trailing
   bytes. e.g. On 86, use two pairs of movups / movaps for 17 - 31 byte copies.
2. Use f64 for memcpy / memset on targets where i64 is not legal but f64 is.
e.g.
   x86 and ARM.
3. When memcpy from a constant string, do *not* replace the load with a constant
   if it's not possible to materialize an integer immediate with a single
   instruction (required a new target hook: TLI.isIntImmLegal()).
4. Use unaligned load / stores more aggressively if target hooks indicates they
   are "fast".
5. Update ARM target hooks to use unaligned load / stores. e.g. vld1.8 / vst1.8.
   Also increase the threshold to something reasonable (8 for memset, 4 pairs
   for memcpy).

This significantly improves Dhrystone, up to 50% on ARM iOS devices.

rdar://12760078"

It's strange. According to the comment the threshold was increased but it is
decreased. I think the code needs to be revisited and benchmarked.
I'll do some benchmarking.

Thanks,
Evgeny Astigeevich | Arm Compiler Optimization Team Lead

> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
> Bharathi Seshadri via llvm-dev
> Sent: Tuesday, September 05, 2017 8:24 PM
> To: llvm-dev at lists.llvm.org
> Subject: [llvm-dev] Lowering llvm.memset for ARM target
> 
> As reported in an earlier thread
> (http://clang-developers.42468.n3.nabble.com/Disable-memset-synthesis-
> tp4057810.html),
> we noticed in some cases that the llvm.memset intrinsic, if lowered to
stores,
> could help with performance.
> 
> Here's a test case: If LIMIT is > 8, I see that a call to memset is
emitted for arm
> & aarch64, but not for x86 target.
> 
> typedef struct {
>     int v0[100];
> } test;
> #define LIMIT 9
> void init(test *t)
> {
>     int i;
>     for (i = 0; i < LIMIT ; i++)
>       t->v0[i] = 0;
> }
> int main() {
> test t;
> init(&t);
> return 0;
> }
> 
> Looking at the llvm sources, I see that there are two key target specific
> variables, MaxStoresPerMemset and MaxStoresPerMemsetOptSize, that
> determine if the intrinsic llvm.memset can be lowered into store
operations.
> For ARM, these variables are set to 8 and 4 respectively.
> 
> I do not know as to how the default values for these two variables are
> arrived at, but doubling these values (similar to that for the x86
> target) seems to help our case and we observe a 7% increase in performance
> of our networking application. We use -O3 and -flto and 32-bit arm.
> 
> I can prepare a patch and post for review if such a change, say under
> CodeGenOpt::Aggressive would be acceptable.
> 
> Thanks,
> Bharathi
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Evgeny Astigeevich via llvm-dev

2017-Sep-08 15:22 UTC

head link

[llvm-dev] Lowering llvm.memset for ARM target

Hi Bharathi,

From the discussion you provided I found that the issue happens for a big-endian
ARM target.
For the little-endian target the intrinsic in your test case is lowered to store
instructions.
Some debugging is needed to figure out why it's not happening for
big-endian.

-Evgeny
> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
> Evgeny Astigeevich via llvm-dev
> Sent: Thursday, September 07, 2017 4:25 PM
> To: Bharathi Seshadri
> Cc: llvm-dev; nd
> Subject: Re: [llvm-dev] Lowering llvm.memset for ARM target
> 
> Hi Bharathi,
> 
> MaxStoresPerMemset  was changed from 16 to 8 in r 169791. The commit
> comment:
> 
> "Some enhancements for memcpy / memset inline expansion.
> 1. Teach it to use overlapping unaligned load / store to copy / set the
trailing
>    bytes. e.g. On 86, use two pairs of movups / movaps for 17 - 31 byte
copies.
> 2. Use f64 for memcpy / memset on targets where i64 is not legal but f64
is.
> e.g.
>    x86 and ARM.
> 3. When memcpy from a constant string, do *not* replace the load with a
> constant
>    if it's not possible to materialize an integer immediate with a
single
>    instruction (required a new target hook: TLI.isIntImmLegal()).
> 4. Use unaligned load / stores more aggressively if target hooks indicates
> they
>    are "fast".
> 5. Update ARM target hooks to use unaligned load / stores. e.g. vld1.8 /
> vst1.8.
>    Also increase the threshold to something reasonable (8 for memset, 4
pairs
>    for memcpy).
> 
> This significantly improves Dhrystone, up to 50% on ARM iOS devices.
> 
> rdar://12760078"
> 
> It's strange. According to the comment the threshold was increased but
it is
> decreased. I think the code needs to be revisited and benchmarked.
> I'll do some benchmarking.
> 
> Thanks,
> Evgeny Astigeevich | Arm Compiler Optimization Team Lead
> 
> 
> > -----Original Message-----
> > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf
Of
> > Bharathi Seshadri via llvm-dev
> > Sent: Tuesday, September 05, 2017 8:24 PM
> > To: llvm-dev at lists.llvm.org
> > Subject: [llvm-dev] Lowering llvm.memset for ARM target
> >
> > As reported in an earlier thread
> > (http://clang-developers.42468.n3.nabble.com/Disable-memset-synthesis-
> > tp4057810.html),
> > we noticed in some cases that the llvm.memset intrinsic, if lowered to
> > stores, could help with performance.
> >
> > Here's a test case: If LIMIT is > 8, I see that a call to
memset is
> > emitted for arm & aarch64, but not for x86 target.
> >
> > typedef struct {
> >     int v0[100];
> > } test;
> > #define LIMIT 9
> > void init(test *t)
> > {
> >     int i;
> >     for (i = 0; i < LIMIT ; i++)
> >       t->v0[i] = 0;
> > }
> > int main() {
> > test t;
> > init(&t);
> > return 0;
> > }
> >
> > Looking at the llvm sources, I see that there are two key target
> > specific variables, MaxStoresPerMemset and
> MaxStoresPerMemsetOptSize,
> > that determine if the intrinsic llvm.memset can be lowered into store
> operations.
> > For ARM, these variables are set to 8 and 4 respectively.
> >
> > I do not know as to how the default values for these two variables are
> > arrived at, but doubling these values (similar to that for the x86
> > target) seems to help our case and we observe a 7% increase in
> > performance of our networking application. We use -O3 and -flto and
32-bit
> arm.
> >
> > I can prepare a patch and post for review if such a change, say under
> > CodeGenOpt::Aggressive would be acceptable.
> >
> > Thanks,
> > Bharathi
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Evgeny Astigeevich via llvm-dev

2017-Sep-11 13:27 UTC

head link

[llvm-dev] Lowering llvm.memset for ARM target

Hi Bharathi,

'-mfpu=vfp ' is  the root cause of the problem. It means:  VFPv2,
disabled Advanced SIMD extension.
The intrinsic is lowered into stores only when Advanced SIMD extension is
enabled. So, if your target supports the Advanced SIMD extension, the workaround
is '-mfpu=neon'.
I'll check what is happening when the Advanced SIMD extension is disabled.

Thanks,
Evgeny
> -----Original Message-----
> From: Bharathi Seshadri [mailto:bharathi.seshadri at gmail.com]
> Sent: Friday, September 08, 2017 9:39 PM
> To: Evgeny Astigeevich
> Subject: Re: [llvm-dev] Lowering llvm.memset for ARM target
> 
> Hi Evgeny,
> 
> Even for a litte-endian ARM target,  I don't see that the intrinsic is
lowered
> into stores. I checked with llvm38, llvm40 and a somewhat recent trunk
> (about a month old). I'm not sure what I'm missing.
> 
> For my test case compiled using  -O3 -c --target=arm-linux-gnueabi -
> march=armv8-a+crc -mfloat-abi=hard -no-integrated-as -mfpu=vfp, I get
> 
> bash-4.1$ cat trymem2.c
> typedef struct {
>     int v0[100];
> } test;
> #define LIMIT 9
> void init(test *t)
> {
>     int i;
>     for (i = 0; i < LIMIT ; i++)
>       t->v0[i] = 0;
> }
> int main() {
> test t;
> init(&t);
> return 0;
> }
> 
> 
> $objdump -d
> 00000000 <init>:
>    0:   e92d4800        push    {fp, lr}
>    4:   e1a0b00d        mov     fp, sp
>    8:   e3a01000        mov     r1, #0
>    c:   e3a02024        mov     r2, #36 ; 0x24
>   10:   ebfffffe        bl      0 <memset> <====== Call to memset
>   14:   e8bd8800        pop     {fp, pc}
> 00000018 <main>:
>   18:   e3a00000        mov     r0, #0
>   1c:   e12fff1e        bx      lr
> 
> With my patched clang to modify the MaxMemsetPerStores for ARM to 16, I
> get Disassembly of section .text:
> 00000000 <init>:
>    0:   e3a01000        mov     r1, #0
>    4:   e5801020        str     r1, [r0, #32]
>    8:   e5801004        str     r1, [r0, #4]
>    c:   e5801008        str     r1, [r0, #8]
>   10:   e580100c        str     r1, [r0, #12]
>   14:   e5801010        str     r1, [r0, #16]
>   18:   e5801014        str     r1, [r0, #20]
>   1c:   e5801018        str     r1, [r0, #24]
>   20:   e580101c        str     r1, [r0, #28]
>   24:   e5801000        str     r1, [r0]
>   28:   e12fff1e        bx      lr
> 0000002c <main>:
>   2c:   e3a00000        mov     r0, #0
>   30:   e12fff1e        bx      lr
> 
> 
> Thanks,
> 
> Bharathi
> 
> On Fri, Sep 8, 2017 at 8:22 AM, Evgeny Astigeevich
> <Evgeny.Astigeevich at arm.com> wrote:
> > Hi Bharathi,
> >
> > From the discussion you provided I found that the issue happens for a
big-
> endian ARM target.
> > For the little-endian target the intrinsic in your test case is
lowered to store
> instructions.
> > Some debugging is needed to figure out why it's not happening for
big-
> endian.
> >
> > -Evgeny
> >
> >> -----Original Message-----
> >> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On
Behalf Of
> >> Evgeny Astigeevich via llvm-dev
> >> Sent: Thursday, September 07, 2017 4:25 PM
> >> To: Bharathi Seshadri
> >> Cc: llvm-dev; nd
> >> Subject: Re: [llvm-dev] Lowering llvm.memset for ARM target
> >>
> >> Hi Bharathi,
> >>
> >> MaxStoresPerMemset  was changed from 16 to 8 in r 169791. The
commit
> >> comment:
> >>
> >> "Some enhancements for memcpy / memset inline expansion.
> >> 1. Teach it to use overlapping unaligned load / store to copy /
set the
> trailing
> >>    bytes. e.g. On 86, use two pairs of movups / movaps for 17 - 31
byte
> copies.
> >> 2. Use f64 for memcpy / memset on targets where i64 is not legal
but f64
> is.
> >> e.g.
> >>    x86 and ARM.
> >> 3. When memcpy from a constant string, do *not* replace the load
with
> >> a constant
> >>    if it's not possible to materialize an integer immediate
with a single
> >>    instruction (required a new target hook: TLI.isIntImmLegal()).
> >> 4. Use unaligned load / stores more aggressively if target hooks
> >> indicates they
> >>    are "fast".
> >> 5. Update ARM target hooks to use unaligned load / stores. e.g.
> >> vld1.8 / vst1.8.
> >>    Also increase the threshold to something reasonable (8 for
memset, 4
> pairs
> >>    for memcpy).
> >>
> >> This significantly improves Dhrystone, up to 50% on ARM iOS
devices.
> >>
> >> rdar://12760078"
> >>
> >> It's strange. According to the comment the threshold was
increased
> >> but it is decreased. I think the code needs to be revisited and
> benchmarked.
> >> I'll do some benchmarking.
> >>
> >> Thanks,
> >> Evgeny Astigeevich | Arm Compiler Optimization Team Lead
> >>
> >>
> >> > -----Original Message-----
> >> > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On
Behalf
> >> > Of Bharathi Seshadri via llvm-dev
> >> > Sent: Tuesday, September 05, 2017 8:24 PM
> >> > To: llvm-dev at lists.llvm.org
> >> > Subject: [llvm-dev] Lowering llvm.memset for ARM target
> >> >
> >> > As reported in an earlier thread
> >> > (http://clang-developers.42468.n3.nabble.com/Disable-memset-
> synthes
> >> > is-
> >> > tp4057810.html),
> >> > we noticed in some cases that the llvm.memset intrinsic, if
lowered
> >> > to stores, could help with performance.
> >> >
> >> > Here's a test case: If LIMIT is > 8, I see that a call
to memset is
> >> > emitted for arm & aarch64, but not for x86 target.
> >> >
> >> > typedef struct {
> >> >     int v0[100];
> >> > } test;
> >> > #define LIMIT 9
> >> > void init(test *t)
> >> > {
> >> >     int i;
> >> >     for (i = 0; i < LIMIT ; i++)
> >> >       t->v0[i] = 0;
> >> > }
> >> > int main() {
> >> > test t;
> >> > init(&t);
> >> > return 0;
> >> > }
> >> >
> >> > Looking at the llvm sources, I see that there are two key
target
> >> > specific variables, MaxStoresPerMemset and
> >> MaxStoresPerMemsetOptSize,
> >> > that determine if the intrinsic llvm.memset can be lowered
into
> >> > store
> >> operations.
> >> > For ARM, these variables are set to 8 and 4 respectively.
> >> >
> >> > I do not know as to how the default values for these two
variables
> >> > are arrived at, but doubling these values (similar to that
for the
> >> > x86
> >> > target) seems to help our case and we observe a 7% increase
in
> >> > performance of our networking application. We use -O3 and
-flto and
> >> > 32-bit
> >> arm.
> >> >
> >> > I can prepare a patch and post for review if such a change,
say
> >> > under CodeGenOpt::Aggressive would be acceptable.
> >> >
> >> > Thanks,
> >> > Bharathi
> >> > _______________________________________________
> >> > LLVM Developers mailing list
> >> > llvm-dev at lists.llvm.org
> >> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >> _______________________________________________
> >> LLVM Developers mailing list
> >> llvm-dev at lists.llvm.org
> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Evgeny Astigeevich via llvm-dev

2017-Oct-25 15:22 UTC

head link

[llvm-dev] Lowering llvm.memset for ARM target

Hi Bharathi,

I did some debugging, the current problem is that the same threshold values are
used for SIMD and non-SIMD memory instructions.
For the test you provided, when the SIMD extension is disabled the current
implementation of llvm.memset lowering finds out 9 store instructions will be
required. As 9 > 8 llvm.memset is lowered to a call.
When the SIMD is enabled, it finds out 3 stores (two vst1.32 + one str) will be
enough. As 3 < 8 llvm.memset is lowered to a sequence of stores.

So before changing the threshold values we need to figure out:

1. Do we need separate thresholds for SIMD and non-SIMD memory instructions? For
example, 8 for SIMD and 10-16 for non-SIMD. Some benchmarking is needed to find
proper value.
2. If we keep the single value, how much it should be increased? This might
affect performance of SIMD using applications. So benchmarking again.
3. If STRD are used then only 5 instructions are needed and llvm.memset is
lowered as expected. I don’t know why the variant with STRD is not considered,
maybe to avoid register pressure.

Hope this helps.

Thanks,
Evgeny Astigeevich 

-----Original Message-----
From: Evgeny Astigeevich <Evgeny.Astigeevich at arm.com>
Date: Monday, 11 September 2017 at 14:27
To: Bharathi Seshadri <bharathi.seshadri at gmail.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>, nd <nd at arm.com>
Subject: RE: [llvm-dev] Lowering llvm.memset for ARM target

    Hi Bharathi,
    
    '-mfpu=vfp ' is  the root cause of the problem. It means:  VFPv2,
disabled Advanced SIMD extension.
    The intrinsic is lowered into stores only when Advanced SIMD extension is
enabled. So, if your target supports the Advanced SIMD extension, the workaround
is '-mfpu=neon'.
    I'll check what is happening when the Advanced SIMD extension is
disabled.
    
    Thanks,
    Evgeny
    
    > -----Original Message-----
    > From: Bharathi Seshadri [mailto:bharathi.seshadri at gmail.com]
    > Sent: Friday, September 08, 2017 9:39 PM
    > To: Evgeny Astigeevich
    > Subject: Re: [llvm-dev] Lowering llvm.memset for ARM target
    > 
    > Hi Evgeny,
    > 
    > Even for a litte-endian ARM target,  I don't see that the intrinsic
is lowered
    > into stores. I checked with llvm38, llvm40 and a somewhat recent trunk
    > (about a month old). I'm not sure what I'm missing.
    > 
    > For my test case compiled using  -O3 -c --target=arm-linux-gnueabi -
    > march=armv8-a+crc -mfloat-abi=hard -no-integrated-as -mfpu=vfp, I get
    > 
    > bash-4.1$ cat trymem2.c
    > typedef struct {
    >     int v0[100];
    > } test;
    > #define LIMIT 9
    > void init(test *t)
    > {
    >     int i;
    >     for (i = 0; i < LIMIT ; i++)
    >       t->v0[i] = 0;
    > }
    > int main() {
    > test t;
    > init(&t);
    > return 0;
    > }
    > 
    > 
    > $objdump -d
    > 00000000 <init>:
    >    0:   e92d4800        push    {fp, lr}
    >    4:   e1a0b00d        mov     fp, sp
    >    8:   e3a01000        mov     r1, #0
    >    c:   e3a02024        mov     r2, #36 ; 0x24
    >   10:   ebfffffe        bl      0 <memset> <====== Call to
memset
    >   14:   e8bd8800        pop     {fp, pc}
    > 00000018 <main>:
    >   18:   e3a00000        mov     r0, #0
    >   1c:   e12fff1e        bx      lr
    > 
    > With my patched clang to modify the MaxMemsetPerStores for ARM to 16, I
    > get Disassembly of section .text:
    > 00000000 <init>:
    >    0:   e3a01000        mov     r1, #0
    >    4:   e5801020        str     r1, [r0, #32]
    >    8:   e5801004        str     r1, [r0, #4]
    >    c:   e5801008        str     r1, [r0, #8]
    >   10:   e580100c        str     r1, [r0, #12]
    >   14:   e5801010        str     r1, [r0, #16]
    >   18:   e5801014        str     r1, [r0, #20]
    >   1c:   e5801018        str     r1, [r0, #24]
    >   20:   e580101c        str     r1, [r0, #28]
    >   24:   e5801000        str     r1, [r0]
    >   28:   e12fff1e        bx      lr
    > 0000002c <main>:
    >   2c:   e3a00000        mov     r0, #0
    >   30:   e12fff1e        bx      lr
    > 
    > 
    > Thanks,
    > 
    > Bharathi
    > 
    > On Fri, Sep 8, 2017 at 8:22 AM, Evgeny Astigeevich
    > <Evgeny.Astigeevich at arm.com> wrote:
    > > Hi Bharathi,
    > >
    > > From the discussion you provided I found that the issue happens
for a big-
    > endian ARM target.
    > > For the little-endian target the intrinsic in your test case is
lowered to store
    > instructions.
    > > Some debugging is needed to figure out why it's not happening
for big-
    > endian.
    > >
    > > -Evgeny
    > >
    > >> -----Original Message-----
    > >> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On
Behalf Of
    > >> Evgeny Astigeevich via llvm-dev
    > >> Sent: Thursday, September 07, 2017 4:25 PM
    > >> To: Bharathi Seshadri
    > >> Cc: llvm-dev; nd
    > >> Subject: Re: [llvm-dev] Lowering llvm.memset for ARM target
    > >>
    > >> Hi Bharathi,
    > >>
    > >> MaxStoresPerMemset  was changed from 16 to 8 in r 169791. The
commit
    > >> comment:
    > >>
    > >> "Some enhancements for memcpy / memset inline expansion.
    > >> 1. Teach it to use overlapping unaligned load / store to copy
/ set the
    > trailing
    > >>    bytes. e.g. On 86, use two pairs of movups / movaps for 17
- 31 byte
    > copies.
    > >> 2. Use f64 for memcpy / memset on targets where i64 is not
legal but f64
    > is.
    > >> e.g.
    > >>    x86 and ARM.
    > >> 3. When memcpy from a constant string, do *not* replace the
load with
    > >> a constant
    > >>    if it's not possible to materialize an integer
immediate with a single
    > >>    instruction (required a new target hook:
TLI.isIntImmLegal()).
    > >> 4. Use unaligned load / stores more aggressively if target
hooks
    > >> indicates they
    > >>    are "fast".
    > >> 5. Update ARM target hooks to use unaligned load / stores.
e.g.
    > >> vld1.8 / vst1.8.
    > >>    Also increase the threshold to something reasonable (8 for
memset, 4
    > pairs
    > >>    for memcpy).
    > >>
    > >> This significantly improves Dhrystone, up to 50% on ARM iOS
devices.
    > >>
    > >> rdar://12760078"
    > >>
    > >> It's strange. According to the comment the threshold was
increased
    > >> but it is decreased. I think the code needs to be revisited
and
    > benchmarked.
    > >> I'll do some benchmarking.
    > >>
    > >> Thanks,
    > >> Evgeny Astigeevich | Arm Compiler Optimization Team Lead
    > >>
    > >>
    > >> > -----Original Message-----
    > >> > From: llvm-dev [mailto:llvm-dev-bounces at
lists.llvm.org] On Behalf
    > >> > Of Bharathi Seshadri via llvm-dev
    > >> > Sent: Tuesday, September 05, 2017 8:24 PM
    > >> > To: llvm-dev at lists.llvm.org
    > >> > Subject: [llvm-dev] Lowering llvm.memset for ARM target
    > >> >
    > >> > As reported in an earlier thread
    > >> >
(http://clang-developers.42468.n3.nabble.com/Disable-memset-
    > synthes
    > >> > is-
    > >> > tp4057810.html),
    > >> > we noticed in some cases that the llvm.memset intrinsic,
if lowered
    > >> > to stores, could help with performance.
    > >> >
    > >> > Here's a test case: If LIMIT is > 8, I see that a
call to memset is
    > >> > emitted for arm & aarch64, but not for x86 target.
    > >> >
    > >> > typedef struct {
    > >> >     int v0[100];
    > >> > } test;
    > >> > #define LIMIT 9
    > >> > void init(test *t)
    > >> > {
    > >> >     int i;
    > >> >     for (i = 0; i < LIMIT ; i++)
    > >> >       t->v0[i] = 0;
    > >> > }
    > >> > int main() {
    > >> > test t;
    > >> > init(&t);
    > >> > return 0;
    > >> > }
    > >> >
    > >> > Looking at the llvm sources, I see that there are two key
target
    > >> > specific variables, MaxStoresPerMemset and
    > >> MaxStoresPerMemsetOptSize,
    > >> > that determine if the intrinsic llvm.memset can be
lowered into
    > >> > store
    > >> operations.
    > >> > For ARM, these variables are set to 8 and 4 respectively.
    > >> >
    > >> > I do not know as to how the default values for these two
variables
    > >> > are arrived at, but doubling these values (similar to
that for the
    > >> > x86
    > >> > target) seems to help our case and we observe a 7%
increase in
    > >> > performance of our networking application. We use -O3 and
-flto and
    > >> > 32-bit
    > >> arm.
    > >> >
    > >> > I can prepare a patch and post for review if such a
change, say
    > >> > under CodeGenOpt::Aggressive would be acceptable.
    > >> >
    > >> > Thanks,
    > >> > Bharathi
    > >> > _______________________________________________
    > >> > LLVM Developers mailing list
    > >> > llvm-dev at lists.llvm.org
    > >> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
    > >> _______________________________________________
    > >> LLVM Developers mailing list
    > >> llvm-dev at lists.llvm.org
    > >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Possibly Parallel Threads

Search for more maybe matching threads

llvm dev - Sep 2017 - Lowering llvm.memset for ARM target

[llvm-dev] Lowering llvm.memset for ARM target

[llvm-dev] Lowering llvm.memset for ARM target

[llvm-dev] Lowering llvm.memset for ARM target

[llvm-dev] Lowering llvm.memset for ARM target

[llvm-dev] Lowering llvm.memset for ARM target

Possibly Parallel Threads