Bharathi Seshadri via llvm-dev
2017-Sep-05 19:23 UTC
[llvm-dev] Lowering llvm.memset for ARM target
As reported in an earlier thread (http://clang-developers.42468.n3.nabble.com/Disable-memset-synthesis-tp4057810.html), we noticed in some cases that the llvm.memset intrinsic, if lowered to stores, could help with performance.

Here's a test case: If LIMIT is > 8, I see that a call to memset is emitted for arm & aarch64, but not for x86 target.

    typedef struct {
        int v0[100];
    } test;
    #define LIMIT 9
    void init(test *t)
    {
        int i;
        for (i = 0; i < LIMIT ; i++)
            t->v0[i] = 0;
    }
    int main() {
        test t;
        init(&t);
        return 0;
    }

Looking at the llvm sources, I see that there are two key target specific variables, MaxStoresPerMemset and MaxStoresPerMemsetOptSize, that determine if the intrinsic llvm.memset can be lowered into store operations. For ARM, these variables are set to 8 and 4 respectively.

I do not know as to how the default values for these two variables are arrived at, but doubling these values (similar to that for the x86 target) seems to help our case and we observe a 7% increase in performance of our networking application. We use -O3 and -flto and 32-bit arm.

I can prepare a patch and post for review if such a change, say under CodeGenOpt::Aggressive would be acceptable.

Thanks,
Bharathi
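For reference, the loop in the test case writes LIMIT * sizeof(int) bytes of zeros, and that byte count is what the memset call receives once the loop is recognized as a memset. A minimal sketch of the equivalence in plain C follows; the helper names init_loop, init_memset and init_bytes are illustrative and do not come from the thread.

```c
#include <string.h>

typedef struct { int v0[100]; } test;
#define LIMIT 9

/* The loop from the test case, unchanged in behavior. */
void init_loop(test *t) {
    for (int i = 0; i < LIMIT; i++)
        t->v0[i] = 0;
}

/* The same effect written as an explicit memset over the first
   LIMIT elements of the array. */
void init_memset(test *t) {
    memset(t->v0, 0, LIMIT * sizeof(int));
}

/* Number of bytes cleared by either function. */
size_t init_bytes(void) {
    return LIMIT * sizeof(int);
}
```

With a 4-byte int this is 36 bytes, so nine word-sized stores would be needed to expand it inline, one more than the ARM MaxStoresPerMemset value of 8 mentioned above.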
Evgeny Astigeevich via llvm-dev
2017-Sep-07 15:24 UTC
[llvm-dev] Lowering llvm.memset for ARM target
Hi Bharathi,

MaxStoresPerMemset was changed from 16 to 8 in r169791. The commit comment:

"Some enhancements for memcpy / memset inline expansion.
1. Teach it to use overlapping unaligned load / store to copy / set the trailing bytes. e.g. On 86, use two pairs of movups / movaps for 17 - 31 byte copies.
2. Use f64 for memcpy / memset on targets where i64 is not legal but f64 is. e.g. x86 and ARM.
3. When memcpy from a constant string, do *not* replace the load with a constant if it's not possible to materialize an integer immediate with a single instruction (required a new target hook: TLI.isIntImmLegal()).
4. Use unaligned load / stores more aggressively if target hooks indicates they are "fast".
5. Update ARM target hooks to use unaligned load / stores. e.g. vld1.8 / vst1.8. Also increase the threshold to something reasonable (8 for memset, 4 pairs for memcpy).

This significantly improves Dhrystone, up to 50% on ARM iOS devices.

rdar://12760078"

It's strange. According to the comment the threshold was increased but it is decreased. I think the code needs to be revisited and benchmarked. I'll do some benchmarking.

Thanks,
Evgeny Astigeevich | Arm Compiler Optimization Team Lead
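Item 1 of the commit comment describes covering the trailing bytes with a second store that overlaps the first. A minimal sketch of that idea in portable C is shown here, under the assumption that a fixed-size 16-byte memcpy stands in for one unaligned vector store. The function name zero_16_to_32 is hypothetical, and this is not the code LLVM emits.

```c
#include <string.h>

/* Zero n bytes at dst, for 16 <= n <= 32, using exactly two 16-byte
   block writes. The second write starts at dst + n - 16, so it overlaps
   the first whenever n < 32 instead of needing a chain of smaller
   stores for the tail. */
void zero_16_to_32(unsigned char *dst, size_t n) {
    static const unsigned char zeros[16] = {0};
    memcpy(dst, zeros, 16);           /* bytes [0, 16) */
    memcpy(dst + n - 16, zeros, 16);  /* bytes [n - 16, n) */
}
```

The overlap rewrites a few bytes twice, which is harmless for a fill and keeps the store count at two for any length in the range.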
Evgeny Astigeevich via llvm-dev
2017-Sep-08 15:22 UTC
[llvm-dev] Lowering llvm.memset for ARM target
Hi Bharathi,

From the discussion you provided I found that the issue happens for a big-endian ARM target. For the little-endian target the intrinsic in your test case is lowered to store instructions. Some debugging is needed to figure out why it's not happening for big-endian.

-Evgeny
Evgeny Astigeevich via llvm-dev
2017-Sep-11 13:27 UTC
[llvm-dev] Lowering llvm.memset for ARM target
Hi Bharathi,

'-mfpu=vfp' is the root cause of the problem. It means: VFPv2, disabled Advanced SIMD extension. The intrinsic is lowered into stores only when the Advanced SIMD extension is enabled. So, if your target supports the Advanced SIMD extension, the workaround is '-mfpu=neon'.

I'll check what is happening when the Advanced SIMD extension is disabled.

Thanks,
Evgeny

> -----Original Message-----
> From: Bharathi Seshadri [mailto:bharathi.seshadri at gmail.com]
> Sent: Friday, September 08, 2017 9:39 PM
> To: Evgeny Astigeevich
> Subject: Re: [llvm-dev] Lowering llvm.memset for ARM target
>
> Hi Evgeny,
>
> Even for a little-endian ARM target, I don't see that the intrinsic is lowered
> into stores. I checked with llvm38, llvm40 and a somewhat recent trunk
> (about a month old). I'm not sure what I'm missing.
>
> For my test case compiled using -O3 -c --target=arm-linux-gnueabi
> -march=armv8-a+crc -mfloat-abi=hard -no-integrated-as -mfpu=vfp, I get
>
> bash-4.1$ cat trymem2.c
> typedef struct {
>     int v0[100];
> } test;
> #define LIMIT 9
> void init(test *t)
> {
>     int i;
>     for (i = 0; i < LIMIT ; i++)
>         t->v0[i] = 0;
> }
> int main() {
>     test t;
>     init(&t);
>     return 0;
> }
>
> $objdump -d
> 00000000 <init>:
>    0: e92d4800 push {fp, lr}
>    4: e1a0b00d mov fp, sp
>    8: e3a01000 mov r1, #0
>    c: e3a02024 mov r2, #36 ; 0x24
>   10: ebfffffe bl 0 <memset>   <====== Call to memset
>   14: e8bd8800 pop {fp, pc}
> 00000018 <main>:
>   18: e3a00000 mov r0, #0
>   1c: e12fff1e bx lr
>
> With my patched clang to modify MaxStoresPerMemset for ARM to 16, I get
>
> Disassembly of section .text:
> 00000000 <init>:
>    0: e3a01000 mov r1, #0
>    4: e5801020 str r1, [r0, #32]
>    8: e5801004 str r1, [r0, #4]
>    c: e5801008 str r1, [r0, #8]
>   10: e580100c str r1, [r0, #12]
>   14: e5801010 str r1, [r0, #16]
>   18: e5801014 str r1, [r0, #20]
>   1c: e5801018 str r1, [r0, #24]
>   20: e580101c str r1, [r0, #28]
>   24: e5801000 str r1, [r0]
>   28: e12fff1e bx lr
> 0000002c <main>:
>   2c: e3a00000 mov r0, #0
>   30: e12fff1e bx lr
>
> Thanks,
> Bharathi
Evgeny Astigeevich via llvm-dev
2017-Oct-25 15:22 UTC
[llvm-dev] Lowering llvm.memset for ARM target
Hi Bharathi,

I did some debugging. The current problem is that the same threshold values are used for SIMD and non-SIMD memory instructions. For the test you provided, when the SIMD extension is disabled, the current implementation of llvm.memset lowering finds that 9 store instructions will be required. As 9 > 8, llvm.memset is lowered to a call. When SIMD is enabled, it finds that 3 stores (two vst1.32 + one str) will be enough. As 3 < 8, llvm.memset is lowered to a sequence of stores.

So before changing the threshold values we need to figure out:

1. Do we need separate thresholds for SIMD and non-SIMD memory instructions? For example, 8 for SIMD and 10-16 for non-SIMD. Some benchmarking is needed to find proper values.
2. If we keep the single value, by how much should it be increased? This might affect the performance of applications that use SIMD, so benchmarking is needed again.
3. If STRD instructions are used, then only 5 instructions are needed and llvm.memset is lowered as expected. I don't know why the variant with STRD is not considered, maybe to avoid register pressure.

Hope this helps.

Thanks,
Evgeny Astigeevich
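The store counts quoted above (9, 3 and 5) can be reproduced with a simplified model: cover the region greedily with the widest available store, then fall back to narrower power-of-two widths for the remainder. The sketch below is only that model, not LLVM's actual lowering logic, which also takes alignment, overlapping stores and target hooks into account. The function name count_stores is hypothetical.

```c
#include <stddef.h>

/* Count the stores needed to cover `bytes` bytes, using the widest store
   first and halving the width for whatever remains.
   `max_width` is the widest store in bytes and must be a power of two. */
unsigned count_stores(size_t bytes, size_t max_width) {
    unsigned n = 0;
    for (size_t w = max_width; w >= 1 && bytes > 0; w /= 2) {
        n += (unsigned)(bytes / w);  /* whole stores of width w */
        bytes %= w;                  /* remainder for narrower stores */
    }
    return n;
}
```

For the 36-byte memset in the test case, count_stores(36, 4) gives 9 (word stores only), count_stores(36, 16) gives 3 (two 16-byte SIMD stores plus one word store), and count_stores(36, 8) gives 5 (four doubleword stores plus one word store). Only the first exceeds the threshold of 8.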