Evgeny Astigeevich via llvm-dev
2017-Sep-11 13:27 UTC
[llvm-dev] Lowering llvm.memset for ARM target
Hi Bharathi, '-mfpu=vfp ' is the root cause of the problem. It means: VFPv2, disabled Advanced SIMD extension. The intrinsic is lowered into stores only when Advanced SIMD extension is enabled. So, if your target supports the Advanced SIMD extension, the workaround is '-mfpu=neon'. I'll check what is happening when the Advanced SIMD extension is disabled. Thanks, Evgeny> -----Original Message----- > From: Bharathi Seshadri [mailto:bharathi.seshadri at gmail.com] > Sent: Friday, September 08, 2017 9:39 PM > To: Evgeny Astigeevich > Subject: Re: [llvm-dev] Lowering llvm.memset for ARM target > > Hi Evgeny, > > Even for a litte-endian ARM target, I don't see that the intrinsic is lowered > into stores. I checked with llvm38, llvm40 and a somewhat recent trunk > (about a month old). I'm not sure what I'm missing. > > For my test case compiled using -O3 -c --target=arm-linux-gnueabi - > march=armv8-a+crc -mfloat-abi=hard -no-integrated-as -mfpu=vfp, I get > > bash-4.1$ cat trymem2.c > typedef struct { > int v0[100]; > } test; > #define LIMIT 9 > void init(test *t) > { > int i; > for (i = 0; i < LIMIT ; i++) > t->v0[i] = 0; > } > int main() { > test t; > init(&t); > return 0; > } > > > $objdump -d > 00000000 <init>: > 0: e92d4800 push {fp, lr} > 4: e1a0b00d mov fp, sp > 8: e3a01000 mov r1, #0 > c: e3a02024 mov r2, #36 ; 0x24 > 10: ebfffffe bl 0 <memset> <====== Call to memset > 14: e8bd8800 pop {fp, pc} > 00000018 <main>: > 18: e3a00000 mov r0, #0 > 1c: e12fff1e bx lr > > With my patched clang to modify the MaxMemsetPerStores for ARM to 16, I > get Disassembly of section .text: > 00000000 <init>: > 0: e3a01000 mov r1, #0 > 4: e5801020 str r1, [r0, #32] > 8: e5801004 str r1, [r0, #4] > c: e5801008 str r1, [r0, #8] > 10: e580100c str r1, [r0, #12] > 14: e5801010 str r1, [r0, #16] > 18: e5801014 str r1, [r0, #20] > 1c: e5801018 str r1, [r0, #24] > 20: e580101c str r1, [r0, #28] > 24: e5801000 str r1, [r0] > 28: e12fff1e bx lr > 0000002c <main>: > 2c: e3a00000 mov r0, #0 > 30: e12fff1e bx lr > > > Thanks, > > Bharathi > > On Fri, Sep 8, 2017 at 8:22 AM, Evgeny Astigeevich > <Evgeny.Astigeevich at arm.com> wrote: > > Hi Bharathi, > > > > From the discussion you provided I found that the issue happens for a big- > endian ARM target. > > For the little-endian target the intrinsic in your test case is lowered to store > instructions. > > Some debugging is needed to figure out why it's not happening for big- > endian. > > > > -Evgeny > > > >> -----Original Message----- > >> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of > >> Evgeny Astigeevich via llvm-dev > >> Sent: Thursday, September 07, 2017 4:25 PM > >> To: Bharathi Seshadri > >> Cc: llvm-dev; nd > >> Subject: Re: [llvm-dev] Lowering llvm.memset for ARM target > >> > >> Hi Bharathi, > >> > >> MaxStoresPerMemset was changed from 16 to 8 in r 169791. The commit > >> comment: > >> > >> "Some enhancements for memcpy / memset inline expansion. > >> 1. Teach it to use overlapping unaligned load / store to copy / set the > trailing > >> bytes. e.g. On 86, use two pairs of movups / movaps for 17 - 31 byte > copies. > >> 2. Use f64 for memcpy / memset on targets where i64 is not legal but f64 > is. > >> e.g. > >> x86 and ARM. > >> 3. When memcpy from a constant string, do *not* replace the load with > >> a constant > >> if it's not possible to materialize an integer immediate with a single > >> instruction (required a new target hook: TLI.isIntImmLegal()). > >> 4. Use unaligned load / stores more aggressively if target hooks > >> indicates they > >> are "fast". > >> 5. Update ARM target hooks to use unaligned load / stores. e.g. > >> vld1.8 / vst1.8. > >> Also increase the threshold to something reasonable (8 for memset, 4 > pairs > >> for memcpy). > >> > >> This significantly improves Dhrystone, up to 50% on ARM iOS devices. > >> > >> rdar://12760078" > >> > >> It's strange. According to the comment the threshold was increased > >> but it is decreased. I think the code needs to be revisited and > benchmarked. > >> I'll do some benchmarking. > >> > >> Thanks, > >> Evgeny Astigeevich | Arm Compiler Optimization Team Lead > >> > >> > >> > -----Original Message----- > >> > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf > >> > Of Bharathi Seshadri via llvm-dev > >> > Sent: Tuesday, September 05, 2017 8:24 PM > >> > To: llvm-dev at lists.llvm.org > >> > Subject: [llvm-dev] Lowering llvm.memset for ARM target > >> > > >> > As reported in an earlier thread > >> > (http://clang-developers.42468.n3.nabble.com/Disable-memset- > synthes > >> > is- > >> > tp4057810.html), > >> > we noticed in some cases that the llvm.memset intrinsic, if lowered > >> > to stores, could help with performance. > >> > > >> > Here's a test case: If LIMIT is > 8, I see that a call to memset is > >> > emitted for arm & aarch64, but not for x86 target. > >> > > >> > typedef struct { > >> > int v0[100]; > >> > } test; > >> > #define LIMIT 9 > >> > void init(test *t) > >> > { > >> > int i; > >> > for (i = 0; i < LIMIT ; i++) > >> > t->v0[i] = 0; > >> > } > >> > int main() { > >> > test t; > >> > init(&t); > >> > return 0; > >> > } > >> > > >> > Looking at the llvm sources, I see that there are two key target > >> > specific variables, MaxStoresPerMemset and > >> MaxStoresPerMemsetOptSize, > >> > that determine if the intrinsic llvm.memset can be lowered into > >> > store > >> operations. > >> > For ARM, these variables are set to 8 and 4 respectively. > >> > > >> > I do not know as to how the default values for these two variables > >> > are arrived at, but doubling these values (similar to that for the > >> > x86 > >> > target) seems to help our case and we observe a 7% increase in > >> > performance of our networking application. We use -O3 and -flto and > >> > 32-bit > >> arm. > >> > > >> > I can prepare a patch and post for review if such a change, say > >> > under CodeGenOpt::Aggressive would be acceptable. > >> > > >> > Thanks, > >> > Bharathi > >> > _______________________________________________ > >> > LLVM Developers mailing list > >> > llvm-dev at lists.llvm.org > >> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > >> _______________________________________________ > >> LLVM Developers mailing list > >> llvm-dev at lists.llvm.org > >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Friedman, Eli via llvm-dev
2017-Oct-25 18:48 UTC
[llvm-dev] Lowering llvm.memset for ARM target
On 10/25/2017 8:22 AM, Evgeny Astigeevich via llvm-dev wrote:> Hi Bharathi, > > I did some debugging, the current problem is that the same threshold values are used for SIMD and non-SIMD memory instructions. > For the test you provided, when the SIMD extension is disabled the current implementation of llvm.memset lowering finds out 9 store instructions will be required. As 9 > 8 llvm.memset is lowered to a call. > When the SIMD is enabled, it finds out 3 stores (two vst1.32 + one str) will be enough. As 3 < 8 llvm.memset is lowered to a sequence of stores. > > So before changing the threshold values we need to figure out: > > 1. Do we need separate thresholds for SIMD and non-SIMD memory instructions? For example, 8 for SIMD and 10-16 for non-SIMD. Some benchmarking is needed to find proper value. > 2. If we keep the single value, how much it should be increased? This might affect performance of SIMD using applications. So benchmarking again. > 3. If STRD are used then only 5 instructions are needed and llvm.memset is lowered as expected. I don’t know why the variant with STRD is not considered, maybe to avoid register pressure.We probably do want to lower memset to strd/stm when we don't have NEON. Patch welcome. :) (Note that in Thumb2, we often form strd anyway in ARMLoadStoreOptimizer, but we don't do it reliably, and we probably want to do something similar in ARM, and maybe Thumb1.) -Eli -- Employee of Qualcomm Innovation Center, Inc. Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project