Miao Xie
2010-Sep-02 05:46 UTC
[PATCH V2 1/3] lib: introduce some memory copy macros and functions
Changes from V1 to V2:
- change the version of GPL from version 2.1 to version 2

The kernel's memcpy and memmove are very inefficient. But the glibc
version is quite fast, in some cases it is 10 times faster than the
kernel version. So I introduce some memory copy macros and functions
from glibc to improve the performance of the kernel version.

The strategy of the memory functions is:

1. Copy bytes until the destination pointer is aligned.

2. Copy words in unrolled loops. If the source and destination are not
   aligned in the same way, use word memory operations, but shift and
   merge two read words before writing.

3. Copy the few remaining bytes.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 include/linux/memcopy.h |  226 ++++++++++++++++++++++++++
 lib/Makefile            |    3 +-
 lib/memcopy.c           |  403 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 631 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/memcopy.h
 create mode 100644 lib/memcopy.c

diff --git a/include/linux/memcopy.h b/include/linux/memcopy.h
new file mode 100644
index 0000000..9c65ac8
--- /dev/null
+++ b/include/linux/memcopy.h
@@ -0,0 +1,226 @@
+/*
+ * memcopy.h -- definitions for memory copy functions. Generic C version.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
+ * Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * The code is derived from the GNU C Library.
+ * Copyright (C) 1991, 1992, 1993, 1997, 2004 Free Software Foundation, Inc.
+ */
+#ifndef _LINUX_MEMCOPY_H_
+#define _LINUX_MEMCOPY_H_
+
+/*
+ * The strategy of the memory functions is:
+ *
+ *   1. Copy bytes until the destination pointer is aligned.
+ *
+ *   2. Copy words in unrolled loops.  If the source and destination
+ *      are not aligned in the same way, use word memory operations,
+ *      but shift and merge two read words before writing.
+ *
+ *   3. Copy the few remaining bytes.
+ *
+ * This is fast on processors that have at least 10 registers for
+ * allocation by GCC, and that can access memory at reg+const in one
+ * instruction.
+ */
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+#include <asm/byteorder.h>
+
+/*
+ * The macros defined in this file are:
+ *
+ * BYTE_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_to_copy)
+ *
+ * BYTE_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_to_copy)
+ *
+ * WORD_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_remaining, nbytes_to_copy)
+ *
+ * WORD_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_remaining, nbytes_to_copy)
+ *
+ * MERGE(old_word, sh_1, new_word, sh_2)
+ *
+ * MEM_COPY_FWD(dst_beg_ptr, src_beg_ptr, nbytes_to_copy)
+ *
+ * MEM_COPY_BWD(dst_end_ptr, src_end_ptr, nbytes_to_copy)
+ */
+
+#define OP_T_THRESHOLD	16
+
+/*
+ * Type to use for aligned memory operations.
+ * This should normally be the biggest type supported by a single load
+ * and store.
+ */
+#define op_t	unsigned long int
+#define OPSIZ	(sizeof(op_t))
+
+/* Type to use for unaligned operations. */
+typedef unsigned char byte;
+
+#ifndef MERGE
+# ifdef __LITTLE_ENDIAN
+#  define MERGE(w0, sh_1, w1, sh_2) (((w0) >> (sh_1)) | ((w1) << (sh_2)))
+# elif defined(__BIG_ENDIAN)
+#  define MERGE(w0, sh_1, w1, sh_2) (((w0) << (sh_1)) | ((w1) >> (sh_2)))
+# else
+#  error "Macro MERGE() hasn't been defined!"
+# endif
+#endif
+
+/*
+ * Copy exactly NBYTES bytes from SRC_BP to DST_BP,
+ * without any assumptions about alignment of the pointers.
+ */
+#ifndef BYTE_COPY_FWD
+#define BYTE_COPY_FWD(dst_bp, src_bp, nbytes)			\
+do {								\
+	size_t __nbytes = (nbytes);				\
+	while (__nbytes > 0) {					\
+		byte __x = ((byte *) src_bp)[0];		\
+		src_bp += 1;					\
+		__nbytes -= 1;					\
+		((byte *) dst_bp)[0] = __x;			\
+		dst_bp += 1;					\
+	}							\
+} while (0)
+#endif
+
+/*
+ * Copy exactly NBYTES_TO_COPY bytes from SRC_END_PTR to DST_END_PTR,
+ * beginning at the bytes right before the pointers and continuing towards
+ * smaller addresses.  Don't assume anything about alignment of the
+ * pointers.
+ */
+#ifndef BYTE_COPY_BWD
+#define BYTE_COPY_BWD(dst_ep, src_ep, nbytes)			\
+do {								\
+	size_t __nbytes = (nbytes);				\
+	while (__nbytes > 0) {					\
+		byte __x;					\
+		src_ep -= 1;					\
+		__x = ((byte *) src_ep)[0];			\
+		dst_ep -= 1;					\
+		__nbytes -= 1;					\
+		((byte *) dst_ep)[0] = __x;			\
+	}							\
+} while (0)
+#endif
+
+/*
+ * Copy *up to* NBYTES bytes from SRC_BP to DST_BP, with
+ * the assumption that DST_BP is aligned on an OPSIZ multiple.  If
+ * not all bytes could be easily copied, store remaining number of bytes
+ * in NBYTES_LEFT, otherwise store 0.
+ */
+extern void _wordcopy_fwd_aligned(long int, long int, size_t);
+extern void _wordcopy_fwd_dest_aligned(long int, long int, size_t);
+#ifndef WORD_COPY_FWD
+#define WORD_COPY_FWD(dst_bp, src_bp, nbytes_left, nbytes)		\
+do {									\
+	if (src_bp % OPSIZ == 0)					\
+		_wordcopy_fwd_aligned (dst_bp, src_bp, (nbytes) / OPSIZ); \
+	else								\
+		_wordcopy_fwd_dest_aligned (dst_bp, src_bp, (nbytes) / OPSIZ);\
+									\
+	src_bp += (nbytes) & -OPSIZ;					\
+	dst_bp += (nbytes) & -OPSIZ;					\
+	(nbytes_left) = (nbytes) % OPSIZ;				\
+} while (0)
+#endif
+
+/*
+ * Copy *up to* NBYTES_TO_COPY bytes from SRC_END_PTR to DST_END_PTR,
+ * beginning at the words (of type op_t) right before the pointers and
+ * continuing towards smaller addresses.  May take advantage of that
+ * DST_END_PTR is aligned on an OPSIZ multiple.  If not all bytes could be
+ * easily copied, store remaining number of bytes in NBYTES_REMAINING,
+ * otherwise store 0.
+ */
+extern void _wordcopy_bwd_aligned(long int, long int, size_t);
+extern void _wordcopy_bwd_dest_aligned(long int, long int, size_t);
+#ifndef WORD_COPY_BWD
+#define WORD_COPY_BWD(dst_ep, src_ep, nbytes_left, nbytes)		\
+do {									\
+	if (src_ep % OPSIZ == 0)					\
+		_wordcopy_bwd_aligned (dst_ep, src_ep, (nbytes) / OPSIZ); \
+	else								\
+		_wordcopy_bwd_dest_aligned (dst_ep, src_ep, (nbytes) / OPSIZ);\
+									\
+	src_ep -= (nbytes) & -OPSIZ;					\
+	dst_ep -= (nbytes) & -OPSIZ;					\
+	(nbytes_left) = (nbytes) % OPSIZ;				\
+} while (0)
+#endif
+
+/* Copy memory from the beginning to the end. */
+#ifndef MEM_COPY_FWD
+static __always_inline void mem_copy_fwd(unsigned long dstp,
+					 unsigned long srcp,
+					 size_t count)
+{
+	/* If there are not too few bytes to copy, use word copy. */
+	if (count >= OP_T_THRESHOLD) {
+		/* Copy just a few bytes to make dstp aligned. */
+		count -= (-dstp) % OPSIZ;
+		BYTE_COPY_FWD(dstp, srcp, (-dstp) % OPSIZ);
+
+		/*
+		 * Copy from srcp to dstp taking advantage of the known
+		 * alignment of dstp.  The number of bytes remaining is put
+		 * in the third argument.
+		 */
+		WORD_COPY_FWD(dstp, srcp, count, count);
+
+		/* Fall out and copy the tail. */
+	}
+
+	/* There are just a few bytes to copy.  Use byte memory operations. */
+	BYTE_COPY_FWD(dstp, srcp, count);
+}
+#endif
+
+/* Copy memory from the end to the beginning. */
+#ifndef MEM_COPY_BWD
+static __always_inline void mem_copy_bwd(unsigned long dstp,
+					 unsigned long srcp,
+					 size_t count)
+{
+	srcp += count;
+	dstp += count;
+
+	/* If there are not too few bytes to copy, use word copy. */
+	if (count >= OP_T_THRESHOLD) {
+		/* Copy just a few bytes to make dstp aligned. */
+		count -= dstp % OPSIZ;
+		BYTE_COPY_BWD(dstp, srcp, dstp % OPSIZ);
+
+		/*
+		 * Copy from srcp to dstp taking advantage of the known
+		 * alignment of dstp.  The number of bytes remaining is put
+		 * in the third argument.
+		 */
+		WORD_COPY_BWD(dstp, srcp, count, count);
+
+		/* Fall out and copy the tail. */
+	}
+
+	/* There are just a few bytes to copy.  Use byte memory operations. */
+	BYTE_COPY_BWD(dstp, srcp, count);
+}
+#endif
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index e6a3763..10a319e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,8 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 idr.o int_sqrt.o extable.o prio_tree.o \
 	 sha1.o irq_regs.o reciprocal_div.o argv_split.o \
 	 proportions.o prio_heap.o ratelimit.o show_mem.o \
-	 is_single_threaded.o plist.o decompress.o flex_array.o
+	 is_single_threaded.o plist.o decompress.o flex_array.o \
+	 memcopy.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/memcopy.c b/lib/memcopy.c
new file mode 100644
index 0000000..70fb6b2
--- /dev/null
+++ b/lib/memcopy.c
@@ -0,0 +1,403 @@
+/*
+ * memcopy.c -- subroutines for memory copy functions.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
+ * Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * The code is derived from the GNU C Library.
+ * Copyright (C) 1991, 1992, 1993, 1997, 2004 Free Software Foundation, Inc.
+ */
+
+/* BE VERY CAREFUL IF YOU CHANGE THIS CODE...! */
+
+#include <linux/memcopy.h>
+
+/*
+ * _wordcopy_fwd_aligned -- Copy block beginning at SRCP to block beginning
+ * at DSTP with LEN `op_t' words (not LEN bytes!).
+ * Both SRCP and DSTP should be aligned for memory operations on `op_t's.
+ */
+void _wordcopy_fwd_aligned (long int dstp, long int srcp, size_t len)
+{
+	op_t a0, a1;
+
+	switch (len % 8) {
+	case 2:
+		a0 = ((op_t *) srcp)[0];
+		srcp -= 6 * OPSIZ;
+		dstp -= 7 * OPSIZ;
+		len += 6;
+		goto do1;
+	case 3:
+		a1 = ((op_t *) srcp)[0];
+		srcp -= 5 * OPSIZ;
+		dstp -= 6 * OPSIZ;
+		len += 5;
+		goto do2;
+	case 4:
+		a0 = ((op_t *) srcp)[0];
+		srcp -= 4 * OPSIZ;
+		dstp -= 5 * OPSIZ;
+		len += 4;
+		goto do3;
+	case 5:
+		a1 = ((op_t *) srcp)[0];
+		srcp -= 3 * OPSIZ;
+		dstp -= 4 * OPSIZ;
+		len += 3;
+		goto do4;
+	case 6:
+		a0 = ((op_t *) srcp)[0];
+		srcp -= 2 * OPSIZ;
+		dstp -= 3 * OPSIZ;
+		len += 2;
+		goto do5;
+	case 7:
+		a1 = ((op_t *) srcp)[0];
+		srcp -= 1 * OPSIZ;
+		dstp -= 2 * OPSIZ;
+		len += 1;
+		goto do6;
+	case 0:
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			return;
+		a0 = ((op_t *) srcp)[0];
+		srcp -= 0 * OPSIZ;
+		dstp -= 1 * OPSIZ;
+		goto do7;
+	case 1:
+		a1 = ((op_t *) srcp)[0];
+		srcp -=-1 * OPSIZ;
+		dstp -= 0 * OPSIZ;
+		len -= 1;
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			goto do0;
+		goto do8;	/* No-op. */
+	}
+
+	do {
+do8:
+		a0 = ((op_t *) srcp)[0];
+		((op_t *) dstp)[0] = a1;
+do7:
+		a1 = ((op_t *) srcp)[1];
+		((op_t *) dstp)[1] = a0;
+do6:
+		a0 = ((op_t *) srcp)[2];
+		((op_t *) dstp)[2] = a1;
+do5:
+		a1 = ((op_t *) srcp)[3];
+		((op_t *) dstp)[3] = a0;
+do4:
+		a0 = ((op_t *) srcp)[4];
+		((op_t *) dstp)[4] = a1;
+do3:
+		a1 = ((op_t *) srcp)[5];
+		((op_t *) dstp)[5] = a0;
+do2:
+		a0 = ((op_t *) srcp)[6];
+		((op_t *) dstp)[6] = a1;
+do1:
+		a1 = ((op_t *) srcp)[7];
+		((op_t *) dstp)[7] = a0;
+
+		srcp += 8 * OPSIZ;
+		dstp += 8 * OPSIZ;
+		len -= 8;
+	} while (len != 0);
+
+	/*
+	 * This is the right position for do0.  Please don't move it into
+	 * the loop.
+	 */
+do0:
+	((op_t *) dstp)[0] = a1;
+}
+
+/*
+ * _wordcopy_fwd_dest_aligned -- Copy block beginning at SRCP to block
+ * beginning at DSTP with LEN `op_t' words (not LEN bytes!).  DSTP should
+ * be aligned for memory operations on `op_t's, but SRCP must *not* be
+ * aligned.
+ */
+void _wordcopy_fwd_dest_aligned (long int dstp, long int srcp, size_t len)
+{
+	op_t a0, a1, a2, a3;
+	int sh_1, sh_2;
+
+	/*
+	 * Calculate how to shift a word read at the memory operation aligned
+	 * srcp to make it aligned for copy.
+	 */
+	sh_1 = 8 * (srcp % OPSIZ);
+	sh_2 = 8 * OPSIZ - sh_1;
+
+	/*
+	 * Make SRCP aligned by rounding it down to the beginning of the
+	 * `op_t' it points in the middle of.
+	 */
+	srcp &= -OPSIZ;
+
+	switch (len % 4) {
+	case 2:
+		a1 = ((op_t *) srcp)[0];
+		a2 = ((op_t *) srcp)[1];
+		srcp -= 1 * OPSIZ;
+		dstp -= 3 * OPSIZ;
+		len += 2;
+		goto do1;
+	case 3:
+		a0 = ((op_t *) srcp)[0];
+		a1 = ((op_t *) srcp)[1];
+		srcp -= 0 * OPSIZ;
+		dstp -= 2 * OPSIZ;
+		len += 1;
+		goto do2;
+	case 0:
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			return;
+		a3 = ((op_t *) srcp)[0];
+		a0 = ((op_t *) srcp)[1];
+		srcp -=-1 * OPSIZ;
+		dstp -= 1 * OPSIZ;
+		len += 0;
+		goto do3;
+	case 1:
+		a2 = ((op_t *) srcp)[0];
+		a3 = ((op_t *) srcp)[1];
+		srcp -=-2 * OPSIZ;
+		dstp -= 0 * OPSIZ;
+		len -= 1;
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			goto do0;
+		goto do4;	/* No-op. */
+	}
+
+	do {
+do4:
+		a0 = ((op_t *) srcp)[0];
+		((op_t *) dstp)[0] = MERGE (a2, sh_1, a3, sh_2);
+do3:
+		a1 = ((op_t *) srcp)[1];
+		((op_t *) dstp)[1] = MERGE (a3, sh_1, a0, sh_2);
+do2:
+		a2 = ((op_t *) srcp)[2];
+		((op_t *) dstp)[2] = MERGE (a0, sh_1, a1, sh_2);
+do1:
+		a3 = ((op_t *) srcp)[3];
+		((op_t *) dstp)[3] = MERGE (a1, sh_1, a2, sh_2);
+
+		srcp += 4 * OPSIZ;
+		dstp += 4 * OPSIZ;
+		len -= 4;
+	} while (len != 0);
+
+	/*
+	 * This is the right position for do0.  Please don't move it into
+	 * the loop.
+	 */
+do0:
+	((op_t *) dstp)[0] = MERGE (a2, sh_1, a3, sh_2);
+}
+
+/*
+ * _wordcopy_bwd_aligned -- Copy block finishing right before
+ * SRCP to block finishing right before DSTP with LEN `op_t' words (not LEN
+ * bytes!).  Both SRCP and DSTP should be aligned for memory operations
+ * on `op_t's.
+ */
+void _wordcopy_bwd_aligned (long int dstp, long int srcp, size_t len)
+{
+	op_t a0, a1;
+
+	switch (len % 8) {
+	case 2:
+		srcp -= 2 * OPSIZ;
+		dstp -= 1 * OPSIZ;
+		a0 = ((op_t *) srcp)[1];
+		len += 6;
+		goto do1;
+	case 3:
+		srcp -= 3 * OPSIZ;
+		dstp -= 2 * OPSIZ;
+		a1 = ((op_t *) srcp)[2];
+		len += 5;
+		goto do2;
+	case 4:
+		srcp -= 4 * OPSIZ;
+		dstp -= 3 * OPSIZ;
+		a0 = ((op_t *) srcp)[3];
+		len += 4;
+		goto do3;
+	case 5:
+		srcp -= 5 * OPSIZ;
+		dstp -= 4 * OPSIZ;
+		a1 = ((op_t *) srcp)[4];
+		len += 3;
+		goto do4;
+	case 6:
+		srcp -= 6 * OPSIZ;
+		dstp -= 5 * OPSIZ;
+		a0 = ((op_t *) srcp)[5];
+		len += 2;
+		goto do5;
+	case 7:
+		srcp -= 7 * OPSIZ;
+		dstp -= 6 * OPSIZ;
+		a1 = ((op_t *) srcp)[6];
+		len += 1;
+		goto do6;
+	case 0:
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			return;
+		srcp -= 8 * OPSIZ;
+		dstp -= 7 * OPSIZ;
+		a0 = ((op_t *) srcp)[7];
+		goto do7;
+	case 1:
+		srcp -= 9 * OPSIZ;
+		dstp -= 8 * OPSIZ;
+		a1 = ((op_t *) srcp)[8];
+		len -= 1;
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			goto do0;
+		goto do8;	/* No-op. */
+	}
+
+	do {
+do8:
+		a0 = ((op_t *) srcp)[7];
+		((op_t *) dstp)[7] = a1;
+do7:
+		a1 = ((op_t *) srcp)[6];
+		((op_t *) dstp)[6] = a0;
+do6:
+		a0 = ((op_t *) srcp)[5];
+		((op_t *) dstp)[5] = a1;
+do5:
+		a1 = ((op_t *) srcp)[4];
+		((op_t *) dstp)[4] = a0;
+do4:
+		a0 = ((op_t *) srcp)[3];
+		((op_t *) dstp)[3] = a1;
+do3:
+		a1 = ((op_t *) srcp)[2];
+		((op_t *) dstp)[2] = a0;
+do2:
+		a0 = ((op_t *) srcp)[1];
+		((op_t *) dstp)[1] = a1;
+do1:
+		a1 = ((op_t *) srcp)[0];
+		((op_t *) dstp)[0] = a0;
+
+		srcp -= 8 * OPSIZ;
+		dstp -= 8 * OPSIZ;
+		len -= 8;
+	} while (len != 0);
+
+	/*
+	 * This is the right position for do0.  Please don't move it into
+	 * the loop.
+	 */
+do0:
+	((op_t *) dstp)[7] = a1;
+}
+
+/*
+ * _wordcopy_bwd_dest_aligned -- Copy block finishing right before SRCP to
+ * block finishing right before DSTP with LEN `op_t' words (not LEN bytes!).
+ * DSTP should be aligned for memory operations on `op_t', but SRCP must
+ * *not* be aligned.
+ */
+void _wordcopy_bwd_dest_aligned (long int dstp, long int srcp, size_t len)
+{
+	op_t a0, a1, a2, a3;
+	int sh_1, sh_2;
+
+	/*
+	 * Calculate how to shift a word read at the memory operation aligned
+	 * srcp to make it aligned for copy.
+	 */
+
+	sh_1 = 8 * (srcp % OPSIZ);
+	sh_2 = 8 * OPSIZ - sh_1;
+
+	/*
+	 * Make srcp aligned by rounding it down to the beginning of the op_t
+	 * it points in the middle of.
+	 */
+	srcp &= -OPSIZ;
+	srcp += OPSIZ;
+
+	switch (len % 4) {
+	case 2:
+		srcp -= 3 * OPSIZ;
+		dstp -= 1 * OPSIZ;
+		a2 = ((op_t *) srcp)[2];
+		a1 = ((op_t *) srcp)[1];
+		len += 2;
+		goto do1;
+	case 3:
+		srcp -= 4 * OPSIZ;
+		dstp -= 2 * OPSIZ;
+		a3 = ((op_t *) srcp)[3];
+		a2 = ((op_t *) srcp)[2];
+		len += 1;
+		goto do2;
+	case 0:
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			return;
+		srcp -= 5 * OPSIZ;
+		dstp -= 3 * OPSIZ;
+		a0 = ((op_t *) srcp)[4];
+		a3 = ((op_t *) srcp)[3];
+		goto do3;
+	case 1:
+		srcp -= 6 * OPSIZ;
+		dstp -= 4 * OPSIZ;
+		a1 = ((op_t *) srcp)[5];
+		a0 = ((op_t *) srcp)[4];
+		len -= 1;
+		if (OP_T_THRESHOLD <= 3 * OPSIZ && len == 0)
+			goto do0;
+		goto do4;	/* No-op. */
+	}
+
+	do {
+do4:
+		a3 = ((op_t *) srcp)[3];
+		((op_t *) dstp)[3] = MERGE (a0, sh_1, a1, sh_2);
+do3:
+		a2 = ((op_t *) srcp)[2];
+		((op_t *) dstp)[2] = MERGE (a3, sh_1, a0, sh_2);
+do2:
+		a1 = ((op_t *) srcp)[1];
+		((op_t *) dstp)[1] = MERGE (a2, sh_1, a3, sh_2);
+do1:
+		a0 = ((op_t *) srcp)[0];
+		((op_t *) dstp)[0] = MERGE (a1, sh_1, a2, sh_2);
+
+		srcp -= 4 * OPSIZ;
+		dstp -= 4 * OPSIZ;
+		len -= 4;
+	} while (len != 0);
+
+	/*
+	 * This is the right position for do0.  Please don't move it into
+	 * the loop.
+	 */
+do0:
+	((op_t *) dstp)[3] = MERGE (a0, sh_1, a1, sh_2);
+}
--
1.7.0.1
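[To make the shift-and-merge step concrete, here is a worked example of the
patch's little-endian MERGE() macro. This is an illustration added for
clarity, not part of the patch: a standalone userspace program, assuming an
8-byte op_t, with MERGE_LE mirroring the patch's __LITTLE_ENDIAN definition.]

/*
 * Worked example of MERGE() on little-endian with 8-byte op_t.
 * Copying from a source offset 3 bytes into a word (srcp % OPSIZ == 3):
 * each aligned destination word is assembled from the top 5 bytes of
 * one source word and the bottom 3 bytes of the next.
 */
#include <stdint.h>
#include <stdio.h>

#define MERGE_LE(w0, sh_1, w1, sh_2) (((w0) >> (sh_1)) | ((w1) << (sh_2)))

int main(void)
{
	uint64_t w0 = 0x1716151413121110ULL; /* word at the rounded-down srcp */
	uint64_t w1 = 0x1f1e1d1c1b1a1918ULL; /* the next source word          */
	int sh_1 = 8 * 3;                    /* srcp % OPSIZ == 3 -> 24 bits  */
	int sh_2 = 64 - sh_1;

	/* Prints 0x1a19181716151413: in memory, bytes 13 14 15 16 17 18 19 1a,
	 * i.e. exactly the 8 source bytes starting at offset 3. */
	printf("%#llx\n", (unsigned long long)MERGE_LE(w0, sh_1, w1, sh_2));
	return 0;
}

With srcp % OPSIZ == 3, each pass through the do4..do1 loop in
_wordcopy_fwd_dest_aligned produces four such merged words while doing only
aligned loads and stores.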
Andi Kleen
2010-Sep-02 08:55 UTC
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
Miao Xie <miaox@cn.fujitsu.com> writes:

> Changes from V1 to V2:
> - change the version of GPL from version 2.1 to version 2
>
> The kernel's memcpy and memmove are very inefficient. But the glibc
> version is quite fast, in some cases it is 10 times faster than the
> kernel version. So I

Can you elaborate on which CPUs and with what workloads you measured that?

The kernel memcpy is optimized for copies smaller than a page size, for
example (the kernel very rarely does anything larger than 4k); the glibc
version isn't. There are various other differences.

memcpy and memmove are very different. AFAIK no one has tried
to optimize memmove() before, because traditionally it wasn't
used for anything performance critical in the kernel. Has that
changed? memcpy on the other hand, while not perfect,
is actually quite optimized for typical workloads.

One big difference between the kernel and glibc is that the kernel
is often cache cold, so e.g. the cost of a very large code footprint
memcpy/memset is harder to amortize.

Microbenchmarks often leave out that crucial variable.

I have some systemtap scripts to measure size/alignment distributions of
copies on a kernel; if you have a particular workload you're interested
in, those could be tried.

Just copying the glibc bloat uncritically is very likely
the wrong move, at least.

-Andi

--
ak@linux.intel.com -- Speaking for myself only.
Miao Xie
2010-Sep-02 10:11 UTC
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
On Thu, 02 Sep 2010 10:55:58 +0200, Andi Kleen wrote:
> Miao Xie <miaox@cn.fujitsu.com> writes:
>
>> Changes from V1 to V2:
>> - change the version of GPL from version 2.1 to version 2
>>
>> The kernel's memcpy and memmove are very inefficient. But the glibc
>> version is quite fast, in some cases it is 10 times faster than the
>> kernel version. So I
>
> Can you elaborate on which CPUs and with what workloads you measured that?

I did this test on an x86_64 box with 4 cores, and the workload was quite
light: I just did a 500-byte copy 5,000,000 times. The attached file is my
test program.

> The kernel memcpy is optimized for copies smaller than a page size, for
> example (the kernel very rarely does anything larger than 4k); the glibc
> version isn't. There are various other differences.
>
> memcpy and memmove are very different. AFAIK no one has tried
> to optimize memmove() before, because traditionally it wasn't
> used for anything performance critical in the kernel. Has that
> changed? memcpy on the other hand, while not perfect,
> is actually quite optimized for typical workloads.

Yes, the performance of memcpy is good on most architectures, but some of
the memmove implementations do a byte-by-byte copy, which is quite
inefficient. Unfortunately those memmoves are used to modify the metadata
of some filesystems, such as btrfs. That is, memmove is important for the
performance of those filesystems.

So I improved the generic versions of memcpy and memmove, and x86_64's
memmove, which were implemented by byte copy.

> One big difference between the kernel and glibc is that the kernel
> is often cache cold, so e.g. the cost of a very large code footprint
> memcpy/memset is harder to amortize.
>
> Microbenchmarks often leave out that crucial variable.
>
> I have some systemtap scripts to measure size/alignment distributions of
> copies on a kernel; if you have a particular workload you're interested
> in, those could be tried.

Good! Could you give me these scripts?

> Just copying the glibc bloat uncritically is very likely
> the wrong move, at least.

Agree!

Thanks!
Miao
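[The attached test program is not reproduced in the archive. A harness
along these lines would match the description above; this is a hypothetical
reconstruction, not the actual attachment, with an asm barrier added so the
compiler cannot optimize the copies away.]

/* Hypothetical sketch of the described test: a 500-byte copy, 5,000,000
 * times, timed with gettimeofday(). */
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

int main(void)
{
	static char src[512], dst[512];
	struct timeval start, end;
	long i, usec;

	gettimeofday(&start, NULL);
	for (i = 0; i < 5000000; i++) {
		memcpy(dst, src, 500);
		/* Compiler barrier: keep the copy from being elided. */
		asm volatile("" : : "r" (dst) : "memory");
	}
	gettimeofday(&end, NULL);

	usec = (end.tv_sec - start.tv_sec) * 1000000L +
	       (end.tv_usec - start.tv_usec);
	printf("%ld us total\n", usec);
	return 0;
}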
Andi Kleen
2010-Sep-02 10:40 UTC
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
> So I improved the generic versions of memcpy and memmove, and x86_64's
> memmove, which were implemented by byte copy.

One should also add that most memmove()s and memcpy()s are actually
generated by gcc as inlines (especially if you don't use the
"make my code slow" option, aka -Os) and don't use the fallback.
The fallback depends on the gcc version and on whether gcc thinks the
data is aligned or not.

Sometimes one can get better code in the caller by making sure
gcc knows the correct alignment (e.g. with suitable
types) and size. This might be worth looking at for btrfs
if it's really that memmove heavy.

>> I have some systemtap scripts to measure size/alignment distributions of
>> copies on a kernel; if you have a particular workload you're interested
>> in, those could be tried.
>
> Good! Could you give me these scripts?

ftp://firstfloor.org/pub/ak/probes/csum.m4

You need to run them through m4 first.
They don't measure memmove, but that should be easy to add.

-Andi
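[A small example of the alignment point above, added for illustration; the
struct type is made up, and the exact generated code naturally depends on
the gcc version and target. When the size is a compile-time constant and
the type tells gcc the alignment, the copy is inlined as a few word moves
instead of becoming a call to the out-of-line fallback.]

#include <string.h>

/* Hypothetical 32-byte, long-aligned item type. */
struct item {
	unsigned long words[4];
};

void copy_item(struct item *dst, const struct item *src)
{
	/* Known size and alignment: gcc can emit four word moves inline. */
	*dst = *src;
}

void copy_blob(void *dst, const void *src, size_t n)
{
	/* Unknown size and alignment: typically an out-of-line memcpy call. */
	memcpy(dst, src, n);
}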
Miao Xie
2010-Sep-08 11:05 UTC
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
Hi, Andi

On Thu, 2 Sep 2010 12:40:08 +0200, Andi Kleen wrote:
>> So I improved the generic versions of memcpy and memmove, and x86_64's
>> memmove, which were implemented by byte copy.
>
> One should also add that most memmove()s and memcpy()s are actually
> generated by gcc as inlines (especially if you don't use the
> "make my code slow" option, aka -Os) and don't use the fallback.
> The fallback depends on the gcc version and on whether gcc thinks the
> data is aligned or not.
>
> Sometimes one can get better code in the caller by making sure
> gcc knows the correct alignment (e.g. with suitable
> types) and size. This might be worth looking at for btrfs
> if it's really that memmove heavy.

Right! But the src and dest addresses are not fixed, so it is hard to tell
gcc whether the addresses are aligned or not.

The problem is that memmove is in fact very inefficient, and it is used on
some fast paths, for example to modify the metadata of filesystems, so we
must improve it.

>>> I have some systemtap scripts to measure size/alignment distributions of
>>> copies on a kernel; if you have a particular workload you're interested
>>> in, those could be tried.
>>
>> Good! Could you give me these scripts?
>
> ftp://firstfloor.org/pub/ak/probes/csum.m4
>
> You need to run them through m4 first.
> They don't measure memmove, but that should be easy to add.

I used your script to measure size/alignment distributions of copies on a
kernel while running the btrfs test, and got some data:

memmove total 325903

length
 value |-------------------------------------------------- count
     1 |                                                        0
     2 |                                                        0
     4 |                                                        3
     8 |                                                        0
    16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        57062
    32 |@@                                                   5903
    64 |@@@                                                  7296
   128 |@@@@@@@@@                                           18868
   256 |@@@@@@@@@@@@@@@@@                                   33790
   512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                    64886
  1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 98450
  2048 |@@@@@@@@@@@@@@@@@@@@                                39645
  4096 |                                                        0
  8192 |                                                        0

length upto 50
 value |-------------------------------------------------- count
     2 |                                                        0
     3 |                                                        0
     4 |                                                        3
     5 |                                                        0
     6 |                                                        0
     ~
    21 |                                                        0
    22 |                                                        0
    23 |                                                       24
    24 |                                                        0
    25 |@@@@@@@@@@                                          57037
    26 |                                                        0
    27 |                                                        0
     ~
    29 |                                                        0
    30 |                                                        0
    31 |                                                        1
    32 |                                                        3
    33 |                                                       78
    34 |                                                      215
    35 |                                                     1865
    36 |                                                      432
    37 |                                                        0
    38 |                                                        0
    39 |                                                        0
    40 |                                                        0
    41 |                                                      130
    42 |                                                        0
    43 |                                                       80
    44 |                                                        0
    45 |                                                        0
    46 |                                                        0
    47 |                                                        0
    48 |                                                       80
    49 |                                                        0
    50 |                                                     1077
   >50 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  264878

src unalignments
 value |-------------------------------------------------- count
     0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       23173
     1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  17623
     2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      23760
     3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  17372
     4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               19185
     5 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  26264
     6 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             20288
     7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                18474
     8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             20160
     9 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  17754
    10 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           21450
    11 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 18127
    12 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        23075
    13 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                18582
    14 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          21879
    15 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                18737
    16 |                                                        0

dst unalignments
 value |-------------------------------------------------- count
     0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  28566
     1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     17449
     2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               20980
     3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     17239
     4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     17171
     5 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                        15691
     6 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                20558
     7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   18590
     8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               20644
     9 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@              21459
    10 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                20384
    11 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 19460
    12 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           23087
    13 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 19656
    14 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     26330
    15 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   18639
    16 |                                                        0

same unalignments
 value |-------------------------------------------------- count
     0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   5695
     1 |@@@@@@@@@@@@@@@                                      1815
     2 |@@@@@@@@@@@@@@@@@@@@@@@@@                            2850
     3 |@@@@@@@@@@@@@@@                                      1819
     4 |@@@@@@@@@@@@@@@@@@@@@@@@                             2791
     5 |@@@@@                                                 573
     6 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        3358
     7 |@@@@@@@@@@@@@@@@@@@@@                                2411
     8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        3340
     9 |@@@@@@@@@@@@@@@@@@@@@                                2404
    10 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       3475
    11 |@@@@@@@@@@@@@@@@@@@@@@@@@@                           3019
    12 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 4153
    13 |@@@@@@@@@@@@@@@@@@@@@@@@@@                           3052
    14 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 4212
    15 |@@@@@@@@@@@@@@@@@@@@@@@@@@                           3011
    16 |                                                        0

different unalignments
 value |-------------------------------------------------- count
     0 |                                                        0
     1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      41652
     2 |@@@@@@@@@@@@@@@@@@                                 25447
     3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  69946
     4 |@                                                    1800
     5 |@                                                    1701
     6 |@                                                    1573
     7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       62478
     8 |                                                      810
     9 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                49180
    10 |                                                      684
    11 |                                                      148
    12 |                                                     1148
    13 |@@@@@@@@@@                                          15298
    14 |@@                                                   3459
    15 |@                                                    2601
    16 |                                                        0

According to the data, the length of most copies is >= 128.

Besides that, I did some tests under various conditions on my x86_64 box.
The test does 500 bytes of memory copy 50,000 times. The results follow:

dest < src || dest >= src + len:

Length   Src Unalign   Dest Unalign   Without Patch   Patch applied
------   -----------   ------------   -------------   -------------
     8             0              0     0s 223086us     0s 230612us
     8             0              3     0s 133857us     0s 138364us
     8             0              6     0s 133852us     0s 138364us
     8             3              0     0s 133845us     0s 138365us
     8             3              3     0s 133845us     0s 138364us
     8             3              6     0s 133841us     0s 138363us
     8             6              0     0s 133848us     0s 138364us
     8             6              3     0s 133842us     0s 138363us
     8             6              6     0s 133840us     0s 138360us
    16             0              0     0s 133847us     0s 138364us
    16             0              3     0s 133851us     0s 138362us
    16             0              6     0s 133842us     0s 138368us
    16             3              0     0s 133849us     0s 138360us
    16             3              3     0s 133844us     0s 138362us
    16             3              6     0s 133839us     0s 138365us
    16             6              0     0s 133845us     0s 138359us
    16             6              3     0s 133841us     0s 138368us
    16             6              6     0s 133841us     0s 138363us
    32             0              0     0s 160914us     0s 165435us
    32             0              3     0s 160925us     0s 165427us
    32             0              6     0s 160898us     0s 165443us
    32             3              0     0s 160930us     0s 165432us
    32             3              3     0s 160898us     0s 165434us
    32             3              6     0s 160919us     0s 165433us
    32             6              0     0s 160909us     0s 165436us
    32             6              3     0s 160914us     0s 165425us
    32             6              6     0s 160910us     0s 165439us
   256             0              0     0s 294756us     0s 299280us
   256             0              3     0s 500777us     0s 505367us
   256             0              6     0s 655671us     0s 660232us
   256             3              0     0s 497769us     0s 503386us
   256             3              3     0s 497790us     0s 502358us
   256             3              6     0s 500793us     0s 505253us
   256             6              0     0s 497751us     0s 503097us
   256             6              3     0s 497773us     0s 502242us
   256             6              6     0s 497769us     0s 502192us
  1024             0              0     0s 457170us     0s 461843us
  1024             0              3     1s 655705us     1s 660707us
  1024             0              6     2s 388031us     2s 391429us
  1024             3              0     1s 652717us     1s 660362us
  1024             3              3     2s 755214us     1s 657005us
  1024             3              6     1s 655735us     1s 660939us
  1024             6              0     1s 669425us     1s 662643us
  1024             6              3     1s 653472us     1s 659986us
  1024             6              6     1s 653559us     1s 662163us

dest > src && dest < src + len:

Length   Src Unalign   Dest Unalign   Without Patch   Patch applied
------   -----------   ------------   -------------   -------------
     8             0              3     0s  45029us     0s  45775us
     8             0              6     0s  43634us     0s  48014us
     8             3              0     0s  43625us     0s  43993us
     8             3              6     0s  44324us     0s  45081us
     8             6              0     0s  43613us     0s  44014us
     8             6              3     0s  43620us     0s  44011us
    16             0              0     0s  67680us     0s  49631us
    16             0              3     0s  67677us     0s  77417us
    16             0              6     0s  67671us     0s  81879us
    16             3              0     0s  67676us     0s  52492us
    16             3              3     0s  67675us     0s  75199us
    16             3              6     0s  67677us     0s  81215us
    16             6              0     0s  67673us     0s  52490us
    16             6              3     0s  67676us     0s  77304us
    16             6              6     0s  67676us     0s  79712us
    32             0              0     0s 115800us     0s  49632us
    32             0              3     0s 115812us     0s  81214us
    32             0              6     0s 115792us     0s  85728us
    32             3              0     0s 116488us     0s  60198us
    32             3              3     0s 115803us     0s  79709us
    32             3              6     0s 115798us     0s  85726us
    32             6              0     0s 115844us     0s  60202us
    32             6              3     0s 115805us     0s  81212us
    32             6              6     0s 115802us     0s  84228us
   256             0              0     0s 815079us     0s  91737us
   256             0              3     0s 815075us     0s 194652us
   256             0              6     0s 815079us     0s 198521us
   256             3              0     0s 815074us     0s 171470us
   256             3              3     0s 817181us     0s 114299us
   256             3              6     0s 816478us     0s 198524us
   256             6              0     0s 815822us     0s 173566us
   256             6              3     0s 814936us     0s 196515us
   256             6              6     0s 815152us     0s 118811us
  1024             0              0     3s 125598us     0s 244644us
  1024             0              3     3s 132574us     0s 590261us
  1024             0              6     3s 125613us     0s 595143us
  1024             3              0     3s 128452us     0s 568743us
  1024             3              3     3s 124862us     0s 263189us
  1024             3              6     3s 127440us     0s 595515us
  1024             6              0     3s 122370us     0s 569479us
  1024             6              3     3s 127429us     0s 590554us
  1024             6              6     3s 126732us     0s 269415us

Though there is a little regression when the length is <= 16, the
performance is quite good when the length is >= 32.

Thanks!
Miao
Andi Kleen
2010-Sep-08 12:19 UTC
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
> According to the data, the length of most copies is >= 128.

Thanks for the data. Large is easier to optimize than small; that's good.

Could you also measure how many memsets need the backwards copy?
(should be easy to add)

If the number that needs backwards is small, then the easiest fix
would be to simply call the normal memcpy in the forward case.

Alternatively, the backward case could also use a string instruction
copy, of course; you just have to set the direction flag.

That would be a very small code change.

-Andi
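[A minimal sketch of the string-instruction idea above, added for
illustration; x86-64 only, untested, and not kernel code. It assumes the
usual ABI rule that the direction flag is clear on entry, which is why it
ends with cld.]

/* Backward copy via rep movsb: point RSI/RDI at the last bytes, set DF
 * so movsb walks downwards, then restore DF=0. Illustrative sketch. */
static void *bwd_string_copy(void *dest, const void *src, size_t n)
{
	unsigned char *d;
	const unsigned char *s;

	if (!n)
		return dest;
	d = (unsigned char *)dest + n - 1;      /* last byte of dest */
	s = (const unsigned char *)src + n - 1; /* last byte of src  */
	asm volatile("std\n\t"       /* DF=1: copy towards lower addresses */
		     "rep movsb\n\t"
		     "cld"           /* restore the expected DF=0 state */
		     : "+D" (d), "+S" (s), "+c" (n)
		     : : "memory");
	return dest;
}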
Miao Xie
2010-Sep-08 12:57 UTC
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
On Wed, 8 Sep 2010 14:19:25 +0200 (CEST), Andi Kleen wrote:
>
>> According to the data, the length of most copies is >= 128.
>
> Thanks for the data. Large is easier to optimize than small; that's good.
>
> Could you also measure how many memsets need the backwards copy?
> (should be easy to add)

I think memset doesn't need the backwards copy. The reason why memmove
needs the backward copy is that a forward copy may overwrite data which
will be used later, if the dest string overlaps the src string; see the
sketch after this message. But memset needn't worry about this problem.

Thanks!
Miao

> If the number that needs backwards is small, then the easiest fix
> would be to simply call the normal memcpy in the forward case.
>
> Alternatively, the backward case could also use a string instruction
> copy, of course; you just have to set the direction flag.
>
> That would be a very small code change.
>
> -Andi
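[To spell out the direction rule Miao describes above, here is a minimal
generic sketch, added for illustration; it is not the patch's
implementation, which uses the word-copy machinery instead of byte loops.]

#include <stddef.h>

/* If dest lies inside [src, src + n), a forward copy would clobber
 * source bytes before they are read, so the copy must run backwards.
 * Otherwise a forward copy is safe. */
static void *memmove_sketch(void *dest, const void *src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	if (d < s || d >= s + n) {
		while (n--)          /* no harmful overlap: copy forward */
			*d++ = *s++;
	} else {
		d += n;
		s += n;
		while (n--)          /* dest overlaps the tail of src */
			*--d = *--s;
	}
	return dest;
}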
Andi Kleen
2010-Sep-08 13:05 UTC
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
On Wed, 08 Sep 2010 20:57:07 +0800, Miao Xie <miaox@cn.fujitsu.com> wrote:

> On Wed, 8 Sep 2010 14:19:25 +0200 (CEST), Andi Kleen wrote:
>>
>>> According to the data, the length of most copies is >= 128.
>>
>> Thanks for the data. Large is easier to optimize than small; that's
>> good.
>>
>> Could you also measure how many memsets need the backwards copy?
>> (should be easy to add)
>
> I think memset doesn't need the backwards copy.

I meant for memmove, of course. Obviously memset doesn't need a backwards
copy. That was just the only thing the script didn't measure, because
the original version didn't have memmove support.

Your whole thread was about making memmove faster, right?

-Andi
--
ak@linux.intel.com -- Speaking for myself only.
Miao Xie
2010-Sep-08 13:32 UTC
Re: [PATCH V2 1/3] lib: introduce some memory copy macros and functions
On Wed, 8 Sep 2010 15:05:19 +0200, Andi Kleen wrote:
> On Wed, 08 Sep 2010 20:57:07 +0800, Miao Xie <miaox@cn.fujitsu.com> wrote:
>
>> On Wed, 8 Sep 2010 14:19:25 +0200 (CEST), Andi Kleen wrote:
>>>
>>>> According to the data, the length of most copies is >= 128.
>>>
>>> Thanks for the data. Large is easier to optimize than small; that's
>>> good.
>>>
>>> Could you also measure how many memsets need the backwards copy?
                                         ^^^^^^^
:-)

>>> (should be easy to add)
>>
>> I think memset doesn't need the backwards copy.
>
> I meant for memmove, of course. Obviously memset doesn't need a backwards
> copy. That was just the only thing the script didn't measure, because
> the original version didn't have memmove support.

memmove total 5224252

[snip]

forward copy
 value |-------------------------------------------------- count
     0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  980253

backward copy
 value |-------------------------------------------------- count
     0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4243999

The backward copy is much more common than the forward copy.

Thanks
Miao