Clement Courbet via llvm-dev
2019-Jan-11 12:45 UTC
[llvm-dev] [RFC] Adding a -memeq-lib-function flag to allow the user to specify a memeq function.
On Thu, Jan 10, 2019 at 4:47 PM Clement Courbet <courbet at google.com> wrote:
> On Wed, Jan 9, 2019 at 6:16 PM James Y Knight <jyknight at google.com> wrote:
>> On Tue, Jan 8, 2019 at 9:24 AM Clement Courbet <courbet at google.com> wrote:
>>> On Mon, Jan 7, 2019 at 10:26 PM James Y Knight <jyknight at google.com> wrote:
>>>
>>> I'm afraid about the "almost" and "generally": what about users who
>>> don't?
>>
>> Even so, it should be fine to enable it for those platforms which do
>> include it.
>>
>>>> I do note, sadly, that currently out of all these implementations, only
>>>> NetBSD and FreeBSD seem to actually define a separate more optimized bcmp
>>>> function. That does mean that this optimization would be effectively a
>>>> no-op, for the vast majority of people.
>>>
>>> This might or might not be considered really an issue.
>>
>> Right, the issue is adding an effectively useless optimization in llvm.
>>
>>> - In my original proposal, people have to explicitly opt in to the
>>>   feature and link to their memcmp implementation; they do not get the
>>>   improvement automatically.
>>> - In this proposal, they have to patch their libc, which might be
>>>   slightly more painful depending on the system.
>>
>> Users may also include a function named bcmp in their binary, which will
>> override the one from libc.
>>
>>> Here's a patch with this proposal to see what this looks like:
>>> https://reviews.llvm.org/D56436
>>
>> It feels like this optimization would be better done in
>> llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp,
>
> I'll have a look at this approach.

You're right, that's a better place to do this indeed:
https://reviews.llvm.org/D56593

>>>> But if you can show a similar performance win in public code, it'd be
>>>> great to attempt to push a more optimized version upstream at least to
>>>> glibc. Some more precise numbers than "very large improvement" are
>>>> probably necessary to show it's actually worth it. :)
>>>
>>> We were planning to contribute it to compiler-rt. Contributing a
>>> deprecated function to the libc sounds.... weird.
>>
>> Yes, contributing an optimization for a deprecated function is indeed
>> weird. Thus the importance of reliable performance numbers justifying the
>> importance -- I'd never have thought that the performance cost of returning
>> an ordering from memcmp would be important, and I suspect nobody else did.
>
> Fair enough, let me give some numbers for this change.
> Before that, a caveat with any benchmarks for comparing strings is that
> the results depend a lot on the distribution of sizes and content of these
> strings. So it makes more sense to benchmark an actual application, and we
> have our own custom benchmarks.
> That being said, one of the cases where we have found this optimization to
> be impactful is `operator==(const string&, const string&)`. libcxx has a
> family of benchmarks for `BM_StringRelational_Eq`
> <https://github.com/llvm-mirror/libcxx/blob/master/benchmarks/algorithms.bench.cpp>,
> which I'm going to use here.
>
> BM_StringRelational_Eq benchmarks comparison of strings of size 7 (Small),
> 63 (Large) and 5k (Huge) characters, in four scenarios:
> - The equal case (Control), which is theoretically the worst case as you
>   have to prove that all bytes are equal.
> - The cases where the strings differ. Here bcmp() only needs to prove
>   that one byte differs to return nonzero. Strings typically differ at
>   the start (ChangeFirst), but also, interestingly, at the end
>   (ChangeLast): this is what you get when comparing strings with a common
>   prefix, which happens frequently, e.g. when comparing subsequent
>   elements of a sorted list of strings. The last case is a change
>   position in the middle (ChangeMiddle).
>
> For this comparison, I'm using as base the call to `memcmp`, and as
> experiment the following crude bcmp() implementation (I'm assuming X86_64),
> which shows how we can take advantage of the observations above to optimize
> typical cases.
>
> ```
> #include <cstddef>
> #include <cstdint>
> #include <cstring>
>
> // Assumes the target tolerates unaligned 64-bit loads (X86_64 here).
> #define UNALIGNED_LOAD64(_p) (*reinterpret_cast<const uint64_t *>(_p))
>
> extern "C" int bcmp(const void* p1, const void* p2, size_t n) throw() {
>   const char* a = reinterpret_cast<const char*>(p1);
>   const char* b = reinterpret_cast<const char*>(p2);
>   if (n >= 8) {
>     // XOR the first and the last 8 bytes; any set bit means a difference.
>     uint64_t u = UNALIGNED_LOAD64(a) ^ UNALIGNED_LOAD64(b);
>     uint64_t v = UNALIGNED_LOAD64(a + n - 8) ^ UNALIGNED_LOAD64(b + n - 8);
>     if ((u | v) != 0) {
>       return 1;  // Only "nonzero" matters: bcmp() reports no ordering.
>     }
>   }
>   // Head and tail are equal (or n < 8): fall back to a full comparison.
>   return memcmp(a, b, n);
> }
> ```
>
> Note that:
> - There is a bit of noise in the results, but you'll see that this quite
>   dumb bcmp() reasonably improves {Large,Huge}_{ChangeFirst,ChangeLast}
>   (note that the improvement to {Large,Huge}_ChangeLast cannot be achieved
>   with the semantics of memcmp) without hurting the `ChangeMiddle` and
>   `Control` cases.
> - The small string case (size==7) is not modified by our change because
>   there is a "fast path" for very small sizes in operator==.
> - We are still experimenting with the final bcmp() implementation (in
>   particular improving the `Control` and `ChangeMiddle` cases by improving
>   parallelism). Our current version is better than memcmp() across the
>   board on X86.
>
>                                                   base (ns)  exp (ns)
> BM_StringRelational_Eq_Empty_Empty_Control        1.65       1.7
> BM_StringRelational_Eq_Empty_Small_Control        1.37       1.4
> BM_StringRelational_Eq_Empty_Large_Control        1.37       1.44
> BM_StringRelational_Eq_Empty_Huge_Control         1.38       1.44
> BM_StringRelational_Eq_Small_Small_Control        6.53       6.51
> BM_StringRelational_Eq_Small_Small_ChangeFirst    1.96       1.94
> BM_StringRelational_Eq_Small_Small_ChangeMiddle   5.06       4.95
> BM_StringRelational_Eq_Small_Small_ChangeLast     6.77       6.84
> BM_StringRelational_Eq_Small_Large_Control        1.38       1.41
> BM_StringRelational_Eq_Small_Huge_Control         1.37       1.39
> BM_StringRelational_Eq_Large_Large_Control        5.54       5.8
> BM_StringRelational_Eq_Large_Large_ChangeFirst    6.25       3.06
> BM_StringRelational_Eq_Large_Large_ChangeMiddle   5.5        5.94
> BM_StringRelational_Eq_Large_Large_ChangeLast     6.04       3.42
> BM_StringRelational_Eq_Large_Huge_Control         1.1        1.1
> BM_StringRelational_Eq_Huge_Huge_Control          118        118
> BM_StringRelational_Eq_Huge_Huge_ChangeFirst      5.65       3.02
> BM_StringRelational_Eq_Huge_Huge_ChangeMiddle     69.5       66.9
> BM_StringRelational_Eq_Huge_Huge_ChangeLast       118        3.43
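The head-and-tail check in the crude bcmp() quoted above can also be written without relying on X86_64 tolerating unaligned loads. The following is only a minimal sketch under stated assumptions, not the implementation that was benchmarked: `load64` and `bcmp_sketch` are hypothetical names (the latter chosen to avoid clashing with the libc `bcmp`), and the loads go through `memcpy`, which compilers typically lower to a plain load on targets that allow it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reads 8 bytes with no alignment requirement.
static inline std::uint64_t load64(const char* p) {
  std::uint64_t v;
  std::memcpy(&v, p, sizeof v);
  return v;
}

// Hypothetical name, not the libc bcmp. Returns zero iff the two n-byte
// ranges are equal; a nonzero result carries no ordering information.
int bcmp_sketch(const void* p1, const void* p2, std::size_t n) {
  const char* a = static_cast<const char*>(p1);
  const char* b = static_cast<const char*>(p2);
  if (n >= 8) {
    // Compare the first and last 8 bytes up front; the two windows may
    // overlap when n < 16, which is harmless for an equality test.
    const std::uint64_t head = load64(a) ^ load64(b);
    const std::uint64_t tail = load64(a + n - 8) ^ load64(b + n - 8);
    if ((head | tail) != 0) return 1;
  }
  // Head and tail match (or n < 8): fall back to a full comparison.
  return std::memcmp(a, b, n);
}
```

Returning early on a tail mismatch is only valid because the caller cannot observe an ordering: memcmp would have to keep scanning from the front in case an earlier byte also differs.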
Clement Courbet via llvm-dev
2019-Feb-05 07:28 UTC
[llvm-dev] [RFC] Adding a -memeq-lib-function flag to allow the user to specify a memeq function.
I'd like to move forward with this. Since there are no objections to the approach suggested by James, I've tentatively sent the corresponding v5 implementation (https://reviews.llvm.org/D56593) for review.
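For readers following the thread, the distinction it turns on is whether a caller observes the ordering that memcmp returns. Below is a minimal sketch of the two kinds of call sites, assuming (this is not spelled out in the messages above) that a rewrite to bcmp is only valid for the first kind. The function names `same_bytes` and `less_bytes` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Equality-only use: the result is only compared against zero, so the
// ordering computed by memcmp is never observed by the caller.
bool same_bytes(const void* a, const void* b, std::size_t n) {
  return std::memcmp(a, b, n) == 0;
}

// Ordering use: the sign of the result matters, so the full memcmp
// semantics are required here.
bool less_bytes(const void* a, const void* b, std::size_t n) {
  return std::memcmp(a, b, n) < 0;
}
```

`operator==` on strings falls in the first category, which is why the benchmark numbers earlier in the thread focus on it.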
Chandler Carruth via llvm-dev
2019-Feb-05 08:27 UTC
[llvm-dev] [RFC] Adding a -memeq-lib-function flag to allow the user to specify a memeq function.
FWIW, I'm still somewhat inclined to suggest LLVM add intrinsics analogous to memcpy, memmove, and memset for memcmp and bcmp so that we can model these cleanly and then expand them late (backend) instead of early even in cases of short sequences, etc.

That doesn't make this initial patch bad, quite the opposite, I think the patch you have is going in the right direction. I just think we need to push beyond this a bit as well.