Sanjay Patel via llvm-dev
2016-Mar-10 22:06 UTC
[llvm-dev] masked-load endpoints optimization
If we're loading the first and last elements of a vector using a masked load [1], can we replace the masked load with a full vector load?

"The result of this operation is equivalent to a regular vector load instruction followed by a ‘select’ between the loaded and the passthru values, predicated on the same mask. However, using this intrinsic prevents exceptions on memory access to masked-off lanes."

I think the fact that we're loading the endpoints of the vector guarantees that a full vector load can't have any different faulting/exception behavior on x86 and most (?) other targets. We would, however, be reading memory that the program has not explicitly requested.

IR example:

define <4 x i32> @maskedload_endpoints(<4 x i32>* %addr, <4 x i32> %v) {
  ; load the first and last elements pointed to by %addr and shuffle those into %v
  %res = call <4 x i32> @llvm.masked.load.v4i32(<4 x i32>* %addr, i32 4, <4 x i1> <i1 1, i1 0, i1 0, i1 1>, <4 x i32> %v)
  ret <4 x i32> %res
}

would become something like:

define <4 x i32> @maskedload_endpoints(<4 x i32>* %addr, <4 x i32> %v) {
  %vecload = load <4 x i32>, <4 x i32>* %addr, align 4
  %sel = select <4 x i1> <i1 1, i1 0, i1 0, i1 1>, <4 x i32> %vecload, <4 x i32> %v
  ret <4 x i32> %sel
}

If this isn't valid as an IR optimization, would it be acceptable as a DAG combine with a target hook to opt in?

[1] http://llvm.org/docs/LangRef.html#llvm-masked-load-intrinsics
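To make the quoted equivalence concrete, here is a lane-by-lane simulation of the two forms (a sketch in Python, not LLVM code; the function names are made up for illustration):

```python
def masked_load(mem, mask, passthru):
    # llvm.masked.load semantics: enabled lanes read memory,
    # disabled lanes take the corresponding passthru element.
    return [mem[i] if mask[i] else passthru[i] for i in range(len(mask))]

def full_load_then_select(mem, mask, passthru):
    # The proposed replacement: one full vector load that reads
    # every lane (including masked-off ones), then a select.
    loaded = list(mem)
    return [loaded[i] if mask[i] else passthru[i] for i in range(len(mask))]

mask = [1, 0, 0, 1]          # endpoints enabled, as in the IR example
mem = [10, 20, 30, 40]
v = [1, 2, 3, 4]
assert masked_load(mem, mask, v) == full_load_then_select(mem, mask, v) == [10, 2, 3, 40]
```

The two forms always produce the same values; the only difference is which memory is touched, which is exactly the faulting question above.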
Nema, Ashutosh via llvm-dev
2016-Mar-11 05:22 UTC
[llvm-dev] masked-load endpoints optimization
This looks interesting; the main motivation appears to be replacing a masked vector load with a general vector load followed by a select.

Masked vector loads are in general expensive in comparison with a regular vector load. But if the first and last elements of a masked vector load are guaranteed to be accessed, then it can be transformed to a vector load.

In opt this can be driven by TTI, where the benefit of this transformation should be checked.

Regards,
Ashutosh
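A TTI-driven gate could look roughly like the following sketch (Python pseudocode with invented cost numbers; a real implementation would query TargetTransformInfo for the masked-load, load, and select costs on the actual vector type):

```python
# Hypothetical per-target instruction costs; real values would come
# from TTI cost queries, not these made-up numbers.
COST = {"masked_load": 4, "vector_load": 1, "select": 1}

def should_convert(mask):
    # Legality: the first and last lanes must be enabled so the full
    # vector load touches no memory the masked load could not touch.
    if not (mask[0] and mask[-1]):
        return False
    # Profitability: load + select must not cost more than the masked load.
    return COST["vector_load"] + COST["select"] <= COST["masked_load"]

assert should_convert([1, 0, 0, 1])      # endpoints enabled -> convert
assert not should_convert([0, 1, 1, 1])  # first lane off -> keep masked load
```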
Sanjay Patel via llvm-dev
2016-Mar-11 16:57 UTC
[llvm-dev] masked-load endpoints optimization
Thanks, Ashutosh. Yes, either TTI or TLI could be used to limit the transform if we do it in CGP rather than the DAG.

The real question I have is whether it is legal to read the extra memory, regardless of whether this is a masked load or something else.

Note that the x86 backend already does this, so either my proposal is ok for x86, or we're already doing an illegal optimization:

define <4 x i32> @load_bonus_bytes(i32* %addr1, <4 x i32> %v) {
  %ld1 = load i32, i32* %addr1
  %addr2 = getelementptr i32, i32* %addr1, i64 3
  %ld2 = load i32, i32* %addr2
  %vec1 = insertelement <4 x i32> undef, i32 %ld1, i32 0
  %vec2 = insertelement <4 x i32> %vec1, i32 %ld2, i32 3
  ret <4 x i32> %vec2
}

$ ./llc -o - loadcombine.ll
...
movups (%rdi), %xmm0
retq

On Thu, Mar 10, 2016 at 10:22 PM, Nema, Ashutosh <Ashutosh.Nema at amd.com> wrote:
> In opt this can be driven by TTI, where the benefit of this transformation
> should be checked.
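The no-new-faults argument can be checked mechanically: memory protection is page-granular, so every byte of the full 16-byte load lies on a page already touched by the 4-byte loads of the first or last element. A small sketch, assuming a 4 KiB page size for illustration:

```python
PAGE = 4096  # assumed page size; the argument only needs pages >= the vector size

def pages_touched(addr, size):
    # Set of page indices covered by a [addr, addr + size) access.
    return set(range(addr // PAGE, (addr + size - 1) // PAGE + 1))

def full_load_adds_no_pages(addr):
    # First element: 4 bytes at addr; last element: 4 bytes at addr + 12.
    endpoint_pages = pages_touched(addr, 4) | pages_touched(addr + 12, 4)
    # The full <4 x i32> load covers 16 bytes starting at addr.
    return pages_touched(addr, 16) <= endpoint_pages

# Holds for every address, aligned or not, including page-straddling ones.
assert all(full_load_adds_no_pages(a) for a in range(4080, 4112))
```

This is why accessing the endpoints is the key precondition: a mask that leaves an endpoint off (e.g. <i1 0, i1 1, i1 1, i1 1>) gives no such guarantee for the first lane's page.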