dag at cray.com
2014-Oct-24 17:22 UTC
[LLVMdev] Adding masked vector load and store intrinsics
"Das, Dibyendu" <Dibyendu.Das at amd.com> writes:

> This looks to be a reasonable proposal. However native instructions
> that support such masked ld/st may have a high latency? Also, it
> would be good to state some workloads where this will have a positive
> impact.

Any significant vector workload will see a giant gain from this. The masked operations really shouldn't have any more latency. The time of the memory operation itself dominates.

-David
Das, Dibyendu
2014-Oct-24 18:44 UTC
[LLVMdev] Adding masked vector load and store intrinsics
Is there an example of such a workload (let's say from the SPEC CPU 2006 harness or similar) that you have in mind, and the amount of gain expected?

- dibyendu
dag at cray.com
2014-Oct-24 19:50 UTC
[LLVMdev] Adding masked vector load and store intrinsics
"Das, Dibyendu" <Dibyendu.Das at amd.com> writes:

> Is there an example of such a workload (let's say from the SPEC CPU
> 2006 harness or similar) that you have in mind and the amount of gain
> expected?

Literally nearly every code that has significant vector work in it. Even if there is no control flow in the loop, masking allows the compiler to more aggressively vectorize and rely on the masks to prevent unsafe execution.

The amount of gain is highly code-dependent, but my guess is that Elena's example of 2x speedup is typical, maybe even on the lower end. The capability of the vectorizer is the biggest factor. Without masks, the vectorizer cannot be as aggressive. With masks, the vectorizer still has to be written to be aggressive. Ph.D. dissertations have been written on the topic. It's non-trivial work.

Masking is an enabling technology, not an end goal.

-David
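
To make the "rely on the masks to prevent unsafe execution" point concrete, here is a minimal sketch in plain C. It only emulates masked loads and stores lane by lane; it is not the proposed intrinsics, and the names `VLEN`, `masked_load`, `masked_store`, `scalar_update`, and `masked_update` are hypothetical, chosen for illustration. The assumption is that inactive lanes never touch memory, so one mask can cover both the `if` in the loop body and the lanes past the end of the array.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 8 /* hypothetical vector width, in lanes */

/* Scalar reference: b[i] is read and a[i] is written only when cond[i] != 0. */
void scalar_update(int *a, const int *b, const int *cond, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (cond[i])
            a[i] = b[i] + 1;
}

/* Emulated masked load: inactive lanes do not access src and receive the
   pass-through value instead. */
static void masked_load(int *dst, const int *src,
                        const unsigned char *mask, int passthru) {
    for (int l = 0; l < VLEN; l++)
        dst[l] = mask[l] ? src[l] : passthru;
}

/* Emulated masked store: inactive lanes leave memory untouched. */
static void masked_store(int *dst, const int *val, const unsigned char *mask) {
    for (int l = 0; l < VLEN; l++)
        if (mask[l])
            dst[l] = val[l];
}

/* "Vectorized" form of scalar_update: one mask per block of VLEN elements
   combines the loop condition with the bounds check, so no remainder loop
   is needed and no out-of-range element is read or written. */
void masked_update(int *a, const int *b, const int *cond, size_t n) {
    assert(a && b && cond);
    for (size_t i = 0; i < n; i += VLEN) {
        unsigned char mask[VLEN];
        int vb[VLEN];
        for (int l = 0; l < VLEN; l++)
            mask[l] = (i + (size_t)l < n) && cond[i + (size_t)l] != 0;
        masked_load(vb, b + i, mask, 0);
        for (int l = 0; l < VLEN; l++)
            vb[l] += 1; /* unconditional arithmetic on every lane */
        masked_store(a + i, vb, mask);
    }
}
```

Without masked memory operations, a vectorizer would typically have to keep the conditional store scalar or prove that touching the inactive elements is safe. With them, the arithmetic can run on all lanes while the mask guards the memory accesses.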