Hi Frank,

> To answer Nadav's question: this kind of loop is generated by a scientific library and we are in the process of evaluating whether LLVM can be used for this research project. The target architectures will have (very wide) vector instructions and these loops are performance-critical to the application. Thus it would be important that these loops can make use of the vector units.

Does your CPU have good scatter/gather support? It will be easy to add support for scatter/gather operations to the LLVM Loop Vectorizer. The current design focuses on SIMD vectors and it probably does not have all of the features that are needed for wide-vector vectorization.

> Right now it seems LLVM cannot vectorize these loops. We might have some time to look into this, but it's not certain yet. However, high-level guidance from LLVM pros would be very useful.
>
> What is the usual way of requesting an improvement or feature? Is this mailing list the central place to communicate?

You can open bugs on Bugzilla, but I think that the best way to move forward is to continue the discussion on the mailing list. Are there other workloads that are important to you? Are there any other problems that you ran into?

Thanks,
Nadav
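A gather is what the vectorizer needs whenever a loop loads through an index it cannot prove to be consecutive. A minimal C sketch of such a loop follows; the function and array names are illustrative only and are not the test function discussed in this thread:

    /* Indirect load: b[idx[i]] touches non-consecutive addresses, so a plain
       vector load is not enough; the target needs a gather instruction, or the
       vectorizer must emit one scalar load per lane. */
    void axpy_indexed(float *a, const float *b, const int *idx, int n) {
        for (int i = 0; i < n; ++i)
            a[i] += b[idx[i]];   /* a[i] is consecutive, b[idx[i]] is not */
    }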
Hi Nadav,

We are looking at a variety of target architectures. Ultimately we aim to run on BG/Q and Intel Xeon Phi (native). However, running on those architectures with the LLVM technology is planned for some point in the future. As a first step we would target vanilla x86 with SSE/AVX 128/256 as a proof of concept.

Most of our generated functions implement pure data-parallel operations which suit vector instructions. There are of course some kernels that require scatter/gather, but I don't worry about those right now.

What I don't understand: how can the loop vectorizer be good at a small vector size but not so good at a large one? (I guess this is what you mean by SIMD vector as a 'small vector'.) Isn't this functionality completely generic in the loop vectorizer, so that its algorithm doesn't care about the actual 'width' of the vector?

Why did you bring up gather/scatter instructions? The test function doesn't make use of them. What's the role of gather/scatter in the loop vectorizer? I know one needs to insert/extract values to/from vectors in order to use them in scalar operations. But in the case here, there are no scalar operations. That's what I mean when I say these functions implement purely data-parallel/vector operations.

Regarding whether we have other problems, that's the good news: there are no other problems. Our application already runs (and is correct) using the LLVM JIT, but only with a data layout that's not optimal for CPU architectures. In that case the functions get vectorized, but the application performance is hurt by cache thrashing. Applying an optimized data layout, which maximizes cache line reuse, introduces the 'rem' and 'div' instructions mentioned earlier, which seem to make the vectorizer fail (or perhaps the scalar evolution analysis pass).

Is there fundamental functionality missing in the auto-vectorizer when the target vector size increases to 512 bits (instead of 128, for example)? And why? What needs to be done (on a high level) in order to have the auto-vectorizer succeed on the test function as given earlier?

Frank
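To make the 'rem'/'div' point concrete, here is a minimal C sketch of the kind of index arithmetic a tiled data layout typically introduces. The tile size B, the names, and the layout are assumptions for illustration, not the actual generated function:

    #define B 64   /* hypothetical tile size */

    /* With a tiled layout, element i lives in block (i / B) at offset (i % B),
       and blocks are 'stride' floats apart.  The div/rem in the address
       computation keep scalar evolution from proving that consecutive i touch
       consecutive addresses, so the loop vectorizer gives up. */
    void scale_tiled(float *a, const float *x, float s, int n, int stride) {
        for (int i = 0; i < n; ++i) {
            int block = i / B;             /* the 'div' */
            int off   = i % B;             /* the 'rem' */
            a[block * stride + off] = s * x[block * stride + off];
        }
    }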
Hi Frank,

> We are looking at a variety of target architectures. Ultimately we aim to run on BG/Q and Intel Xeon Phi (native). However, running on those architectures with the LLVM technology is planned for some point in the future. As a first step we would target vanilla x86 with SSE/AVX 128/256 as a proof of concept.

Great! It should be easy to support these targets. When you said wide vectors I assumed that you meant old-school vector processors. Elena Demikhovsky is working on adding AVX512 support, and once she is done things should just work. We will need to support some of the new features of AVX512, such as predication and scatter/gather, to make the most out of this CPU. I don't know too much about BG/Q, but maybe Hal can provide more info.

> Most of our generated functions implement pure data-parallel operations which suit vector instructions. There are of course some kernels that require scatter/gather, but I don't worry about those right now.
> What I don't understand: how can the loop vectorizer be good at a small vector size but not so good at a large one? (I guess this is what you mean by SIMD vector as a 'small vector'.) Isn't this functionality completely generic in the loop vectorizer, so that its algorithm doesn't care about the actual 'width' of the vector?
> Why did you bring up gather/scatter instructions? The test function doesn't make use of them.

If scatter/gather were free (or low cost), it would allow vectorization of many more loops, because the high cost of non-consecutive memory operations often prevents vectorization.

> What's the role of gather/scatter in the loop vectorizer?

Simply to load/store non-consecutive memory locations.

> I know one needs to insert/extract values to/from vectors in order to use them in scalar operations. But in the case here, there are no scalar operations. That's what I mean when I say these functions implement purely data-parallel/vector operations.
>
> Regarding whether we have other problems, that's the good news: there are no other problems. Our application already runs (and is correct) using the LLVM JIT, but only with a data layout that's not optimal for CPU architectures. In that case the functions get vectorized, but the application performance is hurt by cache thrashing. Applying an optimized data layout, which maximizes cache line reuse, introduces the 'rem' and 'div' instructions mentioned earlier, which seem to make the vectorizer fail (or perhaps the scalar evolution analysis pass).
>
> Is there fundamental functionality missing in the auto-vectorizer when the target vector size increases to 512 bits (instead of 128, for example)? And why?

A scatter/gather cost model (and possibly intrinsics), support for predicated instructions, and an AVX512 cost model.

> What needs to be done (on a high level) in order to have the auto-vectorizer succeed on the test function as given earlier?

Maybe you could rewrite the loop in a way that exposes contiguous memory accesses. Is this something you could do?

Thanks,
Nadav
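A sketch of the kind of rewrite being suggested, under the same assumed tiled layout as the previous sketch: split the loop over i into an outer loop over blocks and an inner loop over the offset, so the div/rem disappear and the inner loop walks consecutive addresses:

    #define B 64   /* same hypothetical tile size as above */

    /* Same assumed tiled layout, rewritten so the inner loop is a plain
       unit-stride loop the vectorizer can handle.  Assumes n is a multiple
       of B; a scalar remainder loop would be needed otherwise. */
    void scale_tiled_blocked(float *a, const float *x, float s, int n, int stride) {
        for (int block = 0; block < n / B; ++block) {
            float *ab = a + block * stride;
            const float *xb = x + block * stride;
            for (int off = 0; off < B; ++off)   /* contiguous, vectorizable */
                ab[off] = s * xb[off];
        }
    }

The inner loop now has unit stride and a compile-time trip count, which is the shape the loop vectorizer handles well regardless of whether the target registers are 128 or 512 bits wide.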