2015-08-16 23:21 GMT-07:00 David Majnemer <david.majnemer at gmail.com>:

> Because a solution which doesn't generalize is not a very powerful
> solution. What happens when somebody says that they want to use atomics +
> large aggregate loads and stores? Give them yet another, different answer?
> That would mean our earlier, less general approach was either a band-aid
> (bad) or the new answer requires a parallel code path in their frontend
> (worse).

It is expected that atomics/volatile work differently. That is the whole
point of them. A lot of optimizations in InstCombine plainly ignore
atomic/volatile loads/stores. That is expected.

>>>> The argument that targets are relying on InstCombine to mitigate IR
>>>> requiring legalization seems dubious to me. First, because both
>>>> aggregates and large scalars require legalization, so, if not ideal, the
>>>> proposed change does not make things any worse than they already are. In
>>>> fact, as far as legalization is concerned, these are pretty much the
>>>> same. It should also be noted that InstCombine is not guaranteed to run
>>>> before the target, so it seems like a bad idea to me to rely on it in
>>>> the backend.
>>>
>>> InstCombine is not guaranteed to run before IR hits the backend, but the
>>> result of legalizing the machinations of InstCombine's output during
>>> SelectionDAG is worse than generating illegal IR in the first place.
>>
>> That does not follow. InstCombine is not creating new things that require
>> legalization; it changes one thing that requires legalization into
>> another that a larger part of LLVM can understand.
>
> I'm afraid I don't understand what you are getting at here. InstCombine
> carefully avoids ptrtoint to weird types, truncs to weird types, etc. when
> creating new IR.

What I'm getting at is that it doesn't make the situation any worse.
In fact, it makes things actually better: you get something that does not
need legalization in some cases, whereas you always get something that
needs legalization otherwise.

Think about it. Aggregates require legalization. Large scalars require
legalization. Without the patch, you always need legalization. With the
patch, you sometimes need legalization, and optimization gets back into the
game. If your point is that less legalization is better, then the change is
a win. If you are willing to go with the multiple-load/store solution for
the non-volatile/atomic case, then even more so.

>>>> As for the big integral thing, I really don't care. I can change it to
>>>> create multiple loads/stores respecting the data layout; I have the
>>>> code for that and could adapt it for this PR without too much trouble.
>>>> If this is the only thing that is blocking this PR, then we can
>>>> proceed. But I'd like some notion that we are making progress. Would
>>>> you be willing to accept a solution based on creating a series of
>>>> loads/stores respecting the data layout?
>>>
>>> Splitting the memory operation into smaller operations is not semantics
>>> preserving from an IR-theoretic perspective. For example, splitting a
>>> volatile memory operation into several volatile memory operations is
>>> not OK. Same goes with atomics. Some targets provide atomic memory
>>> operations at the granularity of a cache line, and splitting at legal
>>> integer granularity would be observably different.
>>
>> That is off topic. The proposed patch explicitly gates for this.
>
> Then I guess we agree to disagree about what is "on topic". I think that
> our advice to frontend authors regarding larger-than-legal loads/stores
> should be uniform and not dependent on whether or not the operation was
> volatile.

Come on. It is expected that atomic/volatile operations are handled
differently. That is the whole point of atomic/volatile.
Bringing atomic/volatile up as an argument that something should not be
done in the general case sounds like backward rationalization.

>>> With the above in mind, I don't see it as unreasonable for frontends to
>>> generate IR that LLVM is comfortable with. We seem fine telling frontend
>>> authors that they should strive to avoid large aggregate memory
>>> operations in our performance tips guide
>>> <http://llvm.org/docs/Frontend/PerformanceTips.html#avoid-loads-and-stores-of-large-aggregate-type>.
>>> Implementation experience with Clang hasn't shown this to be
>>> particularly odious to follow, and none of the LLVM-side solutions seem
>>> satisfactory.
>>
>> Most frontends do not have clang's resources. Additionally, this tip is
>> not quite accurate. I'm not interested in large aggregate loads/stores at
>> this stage; I'm interested in ANY aggregate load/store. LLVM is just
>> unable to handle any of it in a way that makes sense. It could certainly
>> do better for small aggregates without too much trouble.
>
> I'm confused what you mean about "clang resources" here; you haven't made
> it clear what the burden is to your frontend. I'm not saying that there
> isn't such a burden, I just haven't seen it articulated and I have heard
> nothing similar from other folks using LLVM. What prevents you from
> performing field-at-a-time loads and stores or calls to the memcpy
> intrinsic?

clang has many developers behind it, some of them paid to work on it. That
is simply not the case for many others.

But to answer your questions:
 - Per-field loads/stores generate more loads/stores than necessary in many
   cases. These can't be aggregated back because of padding.
 - memcpy only works memory to memory. It is certainly usable in some
   cases, but it certainly does not cover all uses.

I'm willing to do the memcpy optimization in InstCombine (in fact, if
things had not degenerated into so much bikeshedding, that would already be
done).
-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150817/4a727fcc/attachment.html>
On 08/17/2015 12:13 AM, deadal nix via llvm-dev wrote:
> 2015-08-16 23:21 GMT-07:00 David Majnemer <david.majnemer at gmail.com>:
>>
>> Because a solution which doesn't generalize is not a very powerful
>> solution. What happens when somebody says that they want to use atomics
>> + large aggregate loads and stores? Give them yet another, different
>> answer? That would mean our earlier, less general approach was either a
>> band-aid (bad) or the new answer requires a parallel code path in their
>> frontend (worse).
>
> It is expected that atomics/volatile work differently. That is the whole
> point of them.
>
> A lot of optimizations in InstCombine plainly ignore atomic/volatile
> loads/stores. That is expected.

I agree with this particular point. If we limited the optimizer to
treating all loads and stores the same, we'd have a much weaker optimizer.

Treating atomic vs non-atomic FCAs differently from an optimization
standpoint seems potentially reasonable. I would not want to treat them
differently from a correctness/lowering-strategy standpoint (i.e. both the
input to and output from InstCombine need to trigger somewhat sane results
from the backend).

Philip
Hi,

> On Aug 17, 2015, at 12:13 AM, deadal nix via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
> 2015-08-16 23:21 GMT-07:00 David Majnemer <david.majnemer at gmail.com>:
>>
>> Because a solution which doesn't generalize is not a very powerful
>> solution. What happens when somebody says that they want to use atomics
>> + large aggregate loads and stores? Give them yet another, different
>> answer? That would mean our earlier, less general approach was either a
>> band-aid (bad) or the new answer requires a parallel code path in their
>> frontend (worse).

+1 with David's approach: making things incrementally better is fine *as
long as* the long-term direction is identified. Small incremental changes
that make things slightly better in the short term but drive us away from
the long-term direction are not good.

Don't get me wrong, I'm not saying that the current patch is not good,
just that it does not seem clear to me that the long-term direction has
been identified, which explains why some can be nervous about adding stuff
prematurely.
And I'm not for the status quo; while I can't judge it definitively
myself, I even bugged David last month to look at this revision and try to
identify what really is the long-term direction and how to make your (and
other) frontends' life easier.

> It is expected that atomics/volatile work differently. That is the whole
> point of them.
>
> A lot of optimizations in InstCombine plainly ignore atomic/volatile
> loads/stores. That is expected.
>
> The argument that targets are relying on InstCombine to mitigate IR
> requiring legalization seems dubious to me. First, because both
> aggregates and large scalars require legalization, so, if not ideal, the
> proposed change does not make things any worse than they already are. In
> fact, as far as legalization is concerned, these are pretty much the
> same.
> It should also be noted that InstCombine is not guaranteed to run before
> the target, so it seems like a bad idea to me to rely on it in the
> backend.
>
>> InstCombine is not guaranteed to run before IR hits the backend, but the
>> result of legalizing the machinations of InstCombine's output during
>> SelectionDAG is worse than generating illegal IR in the first place.
>
> That does not follow. InstCombine is not creating new things that require
> legalization; it changes one thing that requires legalization into
> another that a larger part of LLVM can understand.
>
>> I'm afraid I don't understand what you are getting at here. InstCombine
>> carefully avoids ptrtoint to weird types, truncs to weird types, etc.
>> when creating new IR.
>
> What I'm getting at is that it doesn't make the situation any worse. In
> fact, it makes things actually better: you get something that does not
> need legalization in some cases, whereas you always get something that
> needs legalization otherwise.
>
> Think about it. Aggregates require legalization. Large scalars require
> legalization. Without the patch, you always need legalization. With the
> patch, you sometimes need legalization, and optimization gets back into
> the game.
>
> If your point is that less legalization is better, then the change is a
> win. If you are willing to go with the multiple-load/store solution for
> the non-volatile/atomic case, then even more so.
>
> As for the big integral thing, I really don't care. I can change it to
> create multiple loads/stores respecting the data layout; I have the code
> for that and could adapt it for this PR without too much trouble. If this
> is the only thing that is blocking this PR, then we can proceed. But I'd
> like some notion that we are making progress. Would you be willing to
> accept a solution based on creating a series of loads/stores respecting
> the data layout?
>
>> Splitting the memory operation into smaller operations is not semantics
>> preserving from an IR-theoretic perspective.
>> For example, splitting a volatile memory operation into several volatile
>> memory operations is not OK. Same goes with atomics. Some targets
>> provide atomic memory operations at the granularity of a cache line, and
>> splitting at legal integer granularity would be observably different.
>
> That is off topic. The proposed patch explicitly gates for this.
>
>> Then I guess we agree to disagree about what is "on topic". I think that
>> our advice to frontend authors regarding larger-than-legal loads/stores
>> should be uniform and not dependent on whether or not the operation was
>> volatile.
>
> Come on. It is expected that atomic/volatile operations are handled
> differently. That is the whole point of atomic/volatile. Bringing
> atomic/volatile up as an argument that something should not be done in
> the general case sounds like backward rationalization.
>
>> With the above in mind, I don't see it as unreasonable for frontends to
>> generate IR that LLVM is comfortable with. We seem fine telling frontend
>> authors that they should strive to avoid large aggregate memory
>> operations in our performance tips guide
>> <http://llvm.org/docs/Frontend/PerformanceTips.html#avoid-loads-and-stores-of-large-aggregate-type>.
>> Implementation experience with Clang hasn't shown this to be
>> particularly odious to follow, and none of the LLVM-side solutions seem
>> satisfactory.
>
> Most frontends do not have clang's resources. Additionally, this tip is
> not quite accurate. I'm not interested in large aggregate loads/stores at
> this stage; I'm interested in ANY aggregate load/store. LLVM is just
> unable to handle any of it in a way that makes sense. It could certainly
> do better for small aggregates without too much trouble.
>
>> I'm confused what you mean about "clang resources" here; you haven't
>> made it clear what the burden is to your frontend.
>> I'm not saying that there isn't such a burden, I just haven't seen it
>> articulated and I have heard nothing similar from other folks using
>> LLVM. What prevents you from performing field-at-a-time loads and stores
>> or calls to the memcpy intrinsic?
>
> clang has many developers behind it, some of them paid to work on it.
> That is simply not the case for many others.
>
> But to answer your questions:
>  - Per-field loads/stores generate more loads/stores than necessary in
>    many cases. These can't be aggregated back because of padding.
>  - memcpy only works memory to memory. It is certainly usable in some
>    cases, but it certainly does not cover all uses.
>
> I'm willing to do the memcpy optimization in InstCombine (in fact, if
> things had not degenerated into so much bikeshedding, that would already
> be done).

Calling "bikeshedding" what other devs do to keep the quality of the
project high is unlikely to help your patch go through; it's probably
quite the opposite, actually.

--
Mehdi
2015-08-17 11:26 GMT-07:00 Mehdi Amini <mehdi.amini at apple.com>:

> Hi,
>
>> On Aug 17, 2015, at 12:13 AM, deadal nix via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>
>> 2015-08-16 23:21 GMT-07:00 David Majnemer <david.majnemer at gmail.com>:
>>>
>>> Because a solution which doesn't generalize is not a very powerful
>>> solution. What happens when somebody says that they want to use atomics
>>> + large aggregate loads and stores? Give them yet another, different
>>> answer? That would mean our earlier, less general approach was either a
>>> band-aid (bad) or the new answer requires a parallel code path in their
>>> frontend (worse).
>
> +1 with David's approach: making things incrementally better is fine *as
> long as* the long-term direction is identified. Small incremental changes
> that make things slightly better in the short term but drive us away from
> the long-term direction are not good.
>
> Don't get me wrong, I'm not saying that the current patch is not good,
> just that it does not seem clear to me that the long-term direction has
> been identified, which explains why some can be nervous about adding
> stuff prematurely.
> And I'm not for the status quo; while I can't judge it definitively
> myself, I even bugged David last month to look at this revision and try
> to identify what really is the long-term direction and how to make your
> (and other) frontends' life easier.

As long as there is something to be done. Concern has been raised for very
large aggregates (64K, 1MB), but there is no way good codegen can come out
of these anyway. I don't know of any machine that has 1MB of registers
available to tank the load. Even if we had a good way to handle it in
InstCombine, the backend would have no capability to generate something
nice for it anyway.

Most aggregates are small, and there is no good excuse for not handling
them just because someone could generate gigantic ones that won't map
nicely to the hardware anyway.
By that logic, SROA should not exist, as one could generate gigantic
aggregates as well (in fact, SROA fails pretty badly on large aggregates).

The second concern raised is about atomic/volatile, which needs to be
handled differently by the optimizer anyway, so it is mostly irrelevant
here.

>> clang has many developers behind it, some of them paid to work on it.
>> That is simply not the case for many others.
>>
>> But to answer your questions:
>>  - Per-field loads/stores generate more loads/stores than necessary in
>>    many cases. These can't be aggregated back because of padding.
>>  - memcpy only works memory to memory. It is certainly usable in some
>>    cases, but it certainly does not cover all uses.
>>
>> I'm willing to do the memcpy optimization in InstCombine (in fact, if
>> things had not degenerated into so much bikeshedding, that would already
>> be done).
>
> Calling "bikeshedding" what other devs do to keep the quality of the
> project high is unlikely to help your patch go through; it's probably
> quite the opposite, actually.

I understand the desire to keep quality high. That is not where the
problem is. The problem lies in discussing an actual proposal against
hypothetical perfect ones that do not exist.