Doerfert, Johannes Rudolf via llvm-dev
2019-Jan-22 19:29 UTC
[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
We would still know that. We can do exactly the same reasoning as we do now.

I think the important question is how different the code generated for either mode is, and whether we can hide (most of) the differences in the runtime.

If I understand you correctly, you say the data sharing code looks very different and the differences cannot be hidden, correct?

It would help me understand your point if you could give me a piece of OpenMP for which the data sharing in SPMD mode and "guarded" mode are as different as possible. I can compile it in both modes myself, so high-level OpenMP is fine (I will disable SPMD mode manually in the source if necessary).

Thanks,
Johannes

________________________________
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Alexey Bataev via llvm-dev <llvm-dev at lists.llvm.org>
Sent: Tuesday, January 22, 2019 13:10
To: Doerfert, Johannes Rudolf
Cc: Alexey Bataev; LLVM-Dev; Arpith Chacko Jacob; openmp-dev at lists.llvm.org; cfe-dev at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"

But we need to know the execution mode, SPMD or "guarded".

-------------
Best regards,
Alexey Bataev

22.01.2019 13:54, Doerfert, Johannes Rudolf wrote:

We could still do that in clang, couldn't we?

________________________________
From: Alexey Bataev <a.bataev at outlook.com>
Sent: Tuesday, January 22, 2019 12:52:42 PM
To: Doerfert, Johannes Rudolf; cfe-dev at lists.llvm.org
Cc: openmp-dev at lists.llvm.org; LLVM-Dev; Finkel, Hal J.; Alexey Bataev; Arpith Chacko Jacob
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

The globalization of local variables, for example. It must be implemented in the compiler, not in the runtime, to get good performance.
-------------
Best regards,
Alexey Bataev

22.01.2019 13:43, Doerfert, Johannes Rudolf wrote:

Could you elaborate on what you refer to wrt. data sharing? What do we currently do in the clang code generation that we could not effectively implement in the runtime, potentially with support of an llvm pass?

Thanks,
Johannes

________________________________
From: Alexey Bataev <a.bataev at outlook.com>
Sent: Tuesday, January 22, 2019 12:34:01 PM
To: Doerfert, Johannes Rudolf; cfe-dev at lists.llvm.org
Cc: openmp-dev at lists.llvm.org; LLVM-Dev; Finkel, Hal J.; Alexey Bataev; Arpith Chacko Jacob
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

-------------
Best regards,
Alexey Bataev

22.01.2019 13:17, Doerfert, Johannes Rudolf wrote:

> Where we are
> ------------
>
> Currently, when we generate OpenMP target offloading code for GPUs, we use sufficient syntactic criteria to decide between two execution modes:
>   1) SPMD -- All target threads (in an OpenMP team) run all the code.
>   2) "Guarded" -- The master thread (of an OpenMP team) runs the user code. If an OpenMP distribute region is encountered, thus if all threads (in the OpenMP team) are supposed to execute the region, the master wakes up the idling worker threads and points them to the correct piece of code for distributed execution.
>
> For a variety of reasons we (generally) prefer the first execution mode. However, depending on the code, that might not be valid, or we might just not know if it is in the Clang code generation phase.
>
> The implementation of the "guarded" execution mode follows roughly the state machine description in [1], though the implementation is different (more general) nowadays.
>
>
> What we want
> ------------
>
> Increase the amount of code executed in SPMD mode and the use of lightweight "guarding" schemes where appropriate.
>
>
> How we (could) get there
> ------------------------
>
> We propose the following two modifications in order:
>
> 1) Move the state machine logic into the OpenMP runtime library. That means in SPMD mode all device threads will start the execution of the user code, thus emerge from the runtime, while in guarded mode only the master will escape the runtime and the other threads will idle in their state machine code that is now just "hidden".
>
>    Why:
>    - The state machine code cannot be (reasonably) optimized anyway; moving it into the library shouldn't hurt runtime but might even improve compile time a little bit.
>    - The change should also simplify the Clang code generation, as we would generate structurally the same code for both execution modes and only the runtime library calls, or their arguments, would differ between them.
>    - The reason we should not "just start in SPMD mode" and "repair" it later is simple: this way we always have semantically correct and executable code.
>    - Finally, and most importantly, there is now only little difference (see above) between the two modes in the code generated by clang. If we later analyze the code trying to decide if we can use SPMD mode instead of guarded mode, the analysis and transformation become much simpler.

The last item is wrong, unfortunately. A lot of things in the codegen depend on the execution mode, e.g. correct support of data sharing. Of course, we can try to generalize the codegen and rely completely on the runtime, but the performance is going to be very poor.

We still need static analysis in the compiler. I agree that it is better to move this analysis to the backend, at least after the inlining, but at the moment it is not possible. We need support for late outlining, which will allow us to implement better detection of SPMD constructs and improve performance.

> 2) Implement a middle-end LLVM-IR pass that detects the guarded mode, e.g., through the runtime library calls used, and that tries to convert it into the SPMD mode, potentially by introducing lightweight guards in the process.
>
>    Why:
>    - After the inliner, and the canonicalizations, we have a clearer picture of the code that is actually executed in the target region and all the side effects it contains. Thus, we can make an educated decision on the required amount of guards that prevent unwanted side effects from happening after a move to SPMD mode.
>    - At this point we can more easily introduce different schemes to avoid side effects by threads that were not supposed to run. We can decide if a state machine is needed, conditionals should be employed, masked instructions are appropriate, or "dummy" local storage can be used to hide the side effect from the outside world.
>
>
> None of this has been implemented yet, but we plan to start in the immediate future. Any comments, ideas, or criticism are welcome!
>
>
> Cheers,
>   Johannes
>
>
> P.S. [2-4] provide further information on implementation and features.
>
> [1] https://ieeexplore.ieee.org/document/7069297
> [2] https://dl.acm.org/citation.cfm?id=2833161
> [3] https://dl.acm.org/citation.cfm?id=3018870
> [4] https://dl.acm.org/citation.cfm?id=3148189
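As a rough illustration of the "lightweight guards" from step 2 of the quoted RFC, the following host-side sketch may help. It is not taken from the thread: the team is simulated by a sequential loop over thread ids, and the names `RegionResult` and `run_region_spmd` are hypothetical, not part of Clang or any OpenMP runtime. It only shows why a side effect that the master alone executed in guarded mode needs a guard once every thread runs the region.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model: a "team" is simulated by looping over thread
// ids. Nothing here is an actual LLVM or OpenMP runtime interface.
struct RegionResult {
  int side_effect_count = 0; // how often the sequential side effect ran
  std::vector<int> data;     // output of the work-shared loop
};

// Region as it would behave in "guarded" mode:
//   counter++;                    // sequential part, master only
//   parallel for: data[i] = 2*i   // executed by all threads
// After a move to SPMD mode every thread executes the whole region, so the
// side effect is wrapped in a lightweight guard instead of a state machine.
RegionResult run_region_spmd(int team_size, int n, bool use_guard) {
  assert(team_size > 0 && n >= 0);
  RegionResult r;
  r.data.assign(static_cast<std::size_t>(n), 0);
  for (int tid = 0; tid < team_size; ++tid) { // simulate each thread in turn
    // Guard: only thread 0 performs the externally visible side effect.
    if (!use_guard || tid == 0)
      ++r.side_effect_count;
    // (On a GPU, a team-wide barrier would follow the guarded code here.)
    // Work-shared loop: thread tid handles iterations tid, tid+team_size, ...
    for (int i = tid; i < n; i += team_size)
      r.data[static_cast<std::size_t>(i)] = 2 * i;
  }
  return r;
}
```

With `use_guard` set, the side effect happens once per team, as in guarded mode; without it, it happens once per thread, which is the kind of unwanted behavior the proposed pass would have to rule out or guard against.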
Alexey Bataev via llvm-dev
2019-Jan-22 19:46 UTC
[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
No, we don't. We need to perform different kinds of analysis for SPMD mode constructs and non-SPMD mode constructs. For SPMD mode, we need to globalize only reduction/lastprivate variables. For non-SPMD mode, we need to globalize all the private/local variables that may escape their declaration context in the construct.

-------------
Best regards,
Alexey Bataev
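To make the globalization point above more concrete, here is a small OpenMP sketch of the two situations. It was not posted in the thread; the function names `spmd_sum` and `generic_scale` are made up for illustration, and whether a given compiler version actually selects SPMD or non-SPMD mode for each function is an assumption. The first function uses a combined construct of the kind that can be recognized syntactically as SPMD, where only the reduction variable needs special handling. In the second, the local variable `scale` is declared in the sequential part of the target region and then read inside a nested parallel region, so it escapes its declaration context.

```cpp
#include <cassert>

// Combined construct: every thread runs the loop body, and the only variable
// that has to be shared across threads is the reduction variable `sum`.
long spmd_sum(const int *a, int n) {
  assert(n >= 0);
  long sum = 0;
#pragma omp target teams distribute parallel for reduction(+ : sum) map(to : a[0 : n]) map(tofrom : sum)
  for (int i = 0; i < n; ++i)
    sum += a[i];
  return sum;
}

// Target region with a sequential part followed by a parallel region.
void generic_scale(int *a, int n, int factor) {
  assert(n >= 0);
#pragma omp target map(tofrom : a[0 : n])
  {
    // Sequential part: in non-SPMD mode only the master thread runs this.
    int scale = factor + 1; // local to the thread that declares it ...
#pragma omp parallel for
    for (int i = 0; i < n; ++i)
      a[i] *= scale; // ... but read by every thread of the parallel region
  }
}
```

Without OpenMP enabled the pragmas are ignored and both functions simply run sequentially on the host, which is enough to check the results; the difference discussed in the thread only shows up in the device code generated for each region.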
Doerfert, Johannes Rudolf via llvm-dev
2019-Jan-22 23:49 UTC
[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
After an IRC discussion, I think Alexey and I are pretty much in agreement (on the general feasibility at least).

I will try to sketch the proposed idea again below, as the initial RFC was simply not descriptive enough. After that, I briefly summarize how I see these changes being developed and committed so that we
  - never have any regressions,
  - can make an educated decision before removing any existing code.


What we want to do:

The intermediate goal is that the code generated by clang for the SPMD and non-SPMD (earlier denoted as "guarded") case is conceptually/structurally very similar. The current non-SPMD code is, however, a state machine generated into the user code module. This state machine is very hard to analyze and optimize. If the code looked like the SPMD code but *behaved the same way it does now*, we could "easily" switch from the non-SPMD to the SPMD version after a (late) analysis determined legality.

To make the code look the same but behave differently, we propose to hide the semantic difference in the runtime library calls. That is, the runtime calls emitted in the two modes are (slightly) different, or there is a flag which indicates the (initial) mode. If that mode is SPMD, the runtime behavior does not change compared to the way it is now. If that mode is non-SPMD, the runtime would separate the master and worker threads, as we do it now in the user code module, and keep the workers in an internal state machine waiting for the master to provide them with work. Only the master would return from the runtime call, and the mechanism to distribute work to the worker threads would (for now) stay the same.


Preliminary implementation (and integration) steps:

1) Design and implement the necessary runtime extensions and determine feasibility.
2) Allow Clang codegen to use the new runtime extensions if explicitly chosen by the user.
2b) Performance comparison of the unoptimized new code path vs. the original code path on test cases and real use cases.
3) Implement the middle-end pass to analyze and optimize the code using the runtime extensions.
3b) Performance comparison of the optimized new code path vs. the original code path on real use cases.
4) If no conceptual problem was found and 2b)/3b) determined that the new code path is superior, switch to the new code path by default.
5) If no regressions/complaints are reported after a grace period, remove the old code path from the clang front-end.


Again, this is an early design RFC for which I welcome any feedback!

Thanks,
  Johannes

________________________________
From: Doerfert, Johannes Rudolf
Sent: Tuesday, January 22, 2019 1:50:51 PM
To: Alexey Bataev
Cc: cfe-dev at lists.llvm.org; openmp-dev at lists.llvm.org
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

What do you refer to with: "No, we don't"?

Again, I do not propose to remove the SPMD "detection" in Clang. We will still identify SPMD mode based on the syntactic criteria we have now. The Clang analysis is also not affected. Thus, we will globalize/localize the same variables as we do now. I don't see why this should be any different.
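The idea of hiding the mode difference behind runtime calls, as described in the message above, can be pictured with a small sequential model. This is only a sketch under stated assumptions: the names `ExecMode`, `TeamRuntime`, `kernel_init`, `parallel`, and `run_kernel` are hypothetical and do not correspond to the actual device runtime interface, and the "threads" are simulated one after another on the host rather than running concurrently. What it tries to show is that the user code is structurally identical in both modes and only the mode flag passed to the runtime differs.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

enum class ExecMode { SPMD, NonSPMD };

// Hypothetical stand-in for the device runtime; not a real interface.
class TeamRuntime {
public:
  explicit TeamRuntime(ExecMode mode) : mode_(mode) {}

  // Returns true if thread `tid` emerges from the runtime to run user code.
  // SPMD: every thread emerges. Non-SPMD: only the master (thread 0) does;
  // workers are recorded as parked in the runtime-internal state machine.
  bool kernel_init(int tid) {
    if (mode_ == ExecMode::SPMD || tid == 0)
      return true;
    parked_workers_.push_back(tid);
    return false;
  }

  // Runs one parallel region. In SPMD mode the calling thread just does its
  // own share. In non-SPMD mode the master additionally hands the work to
  // the parked workers (modeled here by calling it on their behalf).
  void parallel(int tid, const std::function<void(int)> &work) {
    work(tid);
    if (mode_ == ExecMode::NonSPMD)
      for (int worker : parked_workers_)
        work(worker);
  }

private:
  ExecMode mode_;
  std::vector<int> parked_workers_;
};

// Simulates one kernel launch and returns how many threads ran user code.
int run_kernel(ExecMode mode, int team_size, std::vector<int> &out) {
  assert(team_size > 0);
  out.assign(static_cast<std::size_t>(team_size), 0);
  TeamRuntime rt(mode);
  std::vector<int> active;
  for (int tid = 0; tid < team_size; ++tid) // every thread enters the runtime
    if (rt.kernel_init(tid))
      active.push_back(tid);
  for (int tid : active) {
    // "User code": the same for both modes.
    rt.parallel(tid, [&out](int id) {
      out[static_cast<std::size_t>(id)] = id + 1;
    });
  }
  return static_cast<int>(active.size());
}
```

In this model both modes produce the same `out` contents; the observable difference is how many threads execute the user code (all of them in SPMD mode, one in non-SPMD mode), which is the property a later pass could exploit when deciding whether a switch to SPMD mode is legal.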
Guray Ozen via llvm-dev
2019-Jan-23 09:14 UTC
[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
We are working on OpenMP target offloading for GPUs in Flang, and adopting the same code generation strategy. The proposal is affecting us. It would be nice to know more details about the proposal. So we can prepare ourselves to adapt flang (if everything goes on the way). Have you find and a solution for data sharing? How are you going to manage data sharing for SPMD and non-SPMD? From: cfe-dev <cfe-dev-bounces at lists.llvm.org> On Behalf Of Doerfert, Johannes Rudolf via cfe-dev Sent: Wednesday, January 23, 2019 12:50 AM To: Alexey Bataev <a.bataev at outlook.com> Cc: llvm-dev <llvm-dev at lists.llvm.org>; cfe-dev at lists.llvm.org; openmp-dev at lists.llvm.org Subject: Re: [cfe-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation" After an IRC discussion, I think Alexey and I are pretty much in agreement (on the general feasibility at least). I try to sketch the proposed idea again below, as the initial RFC was simply not descriptive enough. After that, I shortly summarize how I see these changes being developed and committed so that we - never have any regressions, - can make an educated decision before removing any existing code. What we want to do: The intermediate goal is that the code generated by clang for the SPMD and non-SPMD (earlier denoted as "guarded") case is conceptually/structurally very similar. The current non-SPMD code is, however, a state machine generated into the user code module. This state machine is very hard to analyze and optimize. If the code would look as the SPMD code but *behave the same way it does now*, we could "easily" switch from non-SPMD to SPMD version after a (late) analysis determined legality. To make the code look the same but behave differently, we propose to hide the semantic difference in the runtime library calls. That is, the runtime calls emitted in the two modes are (slightly) different, or there is a flag which indicates the (initial) mode. 
If that mode is SPMD, the runtime behavior does not change compared to the way it is now. If that mode is non-SPMD, the runtime would separate the master and worker threads, as we do it now in the user code module, and keep the workers in an internal state machine waiting for the master to provide them with work. Only the master would return from the runtime call and the mechanism to distribute work to the worker threads would (for now) stay the same. Preliminary implementation (and integration) steps: 1) Design and implement the necessary runtime extensions and determine feasibility. 2) Allow to Clang codegen to use the new runtime extensions if explicitly chosen by the user. 2b) Performance comparison unoptimized new code path vs. original code path on test cases and real use cases. 3) Implement the middle-end pass to analyze and optimize the code using the runtime extensions. 3b) Performance comparison optimized new code path vs. original code path on real use cases. 4) If no conceptual problem was found and 2b)/3b) determined that the new code path is superior, switch to the new code path by default. 5) If no regressions/complaints are reported after a grace period, remove the old code path from the clang front-end. Again, this is an early design RFC for which I welcome any feedback! Thanks, Johannes ________________________________ From: Doerfert, Johannes Rudolf Sent: Tuesday, January 22, 2019 1:50:51 PM To: Alexey Bataev Cc: cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>; openmp-dev at lists.llvm.org<mailto:openmp-dev at lists.llvm.org> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation" What do you refer to with: "No, we don't". Again, I do not propose to remove the SPMD "detection" in Clang. We will still identify SPMD mode based on the syntactic criteria we have now. The Clang analysis is also not affected. Thus, we will globalize/localize the same variables as we do now. I don't see why this should be any different. 
________________________________ From: llvm-dev <llvm-dev-bounces at lists.llvm.org<mailto:llvm-dev-bounces at lists.llvm.org>> on behalf of Alexey Bataev via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> Sent: Tuesday, January 22, 2019 1:46:39 PM To: Doerfert, Johannes Rudolf Cc: llvm-dev; cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>; openmp-dev at lists.llvm.org<mailto:openmp-dev at lists.llvm.org> Subject: Re: [llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation" No, we don't. We need to perform the different kind of the analysis for SPMD mode constructs and Non-SPMD. For SPMD mode we need to globalize only reduction/lastprivate variables. For Non-SPMD mode, we need to globalize all the private/local variables, that may escape their declaration context in the construct. ------------- Best regards, Alexey Bataev 22.01.2019 14:29, Doerfert, Johannes Rudolf пишет: We would still know that. We can do exactly the same reasoning as we do now. I think the important question is, how different is the code generated for either mode and can we hide (most of) the differences in the runtime. If I understand you correctly, you say the data sharing code looks very different and the differences cannot be hidden, correct? It would be helpful for me to understand your point if you could give me a piece of OpenMP for which the data sharing in SPMD mode and "guarded" mode are as different as possible. I can compile it in both modes myself so high-level OpenMP is fine (I will disable SPMD mode manually in the source if necessary). 
Thanks,
  Johannes

________________________________
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Alexey Bataev via llvm-dev <llvm-dev at lists.llvm.org>
Sent: Tuesday, January 22, 2019 13:10
To: Doerfert, Johannes Rudolf
Cc: Alexey Bataev; LLVM-Dev; Arpith Chacko Jacob; openmp-dev at lists.llvm.org; cfe-dev at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"

But we need to know the execution mode, SPMD or "guarded"

-------------
Best regards,
Alexey Bataev

22.01.2019 13:54, Doerfert, Johannes Rudolf wrote:

We could still do that in clang, couldn't we?

________________________________
From: Alexey Bataev <a.bataev at outlook.com>
Sent: Tuesday, January 22, 2019 12:52:42 PM
To: Doerfert, Johannes Rudolf; cfe-dev at lists.llvm.org
Cc: openmp-dev at lists.llvm.org; LLVM-Dev; Finkel, Hal J.; Alexey Bataev; Arpith Chacko Jacob
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

The globalization for the local variables, for example. It must be implemented in the compiler to get the good performance, not in the runtime.

-------------
Best regards,
Alexey Bataev

22.01.2019 13:43, Doerfert, Johannes Rudolf wrote:

Could you elaborate on what you refer to wrt. data sharing? What do we currently do in the Clang code generation that we could not effectively implement in the runtime, potentially with the support of an LLVM pass?
Thanks,
  Johannes

________________________________
From: Alexey Bataev <a.bataev at outlook.com>
Sent: Tuesday, January 22, 2019 12:34:01 PM
To: Doerfert, Johannes Rudolf; cfe-dev at lists.llvm.org
Cc: openmp-dev at lists.llvm.org; LLVM-Dev; Finkel, Hal J.; Alexey Bataev; Arpith Chacko Jacob
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

-------------
Best regards,
Alexey Bataev

22.01.2019 13:17, Doerfert, Johannes Rudolf wrote:

Where we are
------------

Currently, when we generate OpenMP target offloading code for GPUs, we use sufficient syntactic criteria to decide between two execution modes:

1) SPMD -- All target threads (in an OpenMP team) run all the code.
2) "Guarded" -- The master thread (of an OpenMP team) runs the user code. If an OpenMP distribute region is encountered, thus if all threads (in the OpenMP team) are supposed to execute the region, the master wakes up the idling worker threads and points them to the correct piece of code for distributed execution.

For a variety of reasons we (generally) prefer the first execution mode. However, depending on the code, that might not be valid, or we might just not know if it is in the Clang code generation phase.

The implementation of the "guarded" execution mode follows roughly the state machine description in [1], though the implementation is different (more general) nowadays.

What we want
------------

Increase the amount of code executed in SPMD mode and the use of lightweight "guarding" schemes where appropriate.

How we (could) get there
------------------------

We propose the following two modifications in order:

1) Move the state machine logic into the OpenMP runtime library.
That means in SPMD mode all device threads will start the execution of the user code, thus emerge from the runtime, while in guarded mode only the master will escape the runtime and the other threads will idle in their state machine code that is now just "hidden".

Why:
- The state machine code cannot be (reasonably) optimized anyway; moving it into the library shouldn't hurt runtime and might even improve compile time a little bit.
- The change should also simplify the Clang code generation, as we would generate structurally the same code for both execution modes; only the runtime library calls, or their arguments, would differ between them.
- The reason we should not "just start in SPMD mode" and "repair" it later is simple: this way we always have semantically correct and executable code.
- Finally, and most importantly, there is now only little difference (see above) between the two modes in the code generated by clang. If we later analyze the code trying to decide if we can use SPMD mode instead of guarded mode, the analysis and transformation become much simpler.

[Alexey Bataev, inline reply:] The last item is wrong, unfortunately. A lot of things in the codegen depend on the execution mode, e.g. correct support of data sharing. Of course, we can try to generalize the codegen and rely completely on the runtime, but the performance is going to be very poor. We still need static analysis in the compiler. I agree that it is better to move this analysis to the backend, at least after inlining, but at the moment that is not possible. We need support for late outlining, which will allow us to implement better detection of SPMD constructs and improve performance.

2) Implement a middle-end LLVM-IR pass that detects the guarded mode, e.g., through the runtime library calls used, and that tries to convert it into the SPMD mode, potentially by introducing lightweight guards in the process.
Why:
- After the inliner, and the canonicalizations, we have a clearer picture of the code that is actually executed in the target region and all the side effects it contains. Thus, we can make an educated decision on the required amount of guards that prevent unwanted side effects from happening after a move to SPMD mode.
- At this point we can more easily introduce different schemes to avoid side effects by threads that were not supposed to run. We can decide if a state machine is needed, conditionals should be employed, masked instructions are appropriate, or "dummy" local storage can be used to hide the side effect from the outside world.

None of this has been implemented yet, but we plan to start in the immediate future. Any comments, ideas, or criticism are welcome!

Cheers,
  Johannes

P.S. [2-4] provide further information on implementation and features.

[1] https://ieeexplore.ieee.org/document/7069297
[2] https://dl.acm.org/citation.cfm?id=2833161
[3] https://dl.acm.org/citation.cfm?id=3018870
[4] https://dl.acm.org/citation.cfm?id=3148189
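To make the guarding schemes from modification 2) a little more concrete, here is a toy model of a region after a hypothetical conversion to SPMD mode. It is a sketch under assumptions, not the proposed pass: the function `run_spmdized`, the scheme names, and the choice of thread 0 as the only thread allowed to perform the side effect are all made up for illustration, and the threads are simulated one after another. Only the "conditional" and "dummy local storage" schemes from the list above are modeled.

```python
def run_spmdized(num_threads, scheme):
    """Run a toy region in which every thread executes all the code.

    The region has one sequential statement with a visible side effect
    (incrementing a shared counter) followed by a parallel part.
    Returns (final shared counter, per-thread results).
    """
    shared = {"counter": 0}
    results = []
    for thread_id in range(num_threads):
        dummy = {"counter": 0}  # thread-private scratch storage
        if scheme == "none":
            target = shared  # unguarded: every thread repeats the effect
        elif scheme == "conditional":
            target = shared if thread_id == 0 else None  # skip the store
        elif scheme == "dummy":
            target = shared if thread_id == 0 else dummy  # redirect the store
        else:
            raise ValueError("unknown scheme: " + scheme)
        if target is not None:
            target["counter"] += 1
        # Parallel part: all threads were meant to run this anyway.
        results.append(thread_id * thread_id)
    return shared["counter"], results


if __name__ == "__main__":
    for scheme in ("none", "conditional", "dummy"):
        print(scheme, run_spmdized(4, scheme))
```

Without a guard the side effect is repeated once per thread. Both guarded variants keep it to a single occurrence while every thread still executes the parallel part.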