Hal Finkel via llvm-dev
2016-Mar-10 17:54 UTC
[llvm-dev] RFC: Proposing an LLVM subproject for parallelism runtime and support libraries
----- Original Message -----
> From: "Arpith C Jacob" <acjacob at us.ibm.com>
> To: llvm-dev at lists.llvm.org
> Cc: jhen at google.com, "Hal J. Finkel" <hfinkel at anl.gov>
> Sent: Thursday, March 10, 2016 10:38:46 AM
> Subject: Re: [llvm-dev] RFC: Proposing an LLVM subproject for parallelism runtime and support libraries
>
> Hi Jason,
>
> I'm trying to better understand your StreamExecutor proposal and how
> it relates to other parallel programming models and runtimes such as
> RAJA [1], KOKKOS [2], or some hypothetical SPARK C++ API.
>
> Please correct me if I'm misunderstanding your proposal, but I think
> the essence of what you want from the compiler is type safety for
> accelerator kernel launches, i.e., you would like the frontend to
> parse, check, and codegen for the construct:
>
>     add_mystery_value<<<1, 1>>>(kernel_input_argument, *result.ptr());
>
> Is that a correct understanding?

Without answering your question, I'll point out that, as I understand it, StreamExecutor completely replaces the CUDA userspace library runtime components and talks directly to the drivers. Jason, please correct me if I'm wrong.

-Hal

> Thanks,
> Arpith
>
> [1] http://computation.llnl.gov/projects/raja-managing-application-portability-next-generation-platforms
> [2] https://github.com/kokkos/kokkos

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
Jason Henline via llvm-dev
2016-Mar-10 19:14 UTC
[llvm-dev] RFC: Proposing an LLVM subproject for parallelism runtime and support libraries
In response to the latest questions from C Bergström:

> Is there "CUDA" or OpenCL hidden in the headers, and is that where the
> actual offload portion is happening? Does StreamExecutor wrap public
> or private CUDA/OpenCL runtimes?

Yes, StreamExecutor is a wrapper around the public OpenCL and CUDA userspace driver libraries. The actual offloading is achieved by making calls to those libraries.

> Is there anything stopping you from exposing "wrapper" interfaces
> which are the same as the NVIDIA runtime?

There is nothing stopping us from doing that. The reason we haven't done so to this point is that we felt the current StreamExecutor API was nicer to work with.

> Where is the StreamExecutor runtime source now?

It is currently housed in Google's internal code repository, where it is being used in production code. There is also a local copy in the open-source TensorFlow project (https://www.tensorflow.org) which we want to replace with a dependency on a separate open-source StreamExecutor project.

> /*
> I have said this before and I really get uncomfortable with the
> generic term "CUDA" in clang, until someone from NVIDIA (lawyers) puts
> something in writing. CUDA is an NV trademark, and the clang/llvm
> project can't claim to be "CUDA" and needs to make a distinction.
> Informally this is all friendly now, but I do hope it's officially
> clarified at some point. Maybe it's as simple as saying "CUDA
> compatible" - I don't know..
> */

Good point! I will try to keep that in mind.

> I think having a nice model that lowers cleanly (high performance) to
> at least some targets is (should be) very important. From my
> experience - if you have complex or perfectly nested loops - how would
> you take this sort of algorithm and map it to StreamExecutor? Getting
> reductions right or wrong can also have a performance impact. If your
> goal is to create a "one wrapper rules them all" approach, I'm hoping
> you can find a common way to also make it easier for basic needs to be
> expressed to the underlying target.
> (In a target agnostic way)

I'm not quite sure how to answer this in all generality, but here are some thoughts. Any complex or nested looping control flow that happens on data stored in device memory can be handled completely within the kernel definition, and should be independent of StreamExecutor. If the complexity arises instead from coordinating data transfers to device memory with kernel launches, then StreamExecutor proposes to model those dependencies as "streams" where one operation can be forced to wait on another (much in the way CUDA streams work). It would be possible to create new "canned" operations to perform common operations like reductions where the data won't all fit on the device at once, but those canned operations would probably not be optimal for all platforms, and in those cases the user might need to roll their own.

> Microsoft did a really nice job of documenting C++AMP. Does Google
> have a bunch of example codes which show how StreamExecutor can be
> used to implement various algorithms?

We don't currently have any simplified public examples, but I agree that would be something useful to have. I may write up a few in the coming weeks.

> Does clang/llvm accept anything, or is there some metric for generally
> deciding what should get a sub-project and what is just too early?

I'm a newcomer to the community myself, so I'll leave this to others to give a better answer than I could.

> Does Google have a plan to engage and bring other stakeholders into
> supporting this?

We see this unified model as a benefit to all accelerator platforms because we think it will make it easier for programmers to use their systems. We plan to propose this model to these vendors and see if we can get them interested in providing code or advertising this model as a way to program their devices.

> I hope all my questions are viewed as positive and meant to be
> constructive.

Absolutely.
I feel that your input has been very constructive, and I appreciate you helping us think through this design.

On Thu, Mar 10, 2016 at 9:54 AM Hal Finkel <hfinkel at anl.gov> wrote:
> [quoted message elided]
Jason Henline via llvm-dev
2016-Mar-10 19:19 UTC
[llvm-dev] RFC: Proposing an LLVM subproject for parallelism runtime and support libraries
Arpith,

> Please correct me if I'm misunderstanding your proposal, but I think
> the essence of what you want from the compiler is type safety for
> accelerator kernel launches, i.e., you would like the frontend to
> parse, check, and codegen for the construct:
>
>     add_mystery_value<<<1, 1>>>(kernel_input_argument, *result.ptr());

Yes, you are correct that this is one of the constructs we want to support. Also, just as Hal said, we are interested in replacing all the functions of the CUDA userspace runtime library. These include operations such as allocating device memory, copying data to and from the device, stream and event management, etc.

On Thu, Mar 10, 2016 at 11:14 AM Jason Henline <jhen at google.com> wrote:
> [quoted message elided]