thr3ads.net - llvm dev - [LLVMdev] Supporting heterogeneous computing in llvm. [Jun 2015]

C Bergström

2015-Jun-06 20:09 UTC

[LLVMdev] Supporting heterogeneous computing in llvm.

On Sun, Jun 7, 2015 at 2:52 AM, Eric Christopher <echristo at gmail.com>
wrote:
>
> On Sat, Jun 6, 2015 at 12:43 PM C Bergström <cbergstrom at pathscale.com>
> wrote:
>>
>> On Sun, Jun 7, 2015 at 2:34 AM, Eric Christopher <echristo at gmail.com>
>> wrote:
>> >
>> > On Sat, Jun 6, 2015 at 12:31 PM C Bergström <cbergstrom at pathscale.com>
>> > wrote:
>> >>
>> >> On Sun, Jun 7, 2015 at 2:22 AM, Eric Christopher <echristo at gmail.com>
>> >> wrote:
>> >> >
>> >> > On Sat, Jun 6, 2015 at 5:02 AM C Bergström <cbergstrom at pathscale.com>
>> >> > wrote:
>> >> >>
>> >> >> On Sat, Jun 6, 2015 at 6:24 PM, Christos Margiolas
>> >> >> <chrmargiolas at gmail.com> wrote:
>> >> >> > Hello,
>> >> >> >
>> >> >> > Thank you a lot for the feedback. I believe that the heterogeneous
>> >> >> > engine should be strongly connected with parallelization and
>> >> >> > vectorization efforts. Most of the accelerators are parallel
>> >> >> > architectures where having efficient parallelization and
>> >> >> > vectorization can be critical for performance.
>> >> >> >
>> >> >> > I am interested in these efforts and I hope that my code can help
>> >> >> > you manage the offloading operations. Your LLVM instruction set
>> >> >> > extensions may require some changes in the analysis code but I
>> >> >> > think it is going to be straightforward.
>> >> >> >
>> >> >> > I am planning to push my code on Phabricator in the next few days.
>> >> >>
>> >> >> If you're doing the extracting at the loop and llvm ir level - why
>> >> >> would you need to modify the IR? Wouldn't the target level lowering
>> >> >> happen later?
>> >> >>
>> >> >> How are you actually determining to offload? Is this tied to
>> >> >> directives or using heuristics+some set of restrictions?
>> >> >>
>> >> >> Lastly, are you handling 2 targets in the same module or end up
>> >> >> emitting 2 modules and dealing with recombining things later..
>> >> >>
>> >> >
>> >> > It's not currently possible to do this using the current structure
>> >> > without some significant and, honestly, icky patches.
>> >>
>> >> What's not possible? I agree some of our local patches and design may
>> >> not make it upstream as-is, but we are offloading to 2+ targets using
>> >> llvm ir *today*.
>> >>
>> >
>> > I'm not sure how much more clear I can be. It's not possible, in the
>> > same module, to handle multiple targets at the same time.
>> >
>> >>
>> >> IMHO - you must (re)solve the problem about handling multiple targets
>> >> concurrently. That means 2 targets in a single Module or 2 Modules
>> >> basically glued one after the other.
>> >
>> >
>> > Patches welcome.
>>
>> While I appreciate your taste in music - Canned (troll) replies are
>> typically a waste of time..
>
>
> This is uncalled for and unacceptable. I've done an immense amount of work
> so that we can support different subtargets in the same module and get
> better LTO and target features. If you have a feature above and beyond what
> I've been able to do (and you say you do) then a request for patches is more
> than acceptable as a response. I've yet to see any work from you and a lot
> of talk about what other people should do.

Umm.. don't get your feathers in a ruffle - you provided *zero*
content and I was just saying it wasn't impossible. To pop back all
huffy is just funny.

Anyway, to bring this conversation back to something technical instead
of just stupid comments.. I'd agree that flipping targets back and
forth (intermixed) in the same Module *is* probably a substantial
amount of work. If the optimization passes worked at a PU (program
unit) aka function level it wouldn't be.

Why can't you append 1 Module after another and switch?

As you point out whole program analysis/optimization will face a
similar problem - same question as above.
---------------------
Currently - (I don't know about DSP - TI/Qualcomm), but most people in
the industry are using custom runtimes to parse the GPU code and
load/execute. It would be great if the linker/loader actually had
better support for this built-in.

I don't know the exact capabilities of gnu/sun linker/loader, but
something along the lines of mangling the function to also include
target details

so compiler would emit multiple mangled versions of foo() and
linker/loader could pick the most optimized.

Something like this
nvc0_foo
avx2_foo
avx512_foo
(Also I'd agree that the above would be quite hard)
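
The selection step described above can be sketched in plain C. This is only an illustration under stated assumptions: the capability bits, the `generic_foo` fallback, the `pick_foo` helper and the preference order are all hypothetical, and the per-target functions are host stand-ins rather than real GPU or AVX builds. Only the names `nvc0_foo`, `avx2_foo` and `avx512_foo` come from the message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capability bits; a real loader would query the system. */
enum { CAP_AVX2 = 1 << 0, CAP_AVX512 = 1 << 1, CAP_NVC0 = 1 << 2 };

typedef int (*foo_fn)(int);

/* Stand-ins for the per-target builds of foo(). Real variants would
 * differ in generated code, not in the result they compute. */
static int generic_foo(int x) { return x * 2; }
static int avx2_foo(int x)    { return x * 2; }
static int avx512_foo(int x)  { return x * 2; }
static int nvc0_foo(int x)    { return x * 2; }

struct variant {
    const char *name;   /* target-qualified symbol name */
    unsigned needs;     /* capability bits this variant requires */
    foo_fn fn;
};

/* Ordered from most to least preferred (an arbitrary choice here).
 * The last entry requires nothing, so a match always exists. */
static const struct variant foo_variants[] = {
    { "nvc0_foo",    CAP_NVC0,   nvc0_foo    },
    { "avx512_foo",  CAP_AVX512, avx512_foo  },
    { "avx2_foo",    CAP_AVX2,   avx2_foo    },
    { "generic_foo", 0,          generic_foo },
};

/* Return the first variant whose requirements are all available. */
const struct variant *pick_foo(unsigned available) {
    size_t n = sizeof foo_variants / sizeof foo_variants[0];
    for (size_t i = 0; i < n; i++)
        if ((foo_variants[i].needs & available) == foo_variants[i].needs)
            return &foo_variants[i];
    /* Not reached while the table ends with a zero-requirement entry. */
    assert(!"foo_variants must end with a fallback");
    return NULL;
}
```

A caller would do something like `pick_foo(CAP_AVX2)->fn(21)`. A production scheme would also need the health and busy checks for accelerators, which this sketch leaves out.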

Eric Christopher

2015-Jun-06 20:20 UTC

[LLVMdev] Supporting heterogeneous computing in llvm.

On Sat, Jun 6, 2015 at 1:09 PM C Bergström <cbergstrom at pathscale.com>
wrote:
>
> Umm.. don't get your feathers in a ruffle - you provided *zero*
> content and I was just saying it wasn't impossible. To pop back all
> huffy is just funny.
>
I can say the same and calling my post trolling was unacceptable.

> Anyway, to bring this conversation back to something technical instead
> of just stupid comments.. I'd agree that flipping targets back and
> forth (intermixed) in the same Module *is* probably a substantial
> amount of work. If the optimization passes worked at a PU (program
> unit) aka function level it wouldn't be.
>
It's just another level of indirection essentially - and a lot of work.
It's much easier to do what's being proposed and outline work into another
module. To do what you've said (and I've looked at) is basically turning
each function into its own little module - a la what the ORC JIT does with
per-function compilation.

> Why can't you append 1 Module after another and switch?
>
This is, effectively, two modules and it'll behave the same. The reasons
are data transfer etc for module level attributes, data layout, etc. We've
still got some lingering issues at the function level let alone at the
module level with side data taking over. Akira and I are working on them as
we can.

>
> As you point out whole program analysis/optimization will face a
> similar problem - same question as above.
> ---------------------
> Currently - (I don't know about DSP - TI/Qualcomm), but most people in
> the industry are using custom runtimes to parse the GPU code and
> load/execute. It would be great if the linker/loader actually had
> better support for this built-in.
>
> I don't know the exact capabilities of gnu/sun linker/loader, but
> something along the lines of mangling the function to also include
> target details
>
> so compiler would emit multiple mangled versions of foo() and
> linker/loader could pick the most optimized.
>
> Something like this
> nvc0_foo
> avx2_foo
> avx512_foo
> (Also I'd agree that the above would be quite hard)
>
There's quite a bit of work in this direction in a lot of different ways.
You can take a look at the gnu ifunc ELF extensions as a way of doing this
on a per-subtarget feature level. The obvious extension of this to
accelerators is something that we've had discussions about (GNU Tools
Cauldron a couple of years ago) and I believe it's been discussed as part
of a C++ working group.
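
For readers who have not used it, the per-subtarget ifunc mechanism mentioned above looks roughly like the following with GCC or Clang on an ELF/glibc system. This is a minimal sketch, not code from this thread: `sum`, `sum_generic`, `sum_avx2` and `resolve_sum` are hypothetical names, the "AVX2" variant is a stand-in that reuses the generic loop, and the plain-function fallback is there only so the example still builds where ifunc is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>   /* on glibc systems this also defines __GLIBC__ */

typedef long (*sum_fn)(const int *, size_t);

/* Baseline implementation. */
static long sum_generic(const int *v, size_t n) {
    assert(v != NULL || n == 0);
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Stand-in for the same loop compiled with AVX2 enabled. */
static long sum_avx2(const int *v, size_t n) {
    return sum_generic(v, n);
}

#if defined(__GNUC__) && defined(__ELF__) && defined(__GLIBC__)
/* Resolver: run by the dynamic loader when it binds `sum`; it returns
 * the implementation that every later call to `sum` will use. */
static sum_fn resolve_sum(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();                 /* required inside a resolver */
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2;
#endif
    return sum_generic;
}

long sum(const int *v, size_t n) __attribute__((ifunc("resolve_sum")));
#else
/* Fallback for toolchains or C libraries without ifunc support. */
long sum(const int *v, size_t n) { return sum_generic(v, n); }
#endif
```

Callers simply call `sum(v, n)`. The choice is made once at load time, which is why the mechanism fits CPU feature levels better than devices that may be busy or unavailable later.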

At any rate, it's a much bigger discussion than a weekend on the mailing
list, but there's been some thought about how it'll need to happen on each
architecture/OS and, as you can tell, it's a matter of ongoing
experimentation and development. (References: CUDA work, Movidius work,
etc).

-eric

C Bergström

2015-Jun-06 20:52 UTC

[LLVMdev] Supporting heterogeneous computing in llvm.

>>
>> Anyway, to bring this conversation back to something technical instead
>> of just stupid comments.. I'd agree that flipping targets back and
>> forth (intermixed) in the same Module *is* probably a substantial
>> amount of work. If the optimization passes worked at a PU (program
>> unit) aka function level it wouldn't be.
>>
>
> It's just another level of indirection essentially - and a lot of work. It's
> much easier to do what's being proposed and outline work into another
> module. To do what you've said (and I've looked at) is basically turning
> each function into its own little module - a la what the ORC JIT does with
> per-function compilation.

/* Non-jit example - Old Pro64/MIPSPro from SGI is per PU as well..
I'm not sure what kernelgen is doing.. */

I'm not sure I was clear - I'll try to elaborate.

You take the region of code or CUDA kernel, etc. being offloaded and
outline it into a separate PU (function) which goes into a new module,
which is appended to the 1st.

This isn't exactly the clang model today, but *if* llvm is a library -
it's easier to handle the 2 modules one after the other.
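
At the source level, the outlining step described above can be pictured as follows. This is a hedged sketch only: the function names are hypothetical, both versions run on the host, and the data transfer and launch a real offload runtime would perform are reduced to an ordinary call.

```c
#include <stddef.h>

/* Before outlining: host-only work and the offload candidate share
 * one function. */
float normalize_inline(float *v, size_t n) {
    float max = 0.0f;
    for (size_t i = 0; i < n; i++)      /* stays on the host */
        if (v[i] > max)
            max = v[i];
    if (max == 0.0f)
        return max;
    for (size_t i = 0; i < n; i++)      /* region to offload */
        v[i] /= max;
    return max;
}

/* After outlining: the region is its own PU. Everything it reads or
 * writes arrives as an argument, so the function has no ties to the
 * host code and could live in a second module built for another target. */
void normalize_region0(float *v, size_t n, float max) {
    for (size_t i = 0; i < n; i++)
        v[i] /= max;
}

/* Host side: the loop is replaced by a call. A real runtime would copy
 * v to the device, launch the region, and copy the result back here. */
float normalize_outlined(float *v, size_t n) {
    float max = 0.0f;
    for (size_t i = 0; i < n; i++)
        if (v[i] > max)
            max = v[i];
    if (max == 0.0f)
        return max;
    normalize_region0(v, n, max);
    return max;
}
```

Both versions compute the same result. The point of the transformation is that `normalize_region0` carries its live-in values explicitly, which is what makes it possible to put it in a separate module.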
>
>>
>> Why can't you append 1 Module after another and switch?
>
>
> This is, effectively, two modules and it'll behave the same. The reasons are
> data transfer etc for module level attributes, data layout, etc. We've still
> got some lingering issues at the function level let alone at the module
> level with side data taking over. Akira and I are working on them as we can.

cool - good to hear.

>
>>
>>
>> As you point out whole program analysis/optimization will face a
>> similar problem - same question as above.
>> ---------------------
>> Currently - (I don't know about DSP - TI/Qualcomm), but most people in
>> the industry are using custom runtimes to parse the GPU code and
>> load/execute. It would be great if the linker/loader actually had
>> better support for this built-in.
>>
>> I don't know the exact capabilities of gnu/sun linker/loader, but
>> something along the lines of mangling the function to also include
>> target details
>>
>> so compiler would emit multiple mangled versions of foo() and
>> linker/loader could pick the most optimized.
>>
>> Something like this
>> nvc0_foo
>> avx2_foo
>> avx512_foo
>> (Also I'd agree that the above would be quite hard)
>
>
> There's quite a bit of work in this direction in a lot of different ways.
> You can take a look at the gnu ifunc ELF extensions as a way of doing this
> on a per-subtarget feature level. The obvious extension of this to
> accelerators is something that we've had discussions about (GNU Tools
> Cauldron a couple of years ago) and I believe it's been discussed as part of
> a C++ working group.

The ifunc stuff doesn't behave exactly as I'd like. It's sorta close.
Another example - On solaris at boot time they have a check for the
system capabilities and mount over libc/m with the most optimized
version the system is capable of. When I first saw this I thought it
was quite clever and cool. (Many years ago) Doing that for
accelerators wouldn't exactly work though - since they can hang and be
(slightly?) less reliable than the CPU. (Not to mention busy)

The upside to this is less work for the loader. The downside is you
have to build multiple versions of libc and friends.
>
> At any rate, it's a much bigger discussion than a weekend on the mailing
> list, but there's been some thought about how it'll need to happen on each
> architecture/OS and, as you can tell, it's a matter of ongoing
> experimentation and development. (References: CUDA work, Movidius work,
> etc).

Yeah I agree - I probably won't be sending a patch any time soon, but
I thought I could ask questions around designs that I know have
functionally worked.
