thr3ads.net - llvm dev - [llvm-dev] slow performance in llc.exe to do with large global floating point arrays [May 2019]

Chris Lovett via llvm-dev

2019-May-08 22:52 UTC

[llvm-dev] slow performance in llc.exe to do with large global floating point arrays

We are building a neural network compiler using LLVM; see
https://github.com/Microsoft/ELL.

We want to put the neural network weights into a set of global float arrays
because that makes it easier to keep them in flash memory on small embedded
devices. For example, it enables scenarios like this keyword spotting demo
<https://lovettchris.github.io/posts/keyword_spotting/>.

We are seeing very poor compile times in some cases. For example, this GitHub
gist <https://gist.github.com/lovettchris/91e30bce1d18f16eddaf67306101e4e0>
contains a bitcode file, a neural network compiled by ELL, with about 30 MB of
floating point data. Putting it through llc takes 262 seconds (on an
Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz), whereas if we strip out the weights,
the "code" component of our neural network inference takes only 2 seconds to
compile.

We've noticed a good improvement in LLVM 8.0 in this area, but we think
there's still a lot more that could be done. For example, is it possible to
dump big arrays of global floating point data into a binary without incurring
a large assembly writer overhead? Perhaps the optimizer is trying to optimize
away unused floats. If so, we would like to disable that and just tell the
compiler to dump the floats into the object file without trying to optimize
them.

Any thoughts?

Eli Friedman via llvm-dev

2019-May-10 01:30 UTC

[llvm-dev] slow performance in llc.exe to do with large global floating point arrays

Have you tried -time-passes to see where it's actually spending time?  I
don't think there's currently a timer that covers printing global
variables to assembly, but you should be able to rule out something else.
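As a rough illustration of that suggestion, here is a minimal Python sketch that builds an llc command line with -time-passes and runs it if llc is on the PATH. The file names model.bc and model.o and both helper function names are hypothetical; -filetype=obj is included on the assumption that you want an object file rather than textual assembly, and the timing report is assumed to go to stderr, which is the default.

```python
import shutil
import subprocess


def llc_timing_command(bitcode_path, output_path, llc="llc"):
    """Build an llc invocation that reports per-pass timing.

    -time-passes prints a timing report for each pass;
    -filetype=obj asks for an object file instead of textual assembly.
    """
    return [llc, "-time-passes", "-filetype=obj", bitcode_path, "-o", output_path]


def run_with_pass_timing(bitcode_path, output_path, llc="llc"):
    """Run llc and return the timing report text, or None if llc is not installed."""
    if shutil.which(llc) is None:
        return None
    result = subprocess.run(
        llc_timing_command(bitcode_path, output_path, llc),
        capture_output=True,
        text=True,
    )
    # Assumption: the report is written to stderr (the default destination).
    return result.stderr


if __name__ == "__main__":
    # "model.bc" and "model.o" are placeholder names for illustration only.
    report = run_with_pass_timing("model.bc", "model.o")
    print(report if report is not None else "llc not found on PATH")
```

Comparing the report for the full module against one with the weights stripped out should show whether any named pass accounts for the difference.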

Currently, the fastest path for emitting global data into an object file is an
i8 array such as "[1000000 x i8]". Given a module in memory, that path makes
one extra copy compared with the ideal of calling write() directly on the bits,
which should be fast enough for most purposes.  A "[1000000 x float]"
currently uses a less efficient path that copies the values one by one, but it
probably wouldn't be hard to optimize.  If you're emitting something that
isn't just an array of constant data, it gets less efficient.  See
emitGlobalConstantDataSequential in lib/CodeGen/AsmPrinter/AsmPrinter.cpp.
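One way to take advantage of the i8 path from the producer side is to emit the weights as raw bytes instead of a float array. The Python sketch below is only an illustration: the function name and the global name are hypothetical, it assumes 32-bit IEEE floats on a little-endian target, and the code that reads the weights would then have to access the global through a float pointer.

```python
import struct


def floats_to_i8_global(name, values):
    """Return LLVM IR text defining the float weights as a constant i8 array.

    Assumes a little-endian target and 32-bit IEEE-754 floats.
    """
    # Pack every value as a little-endian 32-bit float.
    raw = struct.pack("<%df" % len(values), *values)
    # LLVM IR string constants accept \XX hex escapes; escaping every byte
    # avoids any special handling of quotes and backslashes.
    body = "".join("\\%02X" % b for b in raw)
    return '@%s = internal constant [%d x i8] c"%s", align 4' % (name, len(raw), body)


if __name__ == "__main__":
    # "weights" is a placeholder global name.
    print(floats_to_i8_global("weights", [1.0, 0.5]))
```

Whether this actually removes the bottleneck for a given module is something to confirm by measurement rather than assume.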

-Eli

