thr3ads.net - llvm dev - [llvm-dev] Performance of large llvm::ConstantDataArrays [Sep 2017]

Chris Lovett via llvm-dev

2017-Sep-08 06:06 UTC

[llvm-dev] Performance of large llvm::ConstantDataArrays

I'm running into some pretty bad performance in llc.exe when compiling some
large neural networks into code that contains some very
large llvm::ConstantDataArrays; some are { size=102,760,448 }. There's a
small amount of actual code for processing the network, but the assembly is
mostly global data.

I'm finding that llc.exe memory spikes to around 30 gigabytes and the job
takes 20-30 minutes compiling from bitcode.  When I looked into it I found
that every single floating point number is loaded into a ConstantFP object,
where the float is parsed into exponent and mantissa and the integer part is
stored in a heap-allocated array. These are then emitted into
MCDataFragments, which again hold more heap-allocated data: the float appears
to be stored in a SmallVectorImpl<char>.  On top of this I see a lot of
MCFillFragments added because of long double padding.

All up the code I'm compiling ends up with 276 million MCFragments, which
just take a super long time in each phase of compiling (loading from
bitcode, emitting, layout and writing).  With a peak working set of 30
gigabytes each float is taking around 108 bytes!
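The per-fragment figure above can be checked with a few lines of arithmetic. This is only a sketch: it assumes "30 gigabytes" means decimal gigabytes and divides by the fragment count quoted in the message, since the thread does not give the exact number of floats.

```python
# Figures quoted in the message above.
peak_working_set_bytes = 30e9   # assumption: "30 gigabytes" read as 30 * 10^9 bytes
fragment_count = 276_000_000    # "276 million MCFragments"

# Average memory attributable to each fragment at peak.
bytes_per_fragment = peak_working_set_bytes / fragment_count
print(f"{bytes_per_fragment:.1f} bytes per fragment")
```

Under these assumptions the result lands just under 109 bytes, which matches the "around 108 bytes" estimate.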

Is there a more efficient way to do this? Or is there any plan in the works
to handle global data more efficiently in llc?

Joerg Sonnenberger via llvm-dev

2017-Sep-08 13:00 UTC

On Thu, Sep 07, 2017 at 11:06:58PM -0700, Chris Lovett via llvm-dev wrote:
> I'm running into some pretty bad performance in llc.exe when compiling some
> large neural networks into code that contains some very
> large llvm::ConstantDataArrays; some are { size=102,760,448 }. There's a
> small amount of actual code for processing the network, but the assembly is
> mostly global data.

Have you considered just writing out the binary data directly and using
e.g. .incbin to include it?

Joerg

Sean Silva via llvm-dev

2017-Sep-08 18:33 UTC

On Thu, Sep 7, 2017 at 11:06 PM, Chris Lovett via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> I'm running into some pretty bad performance in llc.exe when compiling
> some large neural networks into code that contains some very large
> llvm::ConstantDataArrays; some are { size=102,760,448 }. There's a small
> amount of actual code for processing the network, but the assembly is
> mostly global data.
>
> I'm finding that llc.exe memory spikes to around 30 gigabytes and the job
> takes 20-30 minutes compiling from bitcode.  When I looked into it I found
> that every single floating point number is loaded into a ConstantFP object,
> where the float is parsed into exponent and mantissa and the integer part is
> stored in a heap-allocated array. These are then emitted into
> MCDataFragments, which again hold more heap-allocated data: the float appears
> to be stored in a SmallVectorImpl<char>.  On top of this I see a lot of
> MCFillFragments added because of long double padding.
>
> All up the code I'm compiling ends up with 276 million MCFragments, which
> just take a super long time in each phase of compiling (loading from
> bitcode, emitting, layout and writing).  With a peak working set of 30
> gigabytes each float is taking around 108 bytes!
>
> Is there a more efficient way to do this? Or is there any plan in the
> works to handle global data more efficiently in llc?
>
Maybe try putting the blob of floating point numbers in a string / i8 array?

-- Sean Silva
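One way to try the i8-array suggestion is to pack the floats into raw bytes and emit them as an LLVM IR string constant. The sketch below generates textual IR from Python; the function name `floats_to_ir_i8_array`, the global name `weights`, the little-endian default, and the `align 4` choice are all assumptions made for illustration, not anything specified in the thread.

```python
import struct

def floats_to_ir_i8_array(name, values, byte_order="<"):
    """Return textual LLVM IR defining `values` as one constant [N x i8] blob.

    byte_order is a struct prefix: "<" for little-endian, ">" for big-endian.
    Each value is packed as a 4-byte IEEE 754 single-precision float.
    """
    raw = struct.pack(f"{byte_order}{len(values)}f", *values)
    # LLVM IR string constants accept \XX hex escapes for arbitrary bytes.
    escaped = "".join(f"\\{b:02X}" for b in raw)
    return f'@{name} = constant [{len(raw)} x i8] c"{escaped}", align 4'

print(floats_to_ir_i8_array("weights", [1.0, -2.5]))
# prints: @weights = constant [8 x i8] c"\00\00\80\3F\00\00\20\C0", align 4
```

Code that consumes the global would then read it through a float-typed pointer. Whether this actually avoids the per-element fragment cost in MC is not established by this message and would need measuring.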


Chris Lovett via llvm-dev

2017-Sep-08 23:30 UTC

Thanks, I've considered that, but I am using LLVM to target multiple
platforms, so if I do that I'm worried I need to consider the
cross-platform floating point memory layouts ... unless LLVM can help me
with creating the correct binary blob...
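If the targets all use IEEE 754 single precision for float (an assumption that holds for mainstream platforms but should be checked per target), the main layout difference left is byte order. A minimal sketch of writing the blob with an explicit byte order, so one generator can serve several targets, might look like this. The helper name `write_float_blob`, the file name, and the assembler fragment in the trailing comment are illustrative assumptions.

```python
import os
import struct
import tempfile

def write_float_blob(path, values, little_endian=True):
    """Write values as packed 4-byte IEEE 754 floats; return the byte count."""
    prefix = "<" if little_endian else ">"
    data = struct.pack(f"{prefix}{len(values)}f", *values)
    with open(path, "wb") as f:
        f.write(data)
    return len(data)

# Demo: write the same weights once per byte order into a scratch directory.
scratch = tempfile.mkdtemp()
le_path = os.path.join(scratch, "weights_le.bin")
be_path = os.path.join(scratch, "weights_be.bin")
write_float_blob(le_path, [1.0, -2.5], little_endian=True)
write_float_blob(be_path, [1.0, -2.5], little_endian=False)

# A GNU-style assembler file could then pull the blob in, for example:
#     .balign 4
#     weights:
#     .incbin "weights_le.bin"
```

The section and symbol directives around `.incbin` differ between object formats, so the assembler fragment is only a starting point.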


Chris Lattner via llvm-dev

2017-Sep-10 05:18 UTC

> On Sep 7, 2017, at 11:06 PM, Chris Lovett via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> I'm running into some pretty bad performance in llc.exe when compiling
> some large neural networks into code that contains some very large
> llvm::ConstantDataArrays; some are { size=102,760,448 }. There's a small
> amount of actual code for processing the network, but the assembly is mostly
> global data.

Yes, llvm’s representation of constant arrays is insanity for cases like this. 
Your case is bad, but just imagine the cost of a large char[] initialization:
even though each byte is stored as a ConstantInt, the bloat isn’t huge because
they are uniqued.  The real problem comes from each entry in the ConstantArray
being stored as an operand list.  An operand in the operand list consumes
something like 3-4 words per operand to maintain the uselist and a bunch of
other nonsense that isn’t right for this.
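To put the "3-4 words per operand" estimate in perspective, here is a rough calculation for a ConstantArray with as many elements as the largest array in the original report. It assumes a 64-bit host (8-byte words) and 32-bit float elements; neither is stated in the thread.

```python
ELEMENTS = 102_760_448   # largest array size quoted in the thread
WORD_BYTES = 8           # assumption: 64-bit host, so one word is 8 bytes
FLOAT_BYTES = 4          # assumption: elements are 32-bit floats

def operand_overhead_bytes(elements, words_per_operand):
    """Memory taken by the operand list alone, excluding the element values."""
    return elements * words_per_operand * WORD_BYTES

low = operand_overhead_bytes(ELEMENTS, 3)
high = operand_overhead_bytes(ELEMENTS, 4)
dense = ELEMENTS * FLOAT_BYTES   # the same data stored as a packed buffer

print(f"operand list: {low / 1e9:.2f} to {high / 1e9:.2f} GB")
print(f"dense buffer: {dense / 1e9:.2f} GB")
```

Under these assumptions the operand bookkeeping alone is six to eight times the size of the packed data.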

IMO, there is a relatively easy solution for this.  Introduce a new subclass of
ConstantData which represents a blob of data that gets emitted to the .o file,
stored in a reasonable native format.  I think it would be fine to limit this to
only representing arrays of primitive types (e.g. array of float, array of
bytes, etc) since this keeps the API to the type simple (the type models an
array, so it should have array element members only), and things that want to
get the elements of the array out can have them returned as ConstantInt’s (or
whatever).  I’d name this something like “ConstantArrayBlob”.
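The shape of that proposal can be illustrated outside LLVM. The toy class below is purely hypothetical: it is not LLVM code, and only the name "ConstantArrayBlob" comes from the message. It stores a primitive-typed array as one packed buffer and materializes individual elements only when asked. The type names and the little-endian layout are assumptions.

```python
import struct

class ConstantArrayBlob:
    """Hypothetical dense array constant: one packed buffer, no per-element nodes."""

    # Assumed mapping from element type names to struct format codes.
    _FORMATS = {"float": "f", "double": "d", "i8": "B", "i32": "i"}

    def __init__(self, element_type, values):
        self._fmt = "<" + self._FORMATS[element_type]   # little-endian, fixed size
        self._size = struct.calcsize(self._fmt)
        code = self._FORMATS[element_type]
        self._data = struct.pack(f"<{len(values)}{code}", *values)

    def __len__(self):
        return len(self._data) // self._size

    def element(self, index):
        """Decode one element on demand instead of keeping an object per element."""
        if not 0 <= index < len(self):
            raise IndexError(index)
        return struct.unpack_from(self._fmt, self._data, index * self._size)[0]

    def raw_bytes(self):
        """The bytes that would be emitted to the object file as-is."""
        return self._data

blob = ConstantArrayBlob("float", [1.0, 2.0, 0.5])
print(len(blob), blob.element(2), len(blob.raw_bytes()))
```

Restricting the element types to a small table of primitives mirrors the suggestion to keep the API limited to arrays of primitive types.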

There are cases this wouldn’t cover well, e.g. an array of small structs, but I
think that is ok, and it could be feature crept to support that over time.   The
next trick is adding the corresponding special case to Clang to not generate the
ConstantArray and the ConstantFP/Int members when given a candidate
initialization.  This can be done as a secondary optimization after the basic
mechanics are in place.

-Chris


Sean Silva via llvm-dev

2017-Sep-10 08:34 UTC

On Sat, Sep 9, 2017 at 10:18 PM, Chris Lattner via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> On Sep 7, 2017, at 11:06 PM, Chris Lovett via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> > I'm running into some pretty bad performance in llc.exe when compiling
> > some large neural networks into code that contains some very large
> > llvm::ConstantDataArrays; some are { size=102,760,448 }. There's a small
> > amount of actual code for processing the network, but the assembly is
> > mostly global data.
>
> Yes, llvm’s representation of constant arrays is insanity for cases like
> this.  Your case is bad, but just imagine the cost of a large char[]
> initialization: even though each byte is stored as a ConstantInt, the bloat
> isn’t huge because they are uniqued.  The real problem comes from each
> entry in the ConstantArray being stored as an operand list.  An operand in
> the operand list consumes something like 3-4 words per operand to maintain
> the uselist and a bunch of other nonsense that isn’t right for this.
>
> IMO, there is a relatively easy solution for this.  Introduce a new
> subclass of ConstantData which represents a blob of data that gets emitted
> to the .o file, stored in a reasonable native format.  I think it would be
> fine to limit this to only representing arrays of primitive types (e.g.
> array of float, array of bytes, etc) since this keeps the API to the type
> simple (the type models an array, so it should have array element members
> only), and things that want to get the elements of the array out can have
> them returned as ConstantInt’s (or whatever).  I’d name this something like
> “ConstantArrayBlob”.
>

What's the relationship between ConstantDataArray and ConstantArray?

The former's doxygen says "An array constant whose element type is a simple
1/2/4/8-byte integer or float/double, and whose elements are just simple
data values (i.e. ConstantInt/ConstantFP). This Constant node has no
operands because it stores all of the elements of the constant as densely
packed data, instead of as Value*'s.", so I assumed that it was a dense
representation, and it seemed reasonable that an i8-typed one would
basically operate as a "ConstantArrayBlob". (But I guess if MC still
creates one fragment per element, that will be a memory hog regardless
of the IR's representation.)

-- Sean Silva

