Rui Ueyama via llvm-dev
2016-Jun-05 20:19 UTC
[llvm-dev] LLD: Using sendfile(2) to copy file contents
This is a short summary of an experiment that I did for the linker.

One of the major tasks of the linker is to copy file contents from input object files to an output file. I was wondering what's the fastest way to copy data from one file to another, so I conducted an experiment.

Currently, LLD copies file contents using memcpy (input files and the output file are mapped to memory). mmap+memcpy is not known to be the fastest way to copy file contents.

Linux has the sendfile system call. The system call takes two file descriptors and copies contents from one to the other (it used to accept only a socket as the destination, but these days it can take any file). It is usually much faster than memcpy for copying files. For example, it is about 3x faster than the cp command for copying large files on my machine (SSD/ext4).

I made a change to LLVM and LLD to use sendfile instead of memcpy to copy section contents. Here's the time to link clang with debug info.

memcpy: 12.96 seconds
sendfile: 12.82 seconds

sendfile(2) was slightly faster, but not by much. However, if you disable string merging (by passing the -O0 flag to the linker), the difference becomes noticeable.

memcpy: 7.85 seconds
sendfile: 6.94 seconds

I think that is because, with -O0, the linker has to copy more content than without it; the resulting executable is about 2x larger. As the amount of data the linker needs to copy grows, sendfile becomes more effective.

By the way, gold takes 27.05 seconds to link it.

Given these results, I'm *not* going to submit the change. There are two reasons. First, the optimization seems too system-specific, and I'm not yet sure it's always effective even on Linux. Second, the current implementations of MemoryBuffer and FileOutputBuffer are not sendfile(2)-friendly because they close file descriptors immediately after mapping them to memory. My patch is too hacky to submit.

That being said, the results clearly show that there's room for future optimization. I think we should revisit this when we do low-level optimization of link speed.

Rui
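For reference, a minimal sketch of the kind of sendfile(2) copy loop the experiment describes. The file names, offset, and size below are made-up placeholders, and real code would need proper error handling and a modern Linux kernel (regular-file destinations require 2.6.33+):

// Sketch only: copy `size` bytes starting at `inputOffset` from inFD to the
// current file offset of outFD using sendfile(2) (Linux-specific).
#include <sys/sendfile.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

static bool copyRange(int inFD, int outFD, off_t inputOffset, size_t size) {
  off_t off = inputOffset;
  size_t remaining = size;
  while (remaining > 0) {
    // sendfile advances `off` by the number of bytes copied and writes at
    // outFD's current file offset.
    ssize_t n = sendfile(outFD, inFD, &off, remaining);
    if (n <= 0)
      return false; // Real code would inspect errno and retry on EINTR.
    remaining -= static_cast<size_t>(n);
  }
  return true;
}

int main() {
  // Hypothetical file names, for illustration only.
  int inFD = open("input.o", O_RDONLY);
  int outFD = open("a.out", O_WRONLY | O_CREAT, 0755);
  if (inFD < 0 || outFD < 0)
    return 1;
  // Copy the first 4096 bytes of the input to the start of the output.
  bool ok = copyRange(inFD, outFD, /*inputOffset=*/0, /*size=*/4096);
  std::printf("copy %s\n", ok ? "succeeded" : "failed");
  close(inFD);
  close(outFD);
  return ok ? 0 : 1;
}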
Davide Italiano via llvm-dev
2016-Jun-05 20:48 UTC
[llvm-dev] LLD: Using sendfile(2) to copy file contents
On Sun, Jun 5, 2016 at 1:19 PM, Rui Ueyama via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> [...]
>
> By the way, gold takes 27.05 seconds to link it.

With or without string merging?

Thanks,

--
Davide

"There are no solved problems; there are only problems that are more or less solved" -- Henri Poincare
Rui Ueyama via llvm-dev
2016-Jun-05 21:10 UTC
[llvm-dev] LLD: Using sendfile(2) to copy file contents
I think it's with string merging.
David Chisnall via llvm-dev
2016-Jun-06 09:24 UTC
[llvm-dev] LLD: Using sendfile(2) to copy file contents
On 5 Jun 2016, at 21:19, Rui Ueyama via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> [...]

This approach is only likely to yield a speedup if you are copying more than a page, because then there is the potential for the kernel to avoid a memcpy and just alias the pages in the buffer cache (note: most systems won't do this anyway, but at least then you're exposing an optimisation opportunity to the kernel).

Using the kernel's memcpy in place of the userspace one is likely to be slightly slower, as kernel memcpy implementations often don't take advantage of vector operations in order to avoid having to save and restore FPU state for each kernel thread. If you're having cache misses, though, this won't make much difference (and if you're on x86, depending on the manufacturer, you may hit a pattern that the microcode recognises and have your code replaced entirely with a microcoded memcpy).
One possible improvement would be to have a custom memcpy that uses non-temporal stores, as this memory is likely not to be used at all on the CPU in the near future (though on recent Intel chips, the DMA unit shares LLC with the CPU, so it will pull the data back into L3 on writeback) and probably won't be DMA'd for another 10-30 seconds. (If the DMA happens sooner, non-temporal stores can adversely affect performance: on Intel chips the DMA controller is limited to using a subset of the cache, so having the CPU pull things into cache that are about to be DMA'd out can actually increase performance; ironically, some zero-copy optimisations actually harm performance on these systems.) This should reduce cache pressure, as the stores will all go through a single way in the (typically) 8-way associative cache. If this is also the last time that you're going to read the data, then using non-temporal loads may also help. Note, however, that the interpretation of the non-temporal hints is advisory and some x86 microcode implementations make quite surprising decisions.

David
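To illustrate the kind of non-temporal copy described above, here is a minimal sketch using SSE2 streaming stores. It assumes the destination is 16-byte aligned and the size is a multiple of 16 bytes, which real code would have to handle, and whether it actually beats an ordinary memcpy is something that would need measuring:

#include <emmintrin.h> // SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence
#include <cstddef>

// Copy `size` bytes using non-temporal (streaming) stores so the destination
// does not displace useful data from the cache. Assumes `dst` is 16-byte
// aligned and `size` is a multiple of 16; a real implementation would handle
// unaligned heads/tails with ordinary stores.
void memcpyNontemporal(void *dst, const void *src, size_t size) {
  auto *d = static_cast<__m128i *>(dst);
  auto *s = static_cast<const __m128i *>(src);
  for (size_t i = 0; i < size / 16; ++i) {
    __m128i v = _mm_loadu_si128(s + i); // Ordinary (possibly unaligned) load.
    _mm_stream_si128(d + i, v);         // Non-temporal store, bypasses cache.
  }
  _mm_sfence(); // Make the streaming stores visible before any following I/O.
}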
Rafael Espíndola via llvm-dev
2016-Jun-06 17:49 UTC
[llvm-dev] LLD: Using sendfile(2) to copy file contents
Thanks a lot for running the experiment. One thing I want to try one day is relocating one section at a time in anonymous memory and then using async io (io_submit) to write the final bits. That way the kernel can do io while we relocate other sections.

Cheers,
Rafael

On Jun 5, 2016 4:19 PM, "Rui Ueyama via llvm-dev" <llvm-dev at lists.llvm.org> wrote:

> [...]
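As an illustration of the io_submit approach mentioned above, here is a rough sketch using the libaio wrappers (link with -laio). The file name, buffer, and offset are placeholders, and for the kernel to perform the write truly asynchronously the output file would typically need O_DIRECT with suitably aligned buffers, so this is only the shape of the idea:

#include <libaio.h> // io_setup, io_prep_pwrite, io_submit, io_getevents
#include <fcntl.h>
#include <unistd.h>
#include <vector>

int main() {
  // Hypothetical output file and section buffer, for illustration only.
  int outFD = open("a.out", O_WRONLY | O_CREAT, 0755);
  if (outFD < 0)
    return 1;

  std::vector<char> section(1 << 20, 0); // Pretend these are relocated bytes.

  io_context_t ctx = nullptr; // Must be zero-initialized before io_setup.
  if (io_setup(/*maxevents=*/8, &ctx) != 0)
    return 1;

  // Queue an asynchronous write of the relocated section at its output
  // offset; the linker could keep relocating other sections while this is
  // in flight.
  struct iocb cb;
  io_prep_pwrite(&cb, outFD, section.data(), section.size(), /*offset=*/0);
  struct iocb *cbs[1] = {&cb};
  if (io_submit(ctx, 1, cbs) != 1)
    return 1;

  // ... relocate other sections here ...

  // Eventually wait for the queued write to complete.
  struct io_event events[1];
  io_getevents(ctx, 1, 1, events, /*timeout=*/nullptr);

  io_destroy(ctx);
  close(outFD);
  return 0;
}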
Rui Ueyama via llvm-dev
2016-Jun-06 18:41 UTC
[llvm-dev] LLD: Using sendfile(2) to copy file contents
To leave an optimization opportunity for the kernel, I think mmap+write would be enough. Because the kernel knows which addresses are mmap'ed, it can detect that write's source is actually a mmap'ed file, and in that case it can optimize just as it does for sendfile. It seems that Linux doesn't do that today, though.

I hadn't thought about using non-temporal stores. It might help since we copy a very large amount of data, but after copying the data we read it back in order to apply relocations, so I'm not sure it would be an overall win.

Finally, as to asynchronous IO, I wonder how effective it is. It seems that not that many people use asynchronous IO on Linux, and it is often the case that minor paths are not optimized well. I agree that at least in theory it could improve throughput, so it's worth a try.

On Mon, Jun 6, 2016 at 10:49 AM, Rafael Espíndola <rafael.espindola at gmail.com> wrote:

> Thanks a lot for running the experiment. One thing I want to try one day is relocating one section at a time in anonymous memory and then using async io (io_submit) to write the final bits. That way the kernel can do io while we relocate other sections.
>
> Cheers,
> Rafael
>
> [...]
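To make the mmap+write idea above concrete, here is a minimal sketch; the file names are placeholders, and whether the kernel recognises that the write's source is a file-backed mapping and avoids the copy is exactly the open question raised above:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
  // Hypothetical input/output names, for illustration only.
  int inFD = open("input.o", O_RDONLY);
  int outFD = open("a.out", O_WRONLY | O_CREAT, 0755);
  if (inFD < 0 || outFD < 0)
    return 1;

  struct stat st;
  if (fstat(inFD, &st) != 0)
    return 1;

  // Map the input; the write(2) below then sources its data from file-backed
  // pages, which is the situation where a kernel could in principle skip the
  // extra copy, as sendfile does.
  void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, inFD, 0);
  if (p == MAP_FAILED)
    return 1;

  ssize_t n = write(outFD, p, st.st_size); // Plain write from the mapping.
  munmap(p, st.st_size);
  close(inFD);
  close(outFD);
  return n == st.st_size ? 0 : 1;
}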
Sean Silva via llvm-dev
2016-Jun-06 21:55 UTC
[llvm-dev] LLD: Using sendfile(2) to copy file contents
On Mon, Jun 6, 2016 at 2:24 AM, David Chisnall via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> On 5 Jun 2016, at 21:19, Rui Ueyama via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> > [...]
>
> This approach is only likely to yield a speedup if you are copying more than a page, because then there is the potential for the kernel to avoid a memcpy and just alias the pages in the buffer cache (note: most systems won't do this anyway, but at least then you're exposing an optimisation opportunity to the kernel).

This assumes that the from/to addresses have the same offset modulo the page size, which I'm not sure is ever really the case for input sections and their location in the output.

> Using the kernel's memcpy in place of the userspace one is likely to be slightly slower, as kernel memcpy implementations often don't take advantage of vector operations in order to avoid having to save and restore FPU state for each kernel thread. [...]
> One possible improvement would be to have a custom memcpy that uses non-temporal stores, as this memory is likely not to be used at all on the CPU in the near future [...] Note, however, that the interpretation of the non-temporal hints is advisory and some x86 microcode implementations make quite surprising decisions.

I don't think that the performance problem of the memcpy here is Dcache related (it is just a memcpy, so it should prefetch well). I clocked our memcpy to the output at < 1 GB/s throughput, on a machine that can do > 60 GB/s of DRAM bandwidth (see http://reviews.llvm.org/D20645#440638). My guess is that the problem here is more about virtual memory cost (the kernel having to fix up page tables, zero-fill pages, etc.).

-- Sean Silva