thr3ads.net - Linux Virtualization - [PATCH net-next 3/3] net: auto-tune mergeable rx buffer size for improved performance [Dec 2013]

If this information is useful, please help other people find it:
Share via:

Jason Wang

2013-Dec-26 07:33 UTC

[PATCH net-next 3/3] net: auto-tune mergeable rx buffer size for improved performance

On 12/17/2013 08:16 AM, Michael Dalton wrote:> Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page
frag
> allocators") changed the mergeable receive buffer size from PAGE_SIZE
to
> MTU-size, introducing a single-stream regression for benchmarks with large
> average packet size. There is no single optimal buffer size for all
> workloads.  For workloads with packet size <= MTU bytes, MTU +
virtio-net
> header-sized buffers are preferred as larger buffers reduce the TCP window
> due to SKB truesize. However, single-stream workloads with large average
> packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
> are used.
>
> This commit auto-tunes the mergeable receiver buffer packet size by
> choosing the packet buffer size based on an EWMA of the recent packet
> sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
> virtio-net header len to PAGE_SIZE. This improves throughput for
> large packet workloads, as any workload with average packet size >>
PAGE_SIZE will use PAGE_SIZE buffers.
>
> These optimizations interact positively with recent commit
> ba275241030c ("virtio-net: coalesce rx frags when possible during
rx"),
> which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
> optimizations benefit buffers of any size.
>
> Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
> between two QEMU VMs on a single physical machine. Each VM has two VCPUs
> with all offloads & vhost enabled. All VMs and vhost threads run in a
> single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
> in the system will not be scheduled on the benchmark CPUs. Trunk includes
> SKB rx frag coalescing.
>
> net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
> net-next (MTU-size bufs):  13170.01Gb/s
> net-next + auto-tune: 14555.94Gb/s
>
> Signed-off-by: Michael Dalton <mwdalton at google.com>
The patch looks good to me and test this patch with mlx4, it help to
increase the rx performance from about 22Gb/s to about 26 Gb/s.

Michael Dalton

2013-Dec-26 20:06 UTC

head link

[PATCH net-next 3/3] net: auto-tune mergeable rx buffer size for improved performance

On Mon, Dec 23, 2013 at 4:51 AM, Michael S. Tsirkin <mst at redhat.com>
wrote:> OK so a high level benchmark shows it's worth it,
> but how well does the logic work?
> I think we should make the buffer size accessible in sysfs
> or debugfs, and look at it, otherwise we don't really know.
>Exporting the size sounds good to me, it is definitely an
important metric and would give more visibility to the admin.

Do you have a preference for implementation strategy? I was
thinking just add a DEVICE_ATTR to create a read-only sysfs file,
'mergeable_rx_buffer_size', and return a space-separated list of the
current buffer size (computed from the average packet size) for each
receive queue. -EINVAL or a similar error could be returned if the
netdev was not configured for mergeable rx buffers.
> I don't get the real motivation for this.
>
> We have skbs A,B,C sharing a page, with chunk D being unused.
> This randomly charges chunk D to an skb that ended up last
> in the page.
> Correct?
> Why does this make sense?
The intent of this code is to adjust the SKB true size for
the packet. We should completely use each packet buffer except
for the last buffer. For all buffers except the last buffer, it
should be the case that 'len' (bytes received) = buffer size. For
the last buffer, this code adjusts the truesize by comparing the
approximated buffer size with the bytes received into the buffer,
and adding the difference to the SKB truesize if the buffer size
is greater than the number of bytes received.

We approximate the buffer size by using the last packet buffer size
from that same page, which as you have correctly noted may be a buffer
that belongs to a different packet on the same virtio-net device. This
buffer size should be very close to the actual buffer size because our
EWMA estimator uses a high weight (so the packet buffer size changes very
slowly) and there are only a handful packets on a page (even order-3).
> Why head_skb only? Why not full buffer size that comes from host?
> This is simply len.
Sorry, I believe this code fragment should be clearer. Basically, we
have a corner case in that for packets with size <= GOOD_COPY_LEN, there
are no frags because page_to_skb() already unref'd the page and the entire
packet contents are copied to skb->data. In this case, the SKB truesize
is already accurate and should not be updated (and it would be unsafe to
access page->private as page is already unref'd).

I'll look at the above code again and cleanup (please let me know if you
have a preference) and/or add a comment to clarify.

Best,

Mike

Michael S. Tsirkin

2013-Dec-26 20:24 UTC

head link

[PATCH net-next 3/3] net: auto-tune mergeable rx buffer size for improved performance

On Thu, Dec 26, 2013 at 12:06:23PM -0800, Michael Dalton
wrote:> On Mon, Dec 23, 2013 at 4:51 AM, Michael S. Tsirkin <mst at
redhat.com> wrote:
> > OK so a high level benchmark shows it's worth it,
> > but how well does the logic work?
> > I think we should make the buffer size accessible in sysfs
> > or debugfs, and look at it, otherwise we don't really know.
> >
> Exporting the size sounds good to me, it is definitely an
> important metric and would give more visibility to the admin.
> 
> Do you have a preference for implementation strategy? I was
> thinking just add a DEVICE_ATTR to create a read-only sysfs file,
> 'mergeable_rx_buffer_size', and return a space-separated list of
the
> current buffer size (computed from the average packet size) for each
> receive queue. -EINVAL or a similar error could be returned if the
> netdev was not configured for mergeable rx buffers.
Not too important really. Maybe an attribute per queue? Might be
easier to parse and does not need tricky formatting.
> > I don't get the real motivation for this.
> >
> > We have skbs A,B,C sharing a page, with chunk D being unused.
> > This randomly charges chunk D to an skb that ended up last
> > in the page.
> > Correct?
> > Why does this make sense?
> 
> The intent of this code is to adjust the SKB true size for
> the packet. We should completely use each packet buffer except
> for the last buffer. For all buffers except the last buffer, it
> should be the case that 'len' (bytes received) = buffer size. For
> the last buffer, this code adjusts the truesize by comparing the
> approximated buffer size with the bytes received into the buffer,
> and adding the difference to the SKB truesize if the buffer size
> is greater than the number of bytes received.
> 
> We approximate the buffer size by using the last packet buffer size
> from that same page, which as you have correctly noted may be a buffer
> that belongs to a different packet on the same virtio-net device. This
> buffer size should be very close to the actual buffer size because our
> EWMA estimator uses a high weight (so the packet buffer size changes very
> slowly) and there are only a handful packets on a page (even order-3).
I'll need to ponder it, don't understand it exactly.
> > Why head_skb only? Why not full buffer size that comes from host?
> > This is simply len.
> 
> Sorry, I believe this code fragment should be clearer. Basically, we
> have a corner case in that for packets with size <= GOOD_COPY_LEN, there
> are no frags because page_to_skb() already unref'd the page and the
entire
> packet contents are copied to skb->data. In this case, the SKB truesize
> is already accurate and should not be updated (and it would be unsafe to
> access page->private as page is already unref'd).
> 
> I'll look at the above code again and cleanup (please let me know if
you
> have a preference) and/or add a comment to clarify.
> 
> Best,
> 
> Mike

Jason Wang

2013-Dec-27 03:04 UTC

head link

[PATCH net-next 3/3] net: auto-tune mergeable rx buffer size for improved performance

On 12/27/2013 04:06 AM, Michael Dalton wrote:> On Mon, Dec 23, 2013 at 4:51 AM, Michael S. Tsirkin <mst at
redhat.com> wrote:
>> OK so a high level benchmark shows it's worth it,
>> but how well does the logic work?
>> I think we should make the buffer size accessible in sysfs
>> or debugfs, and look at it, otherwise we don't really know.
>>
> Exporting the size sounds good to me, it is definitely an
> important metric and would give more visibility to the admin.
>
> Do you have a preference for implementation strategy? I was
> thinking just add a DEVICE_ATTR to create a read-only sysfs file,
> 'mergeable_rx_buffer_size', and return a space-separated list of
the
> current buffer size (computed from the average packet size) for each
> receive queue. -EINVAL or a similar error could be returned if the
> netdev was not configured for mergeable rx buffers.
>
>> I don't get the real motivation for this.
>>
>> We have skbs A,B,C sharing a page, with chunk D being unused.
>> This randomly charges chunk D to an skb that ended up last
>> in the page.
>> Correct?
>> Why does this make sense?
> The intent of this code is to adjust the SKB true size for
> the packet. We should completely use each packet buffer except
> for the last buffer. For all buffers except the last buffer, it
> should be the case that 'len' (bytes received) = buffer size. For
> the last buffer, this code adjusts the truesize by comparing the
> approximated buffer size with the bytes received into the buffer,
> and adding the difference to the SKB truesize if the buffer size
> is greater than the number of bytes received.
But look like it does not take the 'hole' into account which may cause
under estimation of trusize.>
> We approximate the buffer size by using the last packet buffer size
> from that same page, which as you have correctly noted may be a buffer
> that belongs to a different packet on the same virtio-net device. This
> buffer size should be very close to the actual buffer size because our
> EWMA estimator uses a high weight (so the packet buffer size changes very
> slowly) and there are only a handful packets on a page (even order-3).
We can make this more accurate by using extra data structure to track
the real buf size and using it as token.>> Why head_skb only? Why not full buffer size that comes from host?
>> This is simply len.
> Sorry, I believe this code fragment should be clearer. Basically, we
> have a corner case in that for packets with size <= GOOD_COPY_LEN, there
> are no frags because page_to_skb() already unref'd the page and the
entire
> packet contents are copied to skb->data. In this case, the SKB truesize
> is already accurate and should not be updated (and it would be unsafe to
> access page->private as page is already unref'd).
>
> I'll look at the above code again and cleanup (please let me know if
you
> have a preference) and/or add a comment to clarify.
>
> Best,
>
> Mike

Reasonably Related Threads

Search for more reasonably related threads

Linux Virtualization - Dec 2013 - [PATCH net-next 3/3] net: auto-tune mergeable rx buffer size for improved performance

[PATCH net-next 3/3] net: auto-tune mergeable rx buffer size for improved performance

[PATCH net-next 3/3] net: auto-tune mergeable rx buffer size for improved performance

[PATCH net-next 3/3] net: auto-tune mergeable rx buffer size for improved performance

[PATCH net-next 3/3] net: auto-tune mergeable rx buffer size for improved performance

Reasonably Related Threads