thr3ads.net - Linux Virtualization - [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance [Jan 2014]

If this information is useful, please help other people find it:
Share via:

Michael Dalton

2014-Jan-09 08:28 UTC

[PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

Hi Michael,

Here's a quick sketch of some code that enforces a minimum buffer
alignment of only 64, and has a maximum theoretical buffer size of
aligned GOOD_PACKET_LEN + (BUF_ALIGN - 1) * BUF_ALIGN, which is at least
1536 + 63 * 64 = 5568. On x86, we already use a 64 byte alignment, and
this code supports all current buffer sizes, from 1536 to PAGE_SIZE.

#if L1_CACHE_BYTES < 64
#define MERGEABLE_BUFFER_ALIGN 64
#define MERGEABLE_BUFFER_SHIFT 6
#else
#define MERGEABLE_BUFFER_ALIGN L1_CACHE_BYTES
#define MERGEABLE_BUFFER_SHIFT L1_CACHE_SHIFT
#endif
#define MERGEABLE_BUFFER_MIN ALIGN(GOOD_PACKET_LEN +
                                   sizeof(virtio_net_hdr_mrg_rbuf),
                                   MERGEABLE_BUFFER_ALIGN)
#define MERGEABLE_BUFFER_MAX min(MERGEABLE_BUFFER_MIN +
                                 (MERGEABLE_BUFFER_ALIGN - 1) *
                                 MERGEABLE_BUFFER_ALIGN, PAGE_SIZE)
/* Extract buffer length from a mergeable buffer context. */
static u16 get_mergeable_buf_ctx_len(void *ctx) {
        u16 len = (uintptr_t)ctx & (MERGEABLE_BUFFER_ALIGN - 1);
        return MERGEABLE_BUFFER_MIN + (len << MERGEABLE_BUFFER_SHIFT);
}
/* Extract buffer base address from a mergeable buffer context. */
static void *get_mergeable_buf_ctx_base(void *ctx) {
        return (void *) ((uintptr)ctx & -MERGEABLE_BUFFER_ALIGN);
}
/* Convert a base address and length to a mergeable buffer context. */
static void *to_mergeable_buf_ctx(void *base, u16 len) {
        len -= MERGEABLE_BUFFER_MIN;
        return (void *) ((uintptr)base | (len >> MERGEABLE_BUFFER_SHIFT));
}
/* Compute the packet buffer length for a receive queue. */
static u16 get_mergeable_buffer_len(struct receive_queue *rq) {
        u16 len = clamp_t(u16, MERGEABLE_BUFFER_MIN,
                          ewma_read(&rq->avg_pkt_len),
                          MERGEABLE_BUFFER_MAX);
        return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
}

Best,

Mike

Michael Dalton

2014-Jan-09 09:02 UTC

head link

[PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

If the prior code snippet looks good to you, I'll use something like
that as a baseline for a v3 patchset. I don't think we need a stricter
alignment than 64 to express values in the range (1536 ... 4096), as
the code snippet shows, which is great for x86 4KB pages.

On other architectures that have larger page sizes > 4KB with <= 64b
cachelines, we may want to increase the alignment so that the max buffer
size will be >= PAGE_SIZE (max size allowed by skb_page_frag_refill).

If we use a minimum alignment of 128, our maximum theoretical
packet buffer length is 1536 + 127 * 128 = 17792. With 256 byte
alignment, we can express a maximum packet buffer size > 65536.

Given the above, I think we want to select the min buffer alignment
based on the PAGE_SIZE:
        <= 4KB PAGE_SIZE: 64b min alignment
        <= 16KB PAGE_SIZE: 128b min alignment
        > 16KB PAGE_SIZE: 256b min alignment

So the prior code snippet would be relatively unchanged, except that
references to the previous minimum alignment of 64 would be replaced by
a #define'd constant derived from PAGE_SIZE as shown above.

This would guarantee that we use the minimum alignment necessary to
ensure that virtio-net can post a max size (PAGE_SIZE) buffer, and for
x86 this means we won't increase the alignment beyond the x86's current
L1_CACHE_BYTES value (64). Also, sorry I haven't had a chance to respond
yet to the debugfs feedback, I will get to that soon (just wanted to do
a further deep dive on some of the sysfs/debugfs tradeoffs).

Best,

Mike

Michael S. Tsirkin

2014-Jan-09 13:25 UTC

head link

[PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

On Thu, Jan 09, 2014 at 12:28:50AM -0800, Michael Dalton
wrote:> Hi Michael,
> 
> Here's a quick sketch of some code that enforces a minimum buffer
> alignment of only 64, and has a maximum theoretical buffer size of
> aligned GOOD_PACKET_LEN + (BUF_ALIGN - 1) * BUF_ALIGN, which is at least
> 1536 + 63 * 64 = 5568. On x86, we already use a 64 byte alignment, and
> this code supports all current buffer sizes, from 1536 to PAGE_SIZE.
I really think it's best to start with 256 alignment.
A bit simpler, better than what we had, and will let us go
above PAGE_SIZE long term.  The optimization shrinking
alignment to 64 can be done on top if we see a
work-load that's improved by it, which I doubt.
> 
> #if L1_CACHE_BYTES < 64
> #define MERGEABLE_BUFFER_ALIGN 64
> #define MERGEABLE_BUFFER_SHIFT 6
> #else
> #define MERGEABLE_BUFFER_ALIGN L1_CACHE_BYTES
> #define MERGEABLE_BUFFER_SHIFT L1_CACHE_SHIFT
> #endif
> #define MERGEABLE_BUFFER_MIN ALIGN(GOOD_PACKET_LEN +
>                                    sizeof(virtio_net_hdr_mrg_rbuf),
>                                    MERGEABLE_BUFFER_ALIGN)
> #define MERGEABLE_BUFFER_MAX min(MERGEABLE_BUFFER_MIN +
>                                  (MERGEABLE_BUFFER_ALIGN - 1) *
>                                  MERGEABLE_BUFFER_ALIGN, PAGE_SIZE)
> /* Extract buffer length from a mergeable buffer context. */
> static u16 get_mergeable_buf_ctx_len(void *ctx) {
>         u16 len = (uintptr_t)ctx & (MERGEABLE_BUFFER_ALIGN - 1);
>         return MERGEABLE_BUFFER_MIN + (len <<
MERGEABLE_BUFFER_SHIFT);
You can just do len * MERGEABLE_BUFFER_ALIGN too - my compiler seems
to be smart enough to see it's same.
This way there's no need for MERGEABLE_BUFFER_SHIFT
> }
> /* Extract buffer base address from a mergeable buffer context. */
> static void *get_mergeable_buf_ctx_base(void *ctx) {
>         return (void *) ((uintptr)ctx & -MERGEABLE_BUFFER_ALIGN);
uintptr_t? or just unsigned long ...
> }
> /* Convert a base address and length to a mergeable buffer context. */
> static void *to_mergeable_buf_ctx(void *base, u16 len) {
>         len -= MERGEABLE_BUFFER_MIN;
>         return (void *) ((uintptr)base | (len >>
MERGEABLE_BUFFER_SHIFT));
> }
The APIs are a bit inconsistent in naming.
Also it's unfortunate that we use void * for both virtio buffer
and frame memory.

How about
	mergeable_buf_to_len
	mergeable_buf_to_page
	mergeable_buf_to_offset
	mergeable_page_to_buf

this way we don't use void * for frame memory.

alternatively, cast virtio buffer to unsigned long
immediately and pass that around everywhere.

> /* Compute the packet buffer length for a receive queue. */
> static u16 get_mergeable_buffer_len(struct receive_queue *rq) {
>         u16 len = clamp_t(u16, MERGEABLE_BUFFER_MIN,
>                           ewma_read(&rq->avg_pkt_len),
>                           MERGEABLE_BUFFER_MAX);
>         return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
> }
> 
> Best,
> 
> Mike

Michael Dalton

2014-Jan-09 19:33 UTC

head link

[PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

Hi Michael,

Your improvements (code changes, more consistent naming, and use of
256-byte alignment only) all sound good to me. I will get started on a v3
patchset in conformance with your recommendations after sorting out
what we want to do with the debugfs/sysfs issues. I will followup soon on
the thread for patch 4/4 so we can close on what changes are needed for
debugfs/sysfs. Thanks!

Best,

Mike

Reasonably Related Threads

Search for more maybe matching threads

Linux Virtualization - Jan 2014 - [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

[PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

[PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

[PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

[PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance

Reasonably Related Threads