Andy,

I put some profiling around calls to GrantAccess and EndAccess, and have the following results:

XenNet TxBufferGC        Count = 108351, Avg Time = 227989
XenNet TxBufferFree      Count = 0,      Avg Time = 0
XenNet RxBufferAlloc     Count = 108353, Avg Time = 17349
XenNet RxBufferFree      Count = 0,      Avg Time = 0
XenNet ReturnPacket      Count = 65231,  Avg Time = 1106
XenNet RxBufferCheck     Count = 108353, Avg Time = 124069
XenNet Linearize         Count = 129024, Avg Time = 29333
XenNet SendPackets       Count = 129024, Avg Time = 67107
XenNet SendQueuedPackets Count = 237369, Avg Time = 73055
XenNet GrantAccess       Count = 194325, Avg Time = 25878
XenNet EndAccess         Count = 194261, Avg Time = 27181

The time for GrantAccess and EndAccess is, I think, quite significant in the scheme of things, especially as TxBufferGC and RxBufferCheck (the two large times) will both make multiple calls to GrantAccess and EndAccess.

What I'd like to do is implement a compromise between my previous buffer management approach (used lots of memory, but no allocate/grant per packet) and your approach (uses minimum memory, but allocate/grant per packet). We would maintain a pool of packets and buffers, and grow and shrink the pool dynamically, as follows:
. Create a freelist of packets and buffers
. When we need a new packet or buffer, and there are none on the freelist, allocate them and grant the buffer.
. When we are done with them, put them on the freelist
. Keep a count of the minimum size of the freelists. If the free list has been greater than some value (32?) for some time (5 seconds?) then free half of the items on the list.
. Maybe keep a freelist per processor too, to avoid the need for spinlocks where we are running at DISPATCH_LEVEL

I think that gives us a pretty good compromise between memory usage and calls to allocate/grant/ungrant/free.

I was going to look at getting rid of the Linearize, but if we don't Linearize then we have to GrantAccess the kernel-supplied buffer, and I think a (max) 1500 byte memcpy is going to be cheaper than a call to GrantAccess...

Thoughts?

James
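A minimal sketch of the freelist being proposed, in the style of Windows kernel code. The GrantAccess/EndAccess wrappers, backend_domid, virt_to_mfn and the POOLED_PAGE layout are assumptions for illustration, not the driver's actual names; each page is granted once when it enters the pool and keeps its grant ref while it sits on the freelist (error handling omitted).

typedef struct _POOLED_PAGE {
  struct _POOLED_PAGE *next;
  PVOID va;            /* virtual address of the page */
  grant_ref_t gref;    /* grant ref issued once, when the page is created */
} POOLED_PAGE;

static POOLED_PAGE *freelist;    /* singly linked freelist of granted pages */
static LONG freelist_count;
static LONG freelist_min;        /* minimum depth seen in the current interval */
static KSPIN_LOCK freelist_lock;

POOLED_PAGE *get_pooled_page(void)
{
  POOLED_PAGE *pp;
  KIRQL old_irql;

  KeAcquireSpinLock(&freelist_lock, &old_irql);
  pp = freelist;
  if (pp) {
    freelist = pp->next;
    if (--freelist_count < freelist_min)
      freelist_min = freelist_count;
  }
  KeReleaseSpinLock(&freelist_lock, old_irql);

  if (!pp) {
    /* freelist empty: allocate and grant a fresh page (slow path) */
    pp = ExAllocatePoolWithTag(NonPagedPool, sizeof(POOLED_PAGE), 'PnuX');
    pp->va = ExAllocatePoolWithTag(NonPagedPool, PAGE_SIZE, 'PnuX');
    pp->gref = GrantAccess(backend_domid, virt_to_mfn(pp->va), FALSE);
  }
  return pp;
}

void put_pooled_page(POOLED_PAGE *pp)
{
  KIRQL old_irql;

  /* the page stays granted; it just goes back on the list */
  KeAcquireSpinLock(&freelist_lock, &old_irql);
  pp->next = freelist;
  freelist = pp;
  freelist_count++;
  KeReleaseSpinLock(&freelist_lock, old_irql);
}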
> What I'd like to do is implement a compromise between my previous
> buffer management approach (used lots of memory, but no allocate/grant
> per packet) and your approach (uses minimum memory, but allocate/grant
> per packet). We would maintain a pool of packets and buffers, and grow
> and shrink the pool dynamically, as follows:
> . Create a freelist of packets and buffers
> . When we need a new packet or buffer, and there are none on the
> freelist, allocate them and grant the buffer.
> . When we are done with them, put them on the freelist
> . Keep a count of the minimum size of the freelists. If the free list
> has been greater than some value (32?) for some time (5 seconds?) then
> free half of the items on the list.
> . Maybe keep a freelist per processor too, to avoid the need for
> spinlocks where we are running at DISPATCH_LEVEL
>
> I think that gives us a pretty good compromise between memory usage and
> calls to allocate/grant/ungrant/free.

I have implemented something like the above, a 'page pool' which is a list of pre-granted pages. This drops the time spent in TxBufferGC and SendQueuedPackets by 30-50%. A good start I think, although there doesn't appear to be much improvement in the iperf results, maybe only 20%.

It's time for sleep now, but when I get a chance I'll add the same logic to the receive path, and clean it up so xennet can unload properly (currently it leaks and/or crashes on unload).

James
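For context, a sketch of how a pre-granted pool page might be consumed on the transmit side, which is where the TxBufferGC/SendQueuedPackets savings would come from: queuing a packet costs a memcpy plus a ring slot, and the grant ref that travels with the page is reused as-is. The xennet_info context and the XenNet_* helper names are hypothetical; the ring types and macros are the standard netif/ring ones from the Xen public headers.

static VOID XenNet_QueueTxPacket(struct xennet_info *xi, PNDIS_PACKET packet)
{
  POOLED_PAGE *pp = get_pooled_page();   /* already granted to the backend */
  netif_tx_request_t *tx;
  ULONG length;

  /* linearize: copy the NDIS buffer chain into the single pool page */
  length = XenNet_CopyPacketToPage(packet, pp->va);

  tx = RING_GET_REQUEST(&xi->tx, xi->tx.req_prod_pvt);
  tx->gref = pp->gref;       /* reuse the grant issued when the page was pooled */
  tx->offset = 0;
  tx->size = (uint16_t)length;
  tx->flags = 0;
  tx->id = XenNet_SaveTxContext(xi, pp, packet); /* so TxBufferGC can find it later */
  xi->tx.req_prod_pvt++;
}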
James,

Could you please provide me some context and details of this work. This seems related to the work we are doing in netchannel2 to reuse grants, but I don't think I understand what it is that you are trying to do and how it is related.

Thanks
Renato
> James,
>
> Could you please provide me some context and details of this work.
> This seems related to the work we are doing in netchannel2 to reuse
> grants, but I don't think I understand what it is that you are trying
> to do and how it is related.

The solution I ended up implementing was to keep a list of pre-allocated, pre-granted pages. Any time we need a new page (either for putting on the rx list, or for copying a tx packet to) it comes from the list. If there are no pages on the list, a new page is allocated and granted. When we are finished with the page, it goes back on the free list.

I'll also be writing some sort of garbage collector which runs periodically (maybe every x seconds, or every x calls to 'put_page_on_freelist'). If during that interval the number of free pages has been constantly above some threshold (32?), then we will ungrant and free half the pages on the list. This will keep memory usage reasonable while keeping performance good.

In the tx path, the windows xennet driver currently takes the sg list of buffers per packet and copies them to a single page buffer. At first I thought there would be some performance to be had in just passing the backend the list of pages, but it looks like the memory copy operation is much less expensive than the grant operation.

James
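A sketch of the garbage collector described above, continuing with the same hypothetical freelist structures: it runs periodically, and if the freelist never dipped below the threshold during the interval it ungrants and frees half of the pooled pages. The 32-page threshold is the guess from the mail, and EndAccess stands in for whatever the driver actually calls to revoke a grant.

#define FREELIST_THRESHOLD 32

static VOID trim_freelist(void)
{
  POOLED_PAGE *victims = NULL;
  KIRQL old_irql;

  KeAcquireSpinLock(&freelist_lock, &old_irql);
  if (freelist_min > FREELIST_THRESHOLD) {
    LONG to_free = freelist_count / 2;
    while (to_free-- && freelist) {
      POOLED_PAGE *pp = freelist;     /* detach half the list under the lock */
      freelist = pp->next;
      freelist_count--;
      pp->next = victims;
      victims = pp;
    }
  }
  freelist_min = freelist_count;      /* start a new measurement interval */
  KeReleaseSpinLock(&freelist_lock, old_irql);

  while (victims) {                   /* ungrant and free outside the lock */
    POOLED_PAGE *pp = victims;
    victims = pp->next;
    EndAccess(pp->gref);
    ExFreePoolWithTag(pp->va, 'PnuX');
    ExFreePoolWithTag(pp, 'PnuX');
  }
}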
James,

Thanks for the reply. It seems that your changes do not include netback, i.e. all the changes are limited to netfront. Correct? In that case, your changes avoid the cost of issuing and revoking the grant (i.e. adding and removing the grant from the grant table). I assume netback is still doing hypercalls for grant operations on every I/O operation (i.e. grant map for TX and grant copy operation for RX). In netchannel2 we plan to avoid the grant operations in netback as well.

In my experiments I also see overheads on issuing and revoking grants due to the use of atomic operations, but these are much less expensive than copying an entire packet as you do on the TX path. I am surprised by your results. Can you give more details about your configuration and how you are comparing the cost of copy versus issuing grants on TX?

Thanks
Renato
> In my experiments I also see overheads on issuing and revoking grants
> due to the use of atomic operations, but these are much less expensive
> than copying an entire packet as you do on the TX path. I am surprised
> by your results. Can you give more details about your configuration and
> how you are comparing the cost of copy versus issuing grants on TX?

I think you are right in saying that the cost of issuing and revoking grants is due to the use of atomic operations. Having looked into it some more, it looks like KeAcquireSpinLock (the windows lock operation) is fairly expensive.

Under windows, it is the code that gets the next free ref that is protected by spinlocks. I believe that if we only get the ref once, but then reuse that ref over and over, then we'd get a lot better performance.

James
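One way to read "get the ref once, reuse it over and over" in concrete terms, assuming the classic grant-table entry layout and flag names from the Xen public headers: the spinlocked free-ref stack is only touched when a ref is first claimed, and after that only the shared grant entry itself is rewritten per use. This is a sketch, not the actual driver code; a production version would need an atomic test-and-clear of the flags when revoking.

static void regrant_page(grant_entry_t *shared_table, grant_ref_t ref,
                         domid_t otherend, uint32_t mfn, BOOLEAN readonly)
{
  grant_entry_t *entry = &shared_table[ref];

  entry->frame = mfn;
  entry->domid = otherend;
  KeMemoryBarrier();       /* frame/domid must be visible before flags */
  entry->flags = GTF_permit_access | (readonly ? GTF_readonly : 0);
}

static BOOLEAN end_access_keep_ref(grant_entry_t *shared_table, grant_ref_t ref)
{
  grant_entry_t *entry = &shared_table[ref];

  if (entry->flags & (GTF_reading | GTF_writing))
    return FALSE;          /* backend still has the page mapped */
  entry->flags = 0;        /* the ref stays allocated, ready for the next reuse */
  return TRUE;
}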
Hi Renato,

Santos, Jose Renato G wrote:
> James,
>
> Could you please provide me some context and details of this work.
> This seems related to the work we are doing in netchannel2 to reuse grants,

Is there any update on this work since the last Xen Summit? I'm particularly interested in anything that has happened with multi-queue NICs. Two reasons:

1/ Our NIC is multiqueue and I'd like to ensure we're compatible with any Netchannel2 work done to exploit this.

2/ Our 'accelerated plugin' implementation could potentially make use of some of the skb allocator mods to avoid a copy on rx.

Cheers,

Greg
--
Greg Law    glaw@solarflare.com    +44 1223 518 040
Santos, Jose Renato G | 2008-Mar-03 18:35 UTC | RE: [Xen-devel] Netchannel2 (was more profiling)
Greg,

We are currently working on support for multi-queue NICs. This will be the first feature available. Grant reuse and copy on the guest will come later. We don't have code available yet, but we have a spec of the interface between netback and the device driver for multi-queue support. It would be good if your drivers supported that interface. I will send you the spec later in a private email.

Regards
Renato
> I think you are right in saying that the cost of issuing and revoking
> grants is due to the use of atomic operations. Having looked into it
> some more, it looks like KeAcquireSpinLock (the windows lock operation)
> is fairly expensive.
>
> Under windows, it is the code that gets the next free ref that is
> protected by spinlocks. I believe that if we only get the ref once, but
> then reuse that ref over and over, then we'd get a lot better
> performance.

Yes. Avoiding the spinlock should improve performance. Definitely, it should be a win on the RX path. But is it worth it in the TX path, if you now have to copy the packet? Do you have experimental data showing that copying is better than the spinlock? I don't have much experience with Windows, but I think this would be very surprising...

Regards
Renato
> > Under windows, it is the code that gets the next free ref that is
> > protected by spinlocks. I believe that if we only get the ref once,
> > but then reuse that ref over and over, then we'd get a lot better
> > performance.
>
> Yes. Avoiding the spinlock should improve performance. Definitely, it
> should be a win on the RX path. But is it worth it in the TX path, if
> you now have to copy the packet? Do you have experimental data showing
> that copying is better than the spinlock? I don't have much experience
> with Windows, but I think this would be very surprising...

Looking at the profiling data that I have collected, the copy operation (max 1500 bytes, probably around 200-300 bytes on average) does appear to consume less CPU time than the acquire spinlock operation.

The other thing to consider is that Windows seems to give us 2-3 separate pages of data per packet, one containing the header, another containing the next header, and one containing the layer 3 data. This would be three get-grant-entry operations. However, if I can avoid the spinlock-per-grant-entry, and also avoid copying, then things will be even better!

Btw, with 'request-rx-copy = 1', does that mean that the backend still makes copies of the data to give to us?

James
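For reference, a sketch of the linearize step being weighed against those three grant operations: the 2-3 NDIS buffers that make up a packet are copied into a single destination page (for example a pre-granted pool page), so only one grant entry is needed per packet. The function name matches the hypothetical helper used in the earlier transmit sketch; the NDIS calls themselves are standard NDIS 5 APIs.

static ULONG XenNet_CopyPacketToPage(PNDIS_PACKET packet, PUCHAR dest)
{
  PNDIS_BUFFER buffer;
  PVOID va;
  UINT buffer_length, total_length, offset = 0;

  NdisQueryPacket(packet, NULL, NULL, &buffer, &total_length);
  while (buffer) {
    NdisQueryBufferSafe(buffer, &va, &buffer_length, NormalPagePriority);
    if (!va || offset + buffer_length > PAGE_SIZE)
      return 0;                        /* caller must fail the send */
    NdisMoveMemory(dest + offset, va, buffer_length);
    offset += buffer_length;
    NdisGetNextBuffer(buffer, &buffer);
  }
  return offset;                       /* bytes copied into the single page */
}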