thr3ads.net - Xen devel - [Xen-devel][PATCH] netback: netif cleanup [Dec 2009]

If this information is useful, please help other people find it:
Share via:

Ky Srinivasan

2009-Dec-06 17:18 UTC

[Xen-devel][PATCH] netback: netif cleanup

We have seen cases where,  when a guest is shutdown during heavy network
traffic, the backend vif hangs around for considerable amounts of time holding
up the xenbus thread for the duration of the cleanup. We have seen cases (HP,
Novell) where it has taken hours to cleanup the interface and for that duration
we cannot launch new guests. We have traced this problem to the fact that the
front-end can shutdown with transmit resources still in transit on the back-end
side. On the back-end side, the moment the carrier is turned off on the
interface, the interface is not eligible for further scheduling in the backend.
Since turning off the carrier happens completely asynchronously with the rest of
the packet processing in the back-end (via tasklets today), we can have
situations where the vif cannot be cleaned up for a significant amount of time.

I am attaching a patch for addressing this issue. I have tried to fix the
problem with minimal changes to the current code. The main idea is to cleanup
the guest transmit resources quickly and aggressively. With this patch, the
cleanup time has been dramatically reduced. Even in this patch, the xenbus
thread is held up for the duration of the cleanup - however we attempt to
cleanup aggressively. This is in comparison to the current code where the xenbus
thread just waits for the refcnt  on the interface to drop to zero. An obvious
improvement would be to not tie up the xenbus thread for the duration of
cleaning up the interface. If there is interest, I can submit a patch for doing
this. This patch has been tested on 2.6.16 (SLES10 sp2, sp3 kernels) as well as
2.6.27 (sles11) kernel. This  patch is however against the 2.6.18-xen.hg
sources.

Signed-off-by: K. Y. Srinivasan <ksrinivasan@novell.com>



Regards,

K. Y



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Keir Fraser

2009-Dec-06 21:44 UTC

head link

Re: [Xen-devel][PATCH] netback: netif cleanup

This looks like a hack. I think you''d get most of the benefit by simply
always calling net_tx_action_dealloc() at the top of net_tx_action() (i.e.,
remove the dealloc_cons!=dealloc_prod conditional). I think the presence of
that conditional is actually a bug, and removing it should get interface
shutdowns to a reasonable time with no other changes.

 -- Keir

On 06/12/2009 17:18, "Ky Srinivasan" <ksrinivasan@novell.com>
wrote:
> We have seen cases where,  when a guest is shutdown during heavy network
> traffic, the backend vif hangs around for considerable amounts of time
holding
> up the xenbus thread for the duration of the cleanup. We have seen cases
(HP,
> Novell) where it has taken hours to cleanup the interface and for that
> duration we cannot launch new guests. We have traced this problem to the
fact
> that the front-end can shutdown with transmit resources still in transit on
> the back-end side. On the back-end side, the moment the carrier is turned
off
> on the interface, the interface is not eligible for further scheduling in
the
> backend. Since turning off the carrier happens completely asynchronously
with
> the rest of the packet processing in the back-end (via tasklets today), we
can
> have situations where the vif cannot be cleaned up for a significant amount
of
> time.
> 
> I am attaching a patch for addressing this issue. I have tried to fix the
> problem with minimal changes to the current code. The main idea is to
cleanup
> the guest transmit resources quickly and aggressively. With this patch, the
> cleanup time has been dramatically reduced. Even in this patch, the xenbus
> thread is held up for the duration of the cleanup - however we attempt to
> cleanup aggressively. This is in comparison to the current code where the
> xenbus thread just waits for the refcnt  on the interface to drop to zero.
An
> obvious improvement would be to not tie up the xenbus thread for the
duration
> of cleaning up the interface. If there is interest, I can submit a patch
for
> doing this. This patch has been tested on 2.6.16 (SLES10 sp2, sp3 kernels)
as
> well as 2.6.27 (sles11) kernel. This  patch is however against the
> 2.6.18-xen.hg sources.
> 
> Signed-off-by: K. Y. Srinivasan <ksrinivasan@novell.com>
> 
> 
> 
> Regards,
> 
> K. Y
> 


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Ky Srinivasan

2009-Dec-06 23:15 UTC

head link

Re: [Xen-devel][PATCH] netback: netif cleanup

>>> On 12/6/2009 at  4:44 PM, in message
<C741D648.3B7A%keir.fraser@eu.citrix.com>,
Keir Fraser <keir.fraser@eu.citrix.com> wrote: > This looks like a hack. I think you''d get most of the benefit by
simply
> always calling net_tx_action_dealloc() at the top of net_tx_action() (i.e.,
> remove the dealloc_cons!=dealloc_prod conditional). I think the presence of
> that conditional is actually a bug, and removing it should get interface
> shutdowns to a reasonable time with no other changes.That was my first impression as well and was the first patch we tested. It does
not work. Take the case where we have only one netif that netback is handling
that gets into this weird state, then we would never call
net_tx_action_dealloc() if the carrier is turned off on the netif that is being
cleaned up. This patch, fixes that problem by forcefully scheduling the netif
whose carrier has been turned off. In the interest of minimizing the code
changes, I went about making these changes one at a time, and the combination of
what I have in the patch appears to reduce the cleanup time the most. Even with
these hacks, I am not particularly happy about the maximum time we have seen the
cleanup take, and for this duration the physical host is unusable for launching
new guests. So, as I noted in my earlier email, we may still want to not tie up
the xenbus thread for the duration of this cleanup.

Regards,

K. Y> 
>  -- Keir
> 
> On 06/12/2009 17:18, "Ky Srinivasan"
<ksrinivasan@novell.com> wrote:
> 
>> We have seen cases where,  when a guest is shutdown during heavy
network
>> traffic, the backend vif hangs around for considerable amounts of time 
> holding
>> up the xenbus thread for the duration of the cleanup. We have seen
cases
> (HP,
>> Novell) where it has taken hours to cleanup the interface and for that
>> duration we cannot launch new guests. We have traced this problem to
the
> fact
>> that the front-end can shutdown with transmit resources still in
transit on
>> the back-end side. On the back-end side, the moment the carrier is
turned off
>> on the interface, the interface is not eligible for further scheduling
in
> the
>> backend. Since turning off the carrier happens completely
asynchronously
> with
>> the rest of the packet processing in the back-end (via tasklets today),
we
> can
>> have situations where the vif cannot be cleaned up for a significant
amount
> of
>> time.
>> 
>> I am attaching a patch for addressing this issue. I have tried to fix
the
>> problem with minimal changes to the current code. The main idea is to 
> cleanup
>> the guest transmit resources quickly and aggressively. With this patch,
the
>> cleanup time has been dramatically reduced. Even in this patch, the
xenbus
>> thread is held up for the duration of the cleanup - however we attempt
to
>> cleanup aggressively. This is in comparison to the current code where
the
>> xenbus thread just waits for the refcnt  on the interface to drop to
zero.
> An
>> obvious improvement would be to not tie up the xenbus thread for the 
> duration
>> of cleaning up the interface. If there is interest, I can submit a
patch for
>> doing this. This patch has been tested on 2.6.16 (SLES10 sp2, sp3
kernels)
> as
>> well as 2.6.27 (sles11) kernel. This  patch is however against the
>> 2.6.18-xen.hg sources.
>> 
>> Signed-off-by: K. Y. Srinivasan <ksrinivasan@novell.com>
>> 
>> 
>> 
>> Regards,
>> 
>> K. Y
>> 


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Keir Fraser

2009-Dec-07 08:04 UTC

head link

Re: [Xen-devel][PATCH] netback: netif cleanup

On 06/12/2009 23:15, "Ky Srinivasan" <ksrinivasan@novell.com>
wrote:
>> This looks like a hack. I think you''d get most of the benefit
by simply
>> always calling net_tx_action_dealloc() at the top of net_tx_action()
(i.e.,
>> remove the dealloc_cons!=dealloc_prod conditional). I think the
presence of
>> that conditional is actually a bug, and removing it should get
interface
>> shutdowns to a reasonable time with no other changes.
> That was my first impression as well and was the first patch we tested. It
> does not work. Take the case where we have only one netif that netback is
> handling that gets into this weird state, then we would never call
> net_tx_action_dealloc() if the carrier is turned off on the netif that is
> being cleaned up. This patch, fixes that problem by forcefully scheduling
the
> netif whose carrier has been turned off.
I think that this is simply another bug. There is an early
''return'' in
net_tx_action() which should actually goto the end of the function which
checks the pending list and conditionally calls mod_timer(). Otherwise the
pending list can get partially flushed but the timer doesn''t get
re-mod()ed.
> In the interest of minimizing the
> code changes, I went about making these changes one at a time, and the
> combination of what I have in the patch appears to reduce the cleanup time
the
> most. Even with these hacks, I am not particularly happy about the maximum
> time we have seen the cleanup take, and for this duration the physical host
is
> unusable for launching new guests. So, as I noted in my earlier email, we
may
> still want to not tie up the xenbus thread for the duration of this
cleanup.
Perhaps we could decouple the whole thing from xenbus thread. In any case,
the above bug fixes are to me preferable to the original patch, and will get
the orders-of-magnitude low-hanging fruit (delay xenbus thread for maybe a
second, rather than hours!).

 -- Keir



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Ky Srinivasan

2009-Dec-07 15:24 UTC

head link

Re: [Xen-devel][PATCH] netback: netif cleanup

>>> On 12/7/2009 at  3:04 AM, in message
<C7426771.3BA0%keir.fraser@eu.citrix.com>,
Keir Fraser <keir.fraser@eu.citrix.com> wrote: > On 06/12/2009 23:15, "Ky Srinivasan"
<ksrinivasan@novell.com> wrote:
> 
>>> This looks like a hack. I think you''d get most of the
benefit by simply
>>> always calling net_tx_action_dealloc() at the top of
net_tx_action() (i.e.,
>>> remove the dealloc_cons!=dealloc_prod conditional). I think the
presence of
>>> that conditional is actually a bug, and removing it should get
interface
>>> shutdowns to a reasonable time with no other changes.
>> That was my first impression as well and was the first patch we tested.
It
>> does not work. Take the case where we have only one netif that netback
is
>> handling that gets into this weird state, then we would never call
>> net_tx_action_dealloc() if the carrier is turned off on the netif that
is
>> being cleaned up. This patch, fixes that problem by forcefully
scheduling
> the
>> netif whose carrier has been turned off.
> 
> I think that this is simply another bug. There is an early
''return'' in
> net_tx_action() which should actually goto the end of the function which
> checks the pending list and conditionally calls mod_timer(). Otherwise the
> pending list can get partially flushed but the timer doesn''t get
re-mod()ed.
> 
I suspect you are referring to the conditional early return ( if
(mop==tx_map_ops)) . Anna Fischer of HP has a test environment where we can see
this problem most easily. I will give her a netback with these changes and
report back on the cleanup times.

Thanks,

K. Y
 >> In the interest of minimizing the
>> code changes, I went about making these changes one at a time, and the
>> combination of what I have in the patch appears to reduce the cleanup
time
> the
>> most. Even with these hacks, I am not particularly happy about the
maximum
>> time we have seen the cleanup take, and for this duration the physical
host
> is
>> unusable for launching new guests. So, as I noted in my earlier email,
we
> may
>> still want to not tie up the xenbus thread for the duration of this
cleanup.
> 
> Perhaps we could decouple the whole thing from xenbus thread. In any case,
> the above bug fixes are to me preferable to the original patch, and will
get
> the orders-of-magnitude low-hanging fruit (delay xenbus thread for maybe a
> second, rather than hours!).
> 
>  -- Keir


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Xen devel - Dec 2009 - [PATCH] netback: netif cleanup

[Xen-devel][PATCH] netback: netif cleanup

Re: [Xen-devel][PATCH] netback: netif cleanup

Re: [Xen-devel][PATCH] netback: netif cleanup

Re: [Xen-devel][PATCH] netback: netif cleanup

Re: [Xen-devel][PATCH] netback: netif cleanup