thr3ads.net - Linux Virtualization - [summary] virtio network device failover writeup [Mar 2019]


Liran Alon

2019-Mar-19 12:38 UTC

[summary] virtio network device failover writeup

Hi Michael,

Great blog post, which summarises everything very well!

Some comments I have:

1) I think that when we use the term "1-netdev model" in community
discussions, we tend to refer to what you have defined in the blog post as
the "3-device model with hidden slaves".
Therefore, I would suggest simply removing the "1-netdev model" section and
renaming the "3-device model with hidden slaves" section to "1-netdev model".

2) The userspace issues result from using both the "2-netdev model" and the
"3-netdev model". However, the blog post describes them as if they exist only
in the "3-netdev model".
The reason these issues are not seen in the Azure environment is that
Microsoft partially handled them for its specific 2-netdev model.
Which leads me to the next comment.

3) I suggest that the blog post also elaborate on what exactly the userspace
issues are that arise in models other than the "1-netdev model".
The issues that I'm aware of are (please tell me if you are aware of others!):
(a) udev rename race condition: When the net-failover device is opened, it
also opens its slaves. However, udev receives the KOBJ_ADD events first for
the net-failover netdev and only then for the virtio-net netdev. This means
that if userspace responds to the first event by opening the net-failover
netdev, then any attempt by userspace to rename the virtio-net netdev in
response to the second event will fail, because the virtio-net netdev is
already opened. Also note that this udev rename rule is useful because we
would like to add rules that rename the virtio-net netdev to clearly signal
that it's used as the standby interface of another net-failover netdev.
Microsoft worked around this problem in NetVSC by delaying the open of the
slave VF relative to the open of the NetVSC netdev. However, this is still a
race and thus a hacky solution. It was accepted by the community only because
it's internal to the NetVSC driver. A similar solution was rejected by the
community for the net-failover driver.
The solution we currently propose (patch by Si-Wei) is to change the kernel's
rename handling to allow a net-failover slave to be renamed even if it is
already opened. The patch has not been accepted yet.
(b) Issues caused by various userspace components running DHCP on the
net-failover slaves: DHCP should of course only be done on the net-failover
netdev. Attempting DHCP on the net-failover slaves as well will cause
networking issues. Therefore, userspace components should be taught to avoid
doing DHCP on the net-failover slaves. The various userspace components include:
b.1) dhclient: If run without parameters, by default it just enumerates all
netdevs and attempts DHCP on all of them.
(I don't think Microsoft has handled this.)
b.2) initramfs / dracut: In order to mount the root file system over iSCSI,
these components need networking and therefore run DHCP on all netdevs.
(Microsoft hasn't handled (b.2) because they don't have images that perform
iSCSI boot in their Azure setup. Still an open issue.)
b.3) cloud-init: If configured to perform network configuration, it attempts
to configure all available netdevs. However, it should avoid doing so on
net-failover slaves.
(Microsoft has handled this by adding a mechanism in cloud-init to blacklist a
netdev from being configured if it is owned by a specific PCI driver.
Specifically, they blacklist the Mellanox VF driver. However, this technique
doesn't work for the net-failover mechanism, because both the net-failover
netdev and the virtio-net netdev are owned by the virtio-net PCI driver.)
b.4) The network managers of various distros may need to be updated to avoid
DHCP on net-failover slaves? (Not sure. Asking...)
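
The event ordering in (a) can be illustrated with a toy model. This is only a
sketch and not kernel code: the Netdev class, the device names (eth0, eth1,
eth0_standby) and the allow_open_slave_rename flag are all hypothetical. The
only behaviour modelled is the one described above, namely that renaming an
already-opened netdev fails, and that the proposed change would exempt
net-failover slaves from that rule.

```python
import errno


class Netdev:
    """Toy stand-in for a kernel netdev; hypothetical, not real kernel code."""

    def __init__(self, name, failover_slave=False):
        self.name = name
        self.up = False
        self.failover_slave = failover_slave

    def rename(self, new_name, allow_open_slave_rename=False):
        # Renaming an opened netdev fails, unless the proposed (not yet
        # accepted) exception for net-failover slaves is modelled as enabled.
        if self.up and not (allow_open_slave_rename and self.failover_slave):
            raise OSError(errno.EBUSY, "netdev is open", self.name)
        self.name = new_name


def simulate(allow_open_slave_rename):
    """Replay the two KOBJ_ADD events and return (standby name, error)."""
    failover = Netdev("eth0")
    standby = Netdev("eth1", failover_slave=True)

    # Event 1 (net-failover netdev): userspace opens it, which also opens
    # its slaves.
    failover.up = True
    standby.up = True

    # Event 2 (virtio-net netdev): a udev rule tries to rename the standby.
    try:
        standby.rename("eth0_standby", allow_open_slave_rename)
    except OSError as exc:
        return standby.name, errno.errorcode[exc.errno]
    return standby.name, None
```

Calling simulate(False) models today's behaviour, where the rename loses the
race, and simulate(True) models the behaviour the proposed patch aims for.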

4) Another interesting use case where the net-failover mechanism is useful is
handling NIC firmware failures or NIC firmware live upgrade.
In both cases, a full PCIe reset of the NIC is needed, which loses all the
NIC eSwitch configuration of the various VFs.
To handle these cases gracefully, one could just hot-unplug all VFs from the
guests running on the host (which makes all guests use the virtio-net netdev,
which is backed by a netdev that is eventually on top of the PF). Networking
will therefore be restored to the guests once the PCIe reset is completed and
the PF is functional again. To re-accelerate the guests' networking, the
hypervisor can just hot-plug new VFs to the guests.

P.S.:
I would very much appreciate this forum's help in closing the pending items
listed in (3), which currently prevent using the net-failover mechanism in
real production use cases.

Regards,
-Liran
> On 17 Mar 2019, at 15:55, Michael S. Tsirkin <mst at redhat.com> wrote:
> 
> Hi all,
> I've put up a blog post with a summary of where network
> device failover stands and some open issues.
> Not sure where best to host it, I just put it up on blogspot:
>
https://urldefense.proofpoint.com/v2/url?u=https-3A__mstsirkin.blogspot.com_2019_03_virtio-2Dnetwork-2Ddevice-2Dfailover-2Dsupport.html&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=jd0emHx6EkPSTvO0TytfYmG4rOMQ9htenhrgKprrh9E&s=5EJamlc_g1lZa0Ga7K30E6aWVg3jy8lizhw1aSguo3A&e>
> Comments, corrections are welcome!
> 
> -- 
> MST

Michael S. Tsirkin

2019-Mar-19 21:19 UTC

[summary] virtio network device failover writeup

On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> On Tue, 19 Mar 2019 14:38:06 +0200
> Liran Alon <liran.alon at oracle.com> wrote:
> 
> > b.3) cloud-init: If configured to perform network configuration, it
> > attempts to configure all available netdevs. However, it should avoid
> > doing so on net-failover slaves.
> > (Microsoft has handled this by adding a mechanism in cloud-init to
> > blacklist a netdev from being configured if it is owned by a specific PCI
> > driver. Specifically, they blacklist the Mellanox VF driver. However, this
> > technique doesn't work for the net-failover mechanism, because both the
> > net-failover netdev and the virtio-net netdev are owned by the virtio-net
> > PCI driver.)
> 
> Cloud-init should really just ignore all devices that have a master device.
> That would have been more general, and safer for other use cases.
Given lots of userspace doesn't do this, I wonder whether it would be
safer to just somehow pretend to userspace that the slave links are
down? And add a special attribute for the actual link state.
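
Stephen's suggestion of ignoring devices that have a master device could be
sketched as below. This is a minimal illustration, not what cloud-init or any
other tool actually does. It assumes that a slave netdev exposes a "master"
entry under its sysfs directory, as bond and bridge members do; whether
net-failover slaves do so in a given kernel should be verified. The function
name dhcp_candidates and its sysfs_net parameter are made up for this example.

```python
import os


def dhcp_candidates(sysfs_net="/sys/class/net"):
    """Return the names of netdevs that have no master device, sorted."""
    candidates = []
    for name in sorted(os.listdir(sysfs_net)):
        dev_dir = os.path.join(sysfs_net, name)
        if not os.path.isdir(dev_dir):
            continue  # skip plain files such as bonding_masters
        # Assumption: a slave carries a "master" entry pointing at its
        # upper device, so its presence means "do not configure this one".
        if os.path.lexists(os.path.join(dev_dir, "master")):
            continue
        candidates.append(name)
    return candidates
```

Unlike a filter keyed on the owning PCI driver, this check does not care
that the net-failover netdev and its virtio-net standby share a driver. Note
that it still returns the loopback device, which a real tool would filter
separately.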

-- 
MST

Liran Alon

2019-Mar-19 23:06 UTC

[summary] virtio network device failover writeup

> On 19 Mar 2019, at 23:06, Michael S. Tsirkin <mst at redhat.com> wrote:
> 
> On Tue, Mar 19, 2019 at 02:38:06PM +0200, Liran Alon wrote:
>> Hi Michael,
>> 
>> Great blog post, which summarises everything very well!
>> 
>> Some comments I have:
> 
> Thanks!
> I'll try to update everything in the post when I'm not so jet-lagged.
> 
>> 1) I think that when we use the term "1-netdev model" in community
>> discussions, we tend to refer to what you have defined in the blog post as
>> the "3-device model with hidden slaves".
>> Therefore, I would suggest simply removing the "1-netdev model" section and
>> renaming the "3-device model with hidden slaves" section to "1-netdev
>> model".
>> 
>> 2) The userspace issues result from using both the "2-netdev model" and
>> the "3-netdev model". However, the blog post describes them as if they
>> exist only in the "3-netdev model".
>> The reason these issues are not seen in the Azure environment is that
>> Microsoft partially handled them for its specific 2-netdev model.
>> Which leads me to the next comment.
>> 
>> 3) I suggest that the blog post also elaborate on what exactly the
>> userspace issues are that arise in models other than the "1-netdev model".
>> The issues that I'm aware of are (please tell me if you are aware of
>> others!):
>> (a) udev rename race condition: When the net-failover device is opened, it
>> also opens its slaves. However, udev receives the KOBJ_ADD events first for
>> the net-failover netdev and only then for the virtio-net netdev. This means
>> that if userspace responds to the first event by opening the net-failover
>> netdev, then any attempt by userspace to rename the virtio-net netdev in
>> response to the second event will fail, because the virtio-net netdev is
>> already opened. Also note that this udev rename rule is useful because we
>> would like to add rules that rename the virtio-net netdev to clearly signal
>> that it's used as the standby interface of another net-failover netdev.
>> Microsoft worked around this problem in NetVSC by delaying the open of the
>> slave VF relative to the open of the NetVSC netdev. However, this is still
>> a race and thus a hacky solution. It was accepted by the community only
>> because it's internal to the NetVSC driver. A similar solution was rejected
>> by the community for the net-failover driver.
>> The solution we currently propose (patch by Si-Wei) is to change the
>> kernel's rename handling to allow a net-failover slave to be renamed even
>> if it is already opened. The patch has not been accepted yet.
>> (b) Issues caused by various userspace components running DHCP on the
>> net-failover slaves: DHCP should of course only be done on the net-failover
>> netdev. Attempting DHCP on the net-failover slaves as well will cause
>> networking issues. Therefore, userspace components should be taught to
>> avoid doing DHCP on the net-failover slaves. The various userspace
>> components include:
>> b.1) dhclient: If run without parameters, by default it just enumerates all
>> netdevs and attempts DHCP on all of them.
>> (I don't think Microsoft has handled this.)
>> b.2) initramfs / dracut: In order to mount the root file system over iSCSI,
>> these components need networking and therefore run DHCP on all netdevs.
>> (Microsoft hasn't handled (b.2) because they don't have images that perform
>> iSCSI boot in their Azure setup. Still an open issue.)
>> b.3) cloud-init: If configured to perform network configuration, it
>> attempts to configure all available netdevs. However, it should avoid
>> doing so on net-failover slaves.
>> (Microsoft has handled this by adding a mechanism in cloud-init to
>> blacklist a netdev from being configured if it is owned by a specific PCI
>> driver. Specifically, they blacklist the Mellanox VF driver. However, this
>> technique doesn't work for the net-failover mechanism, because both the
>> net-failover netdev and the virtio-net netdev are owned by the virtio-net
>> PCI driver.)
>> b.4) The network managers of various distros may need to be updated to
>> avoid DHCP on net-failover slaves? (Not sure. Asking...)
>> 
>> 4) Another interesting use case where the net-failover mechanism is useful
>> is handling NIC firmware failures or NIC firmware live upgrade.
>> In both cases, a full PCIe reset of the NIC is needed, which loses all the
>> NIC eSwitch configuration of the various VFs.
> 
> In this setup, how does VF keep going? If it doesn't keep going, why is
> it helpful?
Let me attempt to clarify.

First, let's analyse what a cloud provider can do when it wishes to upgrade
the NIC firmware while there are running guests utilising SR-IOV.
It can perform the following operations in order:
1) Hot-unplug all VFs from all running guests.
2) Upgrade the NIC firmware. This results in a PCIe reset, which causes
momentary network downtime on the PF, but immediately afterwards the PF is set
up again and guests have network connectivity.
3) Provision and hot-plug new VFs for all running guests. Guests again have
accelerated networking.

Without the net-failover mechanism, the host would have to hot-unplug all VFs
from all running guests, then provision and hot-plug new VFs anyway. But in
that case, the network downtime for guests is longer.

Second, let's analyse what happens when a health service running on the host
notices that the NIC firmware is in a bad state and that the NIC should
therefore be reset to recover.
The health service can follow exactly the same order of operations as
described above, except that (2) becomes just a PCIe reset.
Again, guests have shorter network downtime in this case as well when
utilising the net-failover mechanism.
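
The order of operations above can be sketched as a small simulation. It is a
toy model under stated assumptions, not a real orchestration API: the Guest
class, the reset_nic function and the phase labels are all hypothetical, and
the model assumes a guest without a VF has connectivity through virtio-net
exactly when the PF is up.

```python
from dataclasses import dataclass


@dataclass
class Guest:
    """Toy model of a guest with a net-failover netdev (hypothetical)."""
    name: str
    vf_plugged: bool = True

    def datapath(self, pf_up: bool) -> str:
        # The guest uses the VF when it is present; otherwise it falls back
        # to the virtio-net standby, which only works while the PF is up.
        if self.vf_plugged:
            return "vf"
        return "virtio-net" if pf_up else "down"


def reset_nic(guests):
    """Run the three steps and record every guest's datapath per phase."""
    timeline = []

    def record(phase, pf_up):
        timeline.append((phase, [g.datapath(pf_up) for g in guests]))

    # 1) Hot-unplug all VFs; guests fall back to virtio-net over the PF.
    for g in guests:
        g.vf_plugged = False
    record("VFs unplugged", pf_up=True)

    # 2) Firmware upgrade or recovery: the PCIe reset takes the PF down
    #    briefly, then the PF comes back.
    record("PCIe reset in progress", pf_up=False)
    record("PF back up", pf_up=True)

    # 3) Provision and hot-plug new VFs; acceleration is restored.
    for g in guests:
        g.vf_plugged = True
    record("new VFs plugged", pf_up=True)
    return timeline
```

In this model the only phase without connectivity is the PCIe reset itself,
which matches the claim that guests see a shorter outage than if they had to
wait for new VFs to be provisioned and hot-plugged.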
> 
>> To handle these cases gracefully, one could just hot-unplug all VFs from
>> the guests running on the host (which makes all guests use the virtio-net
>> netdev, which is backed by a netdev that is eventually on top of the PF).
>> Networking will therefore be restored to the guests once the PCIe reset is
>> completed and the PF is functional again. To re-accelerate the guests'
>> networking, the hypervisor can just hot-plug new VFs to the guests.
>> 
>> P.S.:
>> I would very much appreciate this forum's help in closing the pending items
>> listed in (3), which currently prevent using the net-failover mechanism in
>> real production use cases.
>> 
>> Regards,
>> -Liran
>> 
>>> On 17 Mar 2019, at 15:55, Michael S. Tsirkin <mst at redhat.com> wrote:
>>> 
>>> Hi all,
>>> I've put up a blog post with a summary of where network
>>> device failover stands and some open issues.
>>> Not sure where best to host it, I just put it up on blogspot:
>>>
https://urldefense.proofpoint.com/v2/url?u=https-3A__mstsirkin.blogspot.com_2019_03_virtio-2Dnetwork-2Ddevice-2Dfailover-2Dsupport.html&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=jd0emHx6EkPSTvO0TytfYmG4rOMQ9htenhrgKprrh9E&s=5EJamlc_g1lZa0Ga7K30E6aWVg3jy8lizhw1aSguo3A&e>>>
>>> Comments, corrections are welcome!
>>> 
>>> -- 
>>> MST
