Hi all,
I've put up a blog post with a summary of where network device failover stands and some open issues.
Not sure where best to host it, I just put it up on blogspot:
https://mstsirkin.blogspot.com/2019/03/virtio-network-device-failover-support.html
Comments, corrections are welcome!

-- 
MST
Hi Michael,

Great blog post which summarises everything very well!

Some comments I have:

1) I think that when we use the term "1-netdev model" in community discussions, we tend to refer to what the blog post defines as the "3-device model with hidden slaves". Therefore, I would suggest removing the "1-netdev model" section and renaming the "3-device model with hidden slaves" section to "1-netdev model".

2) The userspace issues result from using both the "2-netdev model" and the "3-netdev model". However, the blog post describes them as if they only exist in the "3-netdev model". The reason these issues are not seen in the Azure environment is that they were partially handled by Microsoft for their specific 2-netdev model. Which leads me to the next comment.

3) I suggest that the blog post also elaborate on what exactly the userspace issues are that lead to models other than the "1-netdev model". The issues that I'm aware of are (please tell me if you are aware of others!):

(a) udev rename race condition: When the net-failover device is opened, it also opens its slaves. However, the order of KOBJ_ADD events to udev is first for the net-failover netdev and only then for the virtio-net netdev. This means that if userspace responds to the first event by opening the net-failover netdev, then any attempt by userspace to rename the virtio-net netdev in response to the second event will fail, because the virtio-net netdev is already opened. Also note that such a udev rename rule is useful because we would like to add rules that rename the virtio-net netdev to clearly signal that it is used as the standby interface of another net-failover netdev. (A toy reproduction of the rename failure is sketched after this list.)
The way this problem was worked around by Microsoft in NetVSC is to delay the open of the slave VF relative to the open of the NetVSC netdev. However, this is still a race and thus a hacky solution. It was accepted by the community only because it is internal to the NetVSC driver. A similar solution was rejected by the community for the net-failover driver.
The solution we currently proposed to address this (patch by Si-Wei) was to change the kernel's rename handling to allow a net-failover slave to be renamed even if it is already opened. The patch is still not accepted.

(b) Issues caused by various userspace components running DHCP on the net-failover slaves: DHCP should of course only be done on the net-failover netdev. Attempting to run DHCP on the net-failover slaves as well will cause networking issues. Therefore, userspace components should be taught to avoid doing DHCP on the net-failover slaves. The various userspace components include:

b.1) dhclient: If run without parameters, it by default just enumerates all netdevs and attempts to DHCP them all.
(I don't think Microsoft has handled this.)

b.2) initramfs / dracut: In order to mount the root file-system from iSCSI, these components need networking and therefore DHCP on all netdevs.
(Microsoft hasn't handled (b.2) because they don't have images which perform iSCSI boot in their Azure setup. Still an open issue.)

b.3) cloud-init: If configured to perform network configuration, it attempts to configure all available netdevs. It should, however, avoid doing so on net-failover slaves.
(Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist the Mellanox VF driver. However, this technique doesn't work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver.)

b.4) The network managers of the various distros need to be updated to avoid DHCP on net-failover slaves? (Not sure. Asking...)
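To make (a) more concrete, here is a rough Python sketch of the failure mode: once the standby interface has been brought up (which is effectively what opening the net-failover master does), the kernel refuses to rename it. The interface name "eth0" is just a placeholder for the virtio-net standby netdev; this is only an illustration of the race, not the proposed fix.

    import subprocess

    def ip(*args):
        # Thin wrapper around the 'ip' utility; requires root.
        return subprocess.run(("ip",) + args, capture_output=True, text=True)

    iface = "eth0"  # placeholder: the virtio-net standby netdev

    # Opening the net-failover master also opens (brings up) its slaves:
    ip("link", "set", "dev", iface, "up")

    # A later udev rename of the already-opened slave now fails:
    r = ip("link", "set", "dev", iface, "name", "standby0")
    print(r.returncode, r.stderr.strip())
    # On kernels without the proposed change this prints a non-zero status and
    # "RTNETLINK answers: Device or resource busy".

Si-Wei's patch essentially relaxes this check for net-failover slaves, so that the rename can succeed even while the slave is already opened.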
4) Another interesting use-case where the net-failover mechanism is useful is handling NIC firmware failures or NIC firmware live-upgrade.
In both cases, there is a need to perform a full PCIe reset of the NIC, which loses all the NIC eSwitch configuration of the various VFs.
To handle these cases gracefully, one could just hot-unplug all VFs from the guests running on the host (which makes all guests fall back to the virtio-net netdev, which is backed by a netdev that is eventually on top of the PF). Networking is therefore restored to the guests once the PCIe reset is completed and the PF is functional again. To re-accelerate the guests' networking, the hypervisor can then just hot-plug new VFs into the guests. (A rough sketch of this flow, using libvirt, appears below the quoted message.)

P.S.:
I would very much appreciate this forum's help in closing on the pending items written in (3), which currently prevent using this net-failover mechanism in real production use-cases.

Regards,
-Liran

> On 17 Mar 2019, at 15:55, Michael S. Tsirkin <mst at redhat.com> wrote:
> 
> Hi all,
> I've put up a blog post with a summary of where network
> device failover stands and some open issues.
> Not sure where best to host it, I just put it up on blogspot:
> https://mstsirkin.blogspot.com/2019/03/virtio-network-device-failover-support.html
> 
> Comments, corrections are welcome!
> 
> -- 
> MST
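As an illustration of the flow in (4), here is a minimal sketch using the libvirt Python bindings. The domain name, PCI address and hostdev XML are made up for illustration; in practice one detaches whatever VF hostdev the guest was given, and the VF hot-plugged afterwards should carry the same MAC as the virtio standby device so net-failover re-enslaves it.

    import libvirt

    # Hypothetical VF passed through to the guest; adjust the PCI address.
    VF_HOSTDEV_XML = """
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0xaf' slot='0x00' function='0x2'/>
      </source>
    </hostdev>
    """

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("guest1")  # hypothetical guest name

    # 1) Hot-unplug the VF: the guest's failover netdev falls back to the
    #    virtio-net standby path (backed by the PF through the host's vswitch).
    dom.detachDeviceFlags(VF_HOSTDEV_XML, libvirt.VIR_DOMAIN_AFFECT_LIVE)

    # 2) ...perform the PCIe reset / firmware upgrade of the NIC on the host...

    # 3) Hot-plug a VF again: the guest's failover netdev switches back to it.
    dom.attachDeviceFlags(VF_HOSTDEV_XML, libvirt.VIR_DOMAIN_AFFECT_LIVE)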
On Tue, 19 Mar 2019 14:38:06 +0200 Liran Alon <liran.alon at oracle.com> wrote:

> b.3) cloud-init: If configured to perform network configuration, it attempts to configure all available netdevs. It should, however, avoid doing so on net-failover slaves.
> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist the Mellanox VF driver. However, this technique doesn't work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver.)

Cloud-init should really just ignore all devices that have a master device. That would have been more general, and safer for other use cases.
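In the spirit of that suggestion, a minimal sketch of such a check might look like the following. It assumes a slave exposes a "master" link in sysfs, as bonding/team/bridge slaves do; if net-failover slaves do not expose that link, the check would have to look at the failover flags the kernel reports instead.

    import os

    def has_master(ifname: str) -> bool:
        # An enslaved netdev exposes /sys/class/net/<ifname>/master.
        return os.path.islink(f"/sys/class/net/{ifname}/master")

    def netdevs_to_configure():
        # Skip loopback and anything that already has a master above it.
        return [d for d in os.listdir("/sys/class/net")
                if d != "lo" and not has_master(d)]

    print(netdevs_to_configure())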
On Tue, Mar 19, 2019 at 02:38:06PM +0200, Liran Alon wrote:
> Hi Michael,
> 
> Great blog post which summarises everything very well!
> 
> Some comments I have:

Thanks! I'll try to update everything in the post when I'm not so jet-lagged.

> 1) I think that when we use the term "1-netdev model" in community discussions, we tend to refer to what the blog post defines as the "3-device model with hidden slaves".
> Therefore, I would suggest removing the "1-netdev model" section and renaming the "3-device model with hidden slaves" section to "1-netdev model".
> 
> 2) The userspace issues result from using both the "2-netdev model" and the "3-netdev model". However, the blog post describes them as if they only exist in the "3-netdev model".
> The reason these issues are not seen in the Azure environment is that they were partially handled by Microsoft for their specific 2-netdev model.
> Which leads me to the next comment.
> 
> 3) I suggest that the blog post also elaborate on what exactly the userspace issues are that lead to models other than the "1-netdev model".
> The issues that I'm aware of are (please tell me if you are aware of others!):
> (a) udev rename race condition: When the net-failover device is opened, it also opens its slaves. However, the order of KOBJ_ADD events to udev is first for the net-failover netdev and only then for the virtio-net netdev. This means that if userspace responds to the first event by opening the net-failover netdev, then any attempt by userspace to rename the virtio-net netdev in response to the second event will fail, because the virtio-net netdev is already opened. Also note that such a udev rename rule is useful because we would like to add rules that rename the virtio-net netdev to clearly signal that it is used as the standby interface of another net-failover netdev.
> The way this problem was worked around by Microsoft in NetVSC is to delay the open of the slave VF relative to the open of the NetVSC netdev. However, this is still a race and thus a hacky solution. It was accepted by the community only because it is internal to the NetVSC driver. A similar solution was rejected by the community for the net-failover driver.
> The solution we currently proposed to address this (patch by Si-Wei) was to change the kernel's rename handling to allow a net-failover slave to be renamed even if it is already opened. The patch is still not accepted.
> (b) Issues caused by various userspace components running DHCP on the net-failover slaves: DHCP should of course only be done on the net-failover netdev. Attempting to run DHCP on the net-failover slaves as well will cause networking issues. Therefore, userspace components should be taught to avoid doing DHCP on the net-failover slaves. The various userspace components include:
> b.1) dhclient: If run without parameters, it by default just enumerates all netdevs and attempts to DHCP them all.
> (I don't think Microsoft has handled this.)
> b.2) initramfs / dracut: In order to mount the root file-system from iSCSI, these components need networking and therefore DHCP on all netdevs.
> (Microsoft hasn't handled (b.2) because they don't have images which perform iSCSI boot in their Azure setup. Still an open issue.)
> b.3) cloud-init: If configured to perform network configuration, it attempts to configure all available netdevs. It should, however, avoid doing so on net-failover slaves.
> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist the Mellanox VF driver. However, this technique doesn't work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver.)
> b.4) The network managers of the various distros need to be updated to avoid DHCP on net-failover slaves? (Not sure. Asking...)
> 
> 4) Another interesting use-case where the net-failover mechanism is useful is handling NIC firmware failures or NIC firmware live-upgrade.
> In both cases, there is a need to perform a full PCIe reset of the NIC, which loses all the NIC eSwitch configuration of the various VFs.

In this setup, how does the VF keep going? If it doesn't keep going, why is it helpful?

> To handle these cases gracefully, one could just hot-unplug all VFs from the guests running on the host (which makes all guests fall back to the virtio-net netdev, which is backed by a netdev that is eventually on top of the PF). Networking is therefore restored to the guests once the PCIe reset is completed and the PF is functional again. To re-accelerate the guests' networking, the hypervisor can then just hot-plug new VFs into the guests.
> 
> P.S.:
> I would very much appreciate this forum's help in closing on the pending items written in (3), which currently prevent using this net-failover mechanism in real production use-cases.
> 
> Regards,
> -Liran
> 
> > On 17 Mar 2019, at 15:55, Michael S. Tsirkin <mst at redhat.com> wrote:
> > 
> > Hi all,
> > I've put up a blog post with a summary of where network
> > device failover stands and some open issues.
> > Not sure where best to host it, I just put it up on blogspot:
> > https://mstsirkin.blogspot.com/2019/03/virtio-network-device-failover-support.html
> > 
> > Comments, corrections are welcome!
> > 
> > -- 
> > MST