> On 19 Mar 2019, at 23:06, Michael S. Tsirkin <mst at redhat.com> wrote:
>
> On Tue, Mar 19, 2019 at 02:38:06PM +0200, Liran Alon wrote:
>> Hi Michael,
>>
>> Great blog post, which summarises everything very well!
>>
>> Some comments I have:
>
> Thanks!
> I'll try to update everything in the post when I'm not so jet-lagged.
>
>> 1) I think that when we use the term "1-netdev model" in community
>> discussions, we tend to refer to what you have defined in the blog post
>> as the "3-device model with hidden slaves".
>> Therefore, I would suggest simply removing the "1-netdev model" section
>> and renaming the "3-device model with hidden slaves" section to
>> "1-netdev model".
>>
>> 2) The userspace issues result from using both the "2-netdev model" and
>> the "3-netdev model". However, the blog post describes them as if they
>> exist only in the "3-netdev model".
>> The reason these issues are not seen in the Azure environment is that
>> Microsoft partially handled them for their specific 2-netdev model.
>> Which leads me to the next comment.
>>
>> 3) I suggest that the blog post also elaborate on what exactly the
>> userspace issues are that result from models other than the
>> "1-netdev model".
>> The issues that I'm aware of are (please tell me if you are aware of
>> others!):
>> (a) udev rename race condition: When the net-failover device is opened,
>> it also opens its slaves. However, udev receives the KOBJ_ADD events
>> first for the net-failover netdev and only then for the virtio-net
>> netdev. This means that if userspace responds to the first event by
>> opening the net-failover netdev, any attempt by userspace to rename the
>> virtio-net netdev in response to the second event will fail, because
>> the virtio-net netdev is already open. Note that such a udev rename
>> rule is useful because we would like to add rules that rename the
>> virtio-net netdev to clearly signal that it is used as the standby
>> interface of another net-failover netdev.
>> Microsoft worked around this problem in NetVSC by delaying the open of
>> the slave VF relative to the open of the NetVSC netdev. However, this
>> is still a race and thus a hacky solution. It was accepted by the
>> community only because it is internal to the NetVSC driver; a similar
>> solution was rejected by the community for the net-failover driver.
>> The solution we currently propose to address this (patch by Si-Wei) is
>> to change the kernel's rename handling to allow a net-failover slave
>> to be renamed even if it is already open. The patch has not been
>> accepted yet.
>> (b) Issues caused by various userspace components running DHCP on the
>> net-failover slaves: DHCP should of course only be done on the
>> net-failover netdev. Attempting DHCP on the net-failover slaves as well
>> will cause networking issues. Therefore, userspace components should be
>> taught to avoid doing DHCP on the net-failover slaves. The various
>> userspace components include:
>> b.1) dhclient: If run without parameters, it by default enumerates all
>> netdevs and attempts DHCP on all of them.
>> (I don't think Microsoft has handled this.)
>> b.2) initramfs / dracut: In order to mount the root file system from
>> iSCSI, these components need networking and therefore run DHCP on all
>> netdevs.
>> (Microsoft hasn't handled (b.2) because they don't have images which
>> perform iSCSI boot in their Azure setup. Still an open issue.)
>> b.3) cloud-init: If configured to perform network configuration, it
>> attempts to configure all available netdevs. However, it should avoid
>> doing so on net-failover slaves.
>> (Microsoft has handled this by adding a mechanism in cloud-init to
>> blacklist a netdev from being configured if it is owned by a specific
>> PCI driver. Specifically, they blacklist the Mellanox VF driver.
>> However, this technique doesn't work for the net-failover mechanism,
>> because both the net-failover netdev and the virtio-net netdev are
>> owned by the virtio-net PCI driver.)
>> b.4) The network managers of various distros need to be updated to
>> avoid DHCP on net-failover slaves? (Not sure. Asking...)
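To make the DHCP filtering in (b) concrete, here is a minimal sketch of how a userspace component might skip enslaved netdevs. It rests on an assumption that is not confirmed anywhere in this thread: that net-failover slaves show up in sysfs with a `master` link, the way bond and bridge slaves do. The function name `dhcp_candidates` and its `sysfs_net` parameter are hypothetical and only for illustration.

```python
import os


def dhcp_candidates(sysfs_net="/sys/class/net"):
    """Return the names of netdevs that have no master device, sorted.

    ASSUMPTION: an enslaved netdev (e.g. a net-failover slave) exposes a
    "master" entry in its sysfs directory. Devices with such an entry are
    skipped, so DHCP would only be attempted on top-level netdevs.
    """
    candidates = []
    for name in sorted(os.listdir(sysfs_net)):
        dev_dir = os.path.join(sysfs_net, name)
        if not os.path.isdir(dev_dir):
            continue  # skip plain files that are not netdev directories
        if os.path.lexists(os.path.join(dev_dir, "master")):
            continue  # enslaved to an upper device: do not DHCP on it
        candidates.append(name)
    return candidates
```

Calling `dhcp_candidates()` with no arguments would scan the live system. Note that it does not filter out the loopback device, and it would not replace the kernel-side rename fix discussed in (a).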
>>
>> 4) Another interesting use case where the net-failover mechanism is
>> useful is handling NIC firmware failures or NIC firmware live upgrade.
>> In both cases, there is a need to perform a full PCIe reset of the NIC,
>> which loses all the NIC eSwitch configuration of the various VFs.
>
> In this setup, how does VF keep going? If it doesn't keep going, why is
> it helpful?
Let me attempt to clarify.
First, let's analyse what a cloud provider can do when it wishes to upgrade the
NIC firmware while there are running guests utilising SR-IOV.
It can perform the following operations in order:
1) Hot-unplug all VFs from all running guests.
2) Upgrade the NIC firmware. This results in a PCIe reset, which causes
momentary network downtime on the PF; immediately afterwards, the PF is set up
again and guests have network connectivity.
3) Provision and hot-plug new VFs for all running guests. Guests again have
accelerated networking.
Without the net-failover mechanism, the host would have to hot-unplug all VFs
from all running guests, provision new VFs and hot-plug them anyway. But in that
case, the network downtime for guests is longer.
Second, let's analyse what happens when a health service running on the host
notices that the NIC firmware is in a bad state and that the NIC should
therefore be reset to recover.
The health service can follow exactly the same order of operations as described
above, except that (2) becomes just a PCIe reset.
Again, guests have shorter network downtime in this case as well when utilising
the net-failover mechanism.
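
The downtime argument above can be sketched as a toy model. This is only an illustration under simplifying assumptions: each host-side step is treated as one discrete state, and a guest has connectivity whenever the PF is up and it has either a VF or a virtio-net standby. The names `guest_has_network`, `UPGRADE_STEPS` and `steps_without_network` are hypothetical.

```python
def guest_has_network(vf_plugged, pf_up, has_standby):
    """Toy connectivity rule for a single guest."""
    if not pf_up:
        # Assumption: the VF and the virtio-net backend both depend on the PF.
        return False
    return vf_plugged or has_standby


# (description, vf_plugged, pf_up) as seen by the guest after each step.
UPGRADE_STEPS = [
    ("1) hot-unplug VF", False, True),
    ("2a) firmware upgrade, PCIe reset in progress", False, False),
    ("2b) PF set up again", False, True),
    ("3) hot-plug new VF", True, True),
]


def steps_without_network(has_standby):
    """List the steps during which the guest has no connectivity."""
    return [
        desc
        for desc, vf_plugged, pf_up in UPGRADE_STEPS
        if not guest_has_network(vf_plugged, pf_up, has_standby)
    ]
```

In this model, a guest with a standby (`steps_without_network(True)`) is offline only while the PCIe reset is in progress, whereas a guest without one (`steps_without_network(False)`) is offline from the hot-unplug until the new VF is hot-plugged.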
>
>> To handle these cases gracefully, one could just hot-unplug all VFs
>> from guests running on the host (which makes all guests use the
>> virtio-net netdev, which is backed by a netdev that is eventually on
>> top of the PF). Networking will therefore be restored to guests once
>> the PCIe reset is completed and the PF is functional again. To
>> re-accelerate the guests' networking, the hypervisor can just hot-plug
>> new VFs to the guests.
>>
>> P.S.:
>> I would very much appreciate this forum's help in closing the pending
>> items listed in (3), which currently prevent using the net-failover
>> mechanism in real production use cases.
>>
>> Regards,
>> -Liran
>>
>>> On 17 Mar 2019, at 15:55, Michael S. Tsirkin <mst at redhat.com> wrote:
>>>
>>> Hi all,
>>> I've put up a blog post with a summary of where network
>>> device failover stands and some open issues.
>>> Not sure where best to host it, I just put it up on blogspot:
>>>
>>> https://mstsirkin.blogspot.com/2019/03/virtio-network-device-failover-support.html
>>>
>>> Comments, corrections are welcome!
>>>
>>> --
>>> MST