Siwei Liu
2019-Feb-22 01:14 UTC
net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
Sorry for replying to this ancient thread. There was some remaining issue that I don't think the initial net_failover patch got addressed cleanly, see:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268

The renaming of 'eth0' to 'ens4' fails because the udev userspace was not specifically written for such kernel automatic enslavement. Specifically, if it is a bond or team, the slave would typically get renamed *before* virtual device gets created, that's what udev can control (without getting netdev opened early by the other part of kernel) and other userspace components for e.g. initramfs, init-scripts can coordinate well in between. The in-kernel auto-enslavement of net_failover breaks this userspace convention, which doesn't provide a solution if users care about consistent naming on the slave netdevs specifically.

Previously this issue had been specifically called out when IFF_HIDDEN and the 1-netdev was proposed, but no one gives out a solution to this problem ever since. Please share your mind how to proceed and solve this userspace issue if netdev does not welcome a 1-netdev model.

On Wed, Apr 11, 2018 at 12:53 AM Jiri Pirko <jiri at resnulli.us> wrote:
>
> Tue, Apr 10, 2018 at 11:26:08PM CEST, stephen at networkplumber.org wrote:
> >On Tue, 10 Apr 2018 11:59:50 -0700
> >Sridhar Samudrala <sridhar.samudrala at intel.com> wrote:
> >
> >> Use the registration/notification framework supported by the generic
> >> bypass infrastructure.
> >>
> >> Signed-off-by: Sridhar Samudrala <sridhar.samudrala at intel.com>
> >> ---
> >
> >Thanks for doing this. Your current version has couple show stopper
> >issues.
> >
> >First, the slave device is instantly taking over the slave.
> >This doesn't allow udev/systemd to do its device rename of the slave
> >device. Netvsc uses a delayed work to workaround this.
>
> Wait. Why the fact a device is enslaved has to affect the udev in any
> way? If it does, smells like a bug in udev.

See above for clarifications.

Thanks,

> >
> >Secondly, the select queue needs to call queue selection in VF.
> >The bonding/teaming logic doesn't work well for UDP flows.
> >Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> >fixed this performance problem.
> >
> >Lastly, more indirection is bad in current climate.
> >
> >I am not completely adverse to this but it needs to be fast, simple
> >and completely transparent.
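A minimal sketch of the ordering problem described in the message above, written as a toy Python model rather than kernel code. `NetDev`, `bond_style` and `failover_style` are hypothetical names used only for illustration. The one kernel behaviour assumed is the one reported later in this thread: `dev_change_name()` returns -EBUSY when the interface already has IFF_UP set.

```python
import errno

IFF_UP = 0x1  # "interface is up" flag bit, as in <linux/if.h>


class NetDev:
    """Toy stand-in for a kernel netdev (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.flags = 0

    def open(self):
        # Bringing the device up sets IFF_UP.
        self.flags |= IFF_UP

    def change_name(self, new_name):
        # Assumed behaviour, taken from this thread: a rename is refused
        # with -EBUSY while the interface is up.
        if self.flags & IFF_UP:
            return -errno.EBUSY
        self.name = new_name
        return 0


def bond_style(dev, new_name):
    """udev renames first; the slave is opened (enslaved) afterwards."""
    rc = dev.change_name(new_name)
    dev.open()
    return rc


def failover_style(dev, new_name):
    """The kernel opens the slave at registration; udev renames later."""
    dev.open()
    return dev.change_name(new_name)
```

With this model, `bond_style(NetDev("eth0"), "ens4")` returns 0 and the device ends up named `ens4`, while `failover_style(NetDev("eth0"), "ens4")` returns `-errno.EBUSY` and the name stays `eth0`.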
Michael S. Tsirkin
2019-Feb-22 01:39 UTC
net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> Sorry for replying to this ancient thread. There was some remaining
> issue that I don't think the initial net_failover patch got addressed
> cleanly, see:
>
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>
> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> not specifically written for such kernel automatic enslavement.
> Specifically, if it is a bond or team, the slave would typically get
> renamed *before* virtual device gets created, that's what udev can
> control (without getting netdev opened early by the other part of
> kernel) and other userspace components for e.g. initramfs,
> init-scripts can coordinate well in between. The in-kernel
> auto-enslavement of net_failover breaks this userspace convention,
> which doesn't provide a solution if users care about consistent naming
> on the slave netdevs specifically.
>
> Previously this issue had been specifically called out when IFF_HIDDEN
> and the 1-netdev was proposed, but no one gives out a solution to this
> problem ever since. Please share your mind how to proceed and solve
> this userspace issue if netdev does not welcome a 1-netdev model.

Above says:

    there's no motivation in the systemd/udevd community at
    this point to refactor the rename logic and make it work well with
    3-netdev.

What would the fix be? Skip slave devices?

--
MST
Samudrala, Sridhar
2019-Feb-22 07:00 UTC
[virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
On 2/21/2019 7:33 PM, si-wei liu wrote:
>
>
> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>> Sorry for replying to this ancient thread. There was some remaining
>>> issue that I don't think the initial net_failover patch got addressed
>>> cleanly, see:
>>>
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>
>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>> not specifically written for such kernel automatic enslavement.
>>> Specifically, if it is a bond or team, the slave would typically get
>>> renamed *before* virtual device gets created, that's what udev can
>>> control (without getting netdev opened early by the other part of
>>> kernel) and other userspace components for e.g. initramfs,
>>> init-scripts can coordinate well in between. The in-kernel
>>> auto-enslavement of net_failover breaks this userspace convention,
>>> which doesn't provide a solution if users care about consistent naming
>>> on the slave netdevs specifically.
>>>
>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>> problem ever since. Please share your mind how to proceed and solve
>>> this userspace issue if netdev does not welcome a 1-netdev model.
>> Above says:
>>
>>     there's no motivation in the systemd/udevd community at
>>     this point to refactor the rename logic and make it work well with
>>     3-netdev.
>>
>> What would the fix be? Skip slave devices?
>>
> There's nothing user can get if just skipping slave devices - the name
> is still unchanged and unpredictable e.g. eth0, or eth1 the next
> reboot, while the rest may conform to the naming scheme (ens3 and
> such). There's no way one can fix this in userspace alone - when the
> failover is created the enslaved netdev was opened by the kernel
> earlier than the userspace is made aware of, and there's no
> negotiation protocol for kernel to know when userspace has done
> initial renaming of the interface. I would expect netdev list should
> at least provide the direction in general for how this can be solved...
>

Is there an issue if slave device names are not predictable? The user/admin scripts are expected to only work with the master failover device.

Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion about moving them to a hidden network namespace so that they are not visible from the default namespace. I looked into this sometime back, but did not find the right kernel api to create a network namespace within kernel. If so, we could use this mechanism to simulate a 1-netdev model.

> -Siwei
Michael S. Tsirkin
2019-Feb-22 15:14 UTC
[virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>
>
> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> >
> >
> > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > >
> > >
> > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > cleanly, see:
> > > > >
> > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > >
> > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > not specifically written for such kernel automatic enslavement.
> > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > control (without getting netdev opened early by the other part of
> > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > which doesn't provide a solution if users care about consistent naming
> > > > > on the slave netdevs specifically.
> > > > >
> > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > Above says:
> > > >
> > > >     there's no motivation in the systemd/udevd community at
> > > >     this point to refactor the rename logic and make it work well with
> > > >     3-netdev.
> > > >
> > > > What would the fix be? Skip slave devices?
> > > >
> > > There's nothing user can get if just skipping slave devices - the
> > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > next reboot, while the rest may conform to the naming scheme (ens3
> > > and such). There's no way one can fix this in userspace alone - when
> > > the failover is created the enslaved netdev was opened by the kernel
> > > earlier than the userspace is made aware of, and there's no
> > > negotiation protocol for kernel to know when userspace has done
> > > initial renaming of the interface. I would expect netdev list should
> > > at least provide the direction in general for how this can be
> > > solved...

I was just wondering what did you mean when you said
"refactor the rename logic and make it work well with 3-netdev" -
was there a proposal udev rejected?

Anyway, can we write a time diagram for what happens in which order that
leads to failure? That would help look for triggers that we can tie
into, or add new ones.

> >
> > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > to only work with the master failover device.
> Where does this expectation come from?
>
> Admin users may have ethtool or tc configurations that need to deal with
> predictable interface name. Third-party app which was built upon specifying
> certain interface name can't be modified to chase dynamic names.
>
> Specifically, we have pre-canned image that uses ethtool to fine tune VF
> offload settings post boot for specific workload. Those images won't work
> well if the name is constantly changing just after couple rounds of live
> migration.

It should be possible to specify the ethtool configuration on the
master and have it automatically propagated to the slave.

BTW this is something we should look at IMHO.

> > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> model as much transparent to a real NIC as possible, while a hidden netns is
> just the vehicle). However, I recall there was resistance around this
> discussion that even the concept of hiding itself is a taboo for Linux
> netdev. I would like to summon potential alternatives before concluding
> 1-netdev is the only solution too soon.
>
> Thanks,
> -Siwei

Your scripts would not work at all then, right?

> >
> > > -Siwei
> > >
> > >
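A minimal sketch of the idea raised above, that ethtool-style configuration set on the master could be propagated to whichever slave is currently attached. This is a toy Python model, not an existing kernel or ethtool interface; `FailoverMaster` and its methods are hypothetical names, and the feature keys are arbitrary examples.

```python
class FailoverMaster:
    """Toy model: settings applied to the master are replayed on its slaves."""

    def __init__(self, name):
        self.name = name
        self.offloads = {}  # feature name -> enabled, as set on the master
        self.slaves = {}    # slave name -> settings currently applied to it

    def set_offload(self, feature, enabled):
        # Record the setting on the master and push it to every slave
        # that is attached right now.
        self.offloads[feature] = enabled
        for settings in self.slaves.values():
            settings[feature] = enabled

    def attach_slave(self, slave_name):
        # Replay everything configured so far, so a VF that is plugged in
        # again after a migration gets the same settings whatever its name.
        self.slaves[slave_name] = dict(self.offloads)

    def detach_slave(self, slave_name):
        self.slaves.pop(slave_name, None)
```

In this model a script only ever names the master: after `m = FailoverMaster("ens3")` and `m.set_offload("tso", False)`, a slave attached as `eth0`, detached, and re-attached as `eth1` carries the same `tso` setting each time.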
Stephen Hemminger
2019-Feb-26 01:39 UTC
[virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
On Mon, 25 Feb 2019 16:58:07 -0800 si-wei liu <si-wei.liu at oracle.com> wrote:
> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> >>
> >> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> >>>
> >>> On 2/21/2019 7:33 PM, si-wei liu wrote:
> >>>>
> >>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> >>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> >>>>>> Sorry for replying to this ancient thread. There was some remaining
> >>>>>> issue that I don't think the initial net_failover patch got addressed
> >>>>>> cleanly, see:
> >>>>>>
> >>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> >>>>>>
> >>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> >>>>>> not specifically written for such kernel automatic enslavement.
> >>>>>> Specifically, if it is a bond or team, the slave would typically get
> >>>>>> renamed *before* virtual device gets created, that's what udev can
> >>>>>> control (without getting netdev opened early by the other part of
> >>>>>> kernel) and other userspace components for e.g. initramfs,
> >>>>>> init-scripts can coordinate well in between. The in-kernel
> >>>>>> auto-enslavement of net_failover breaks this userspace convention,
> >>>>>> which doesn't provide a solution if users care about consistent naming
> >>>>>> on the slave netdevs specifically.
> >>>>>>
> >>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
> >>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
> >>>>>> problem ever since. Please share your mind how to proceed and solve
> >>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
> >>>>> Above says:
> >>>>>
> >>>>>     there's no motivation in the systemd/udevd community at
> >>>>>     this point to refactor the rename logic and make it work well with
> >>>>>     3-netdev.
> >>>>>
> >>>>> What would the fix be? Skip slave devices?
> >>>>>
> >>>> There's nothing user can get if just skipping slave devices - the
> >>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
> >>>> next reboot, while the rest may conform to the naming scheme (ens3
> >>>> and such). There's no way one can fix this in userspace alone - when
> >>>> the failover is created the enslaved netdev was opened by the kernel
> >>>> earlier than the userspace is made aware of, and there's no
> >>>> negotiation protocol for kernel to know when userspace has done
> >>>> initial renaming of the interface. I would expect netdev list should
> >>>> at least provide the direction in general for how this can be
> >>>> solved...
> >
> > I was just wondering what did you mean when you said
> > "refactor the rename logic and make it work well with 3-netdev" -
> > was there a proposal udev rejected?
> No. I never believed this particular issue can be fixed in userspace
> alone. Previously someone had said it could be, but I never see any work
> or relevant discussion ever happened in various userspace communities
> (for e.g. dracut, initramfs-tools, systemd, udev, and NetworkManager).
> IMHO the root of the issue derives from the kernel, it makes more sense
> to start from netdev, work out and decide on a solution: see what can be
> done in the kernel in order to fix it, then after that engage userspace
> community for the feasibility...
>
> > Anyway, can we write a time diagram for what happens in which order that
> > leads to failure? That would help look for triggers that we can tie
> > into, or add new ones.
> >
>
> See attached diagram.
>
> >
> >>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> >>> to only work with the master failover device.
> >> Where does this expectation come from?
> >>
> >> Admin users may have ethtool or tc configurations that need to deal with
> >> predictable interface name. Third-party app which was built upon specifying
> >> certain interface name can't be modified to chase dynamic names.
> >>
> >> Specifically, we have pre-canned image that uses ethtool to fine tune VF
> >> offload settings post boot for specific workload. Those images won't work
> >> well if the name is constantly changing just after couple rounds of live
> >> migration.
> > It should be possible to specify the ethtool configuration on the
> > master and have it automatically propagated to the slave.
> >
> > BTW this is something we should look at IMHO.
> I was elaborating a few examples that the expectation and assumption
> that user/admin scripts only deal with master failover device is
> incorrect. It had never been taken good care of, although I did try to
> emphasize it from the very beginning.
>
> Basically what you said about propagating the ethtool configuration down
> to the slave is the key pursuance of 1-netdev model. However, what I am
> seeking now is any alternative that can also fix the specific udev
> rename problem, before concluding that 1-netdev is the only solution.
> Generally a 1-netdev scheme would take time to implement, while I'm
> trying to find a way out to fix this particular naming problem under
> 3-netdev.
>
> >
> >>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> >>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
> >>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> >>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> >> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> >> model as much transparent to a real NIC as possible, while a hidden netns is
> >> just the vehicle). However, I recall there was resistance around this
> >> discussion that even the concept of hiding itself is a taboo for Linux
> >> netdev. I would like to summon potential alternatives before concluding
> >> 1-netdev is the only solution too soon.
> >>
> >> Thanks,
> >> -Siwei
> > Your scripts would not work at all then, right?
> At this point we don't claim images with such usage as SR-IOV live
> migrate-able. We would flag it as live migrate-able until this ethtool
> config issue is fully addressed and a transparent live migration
> solution emerges in upstream eventually.

The hyper-v netvsc with 1-dev model uses a timeout to allow udev to do its rename.
I proposed a patch to key state change off of the udev rename, but that patch was
rejected.
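A minimal sketch of why a timeout, as mentioned above for netvsc, is a heuristic rather than a guarantee. It is a toy timing model in Python, not kernel code. `final_name` and its parameters are hypothetical names, the millisecond values are arbitrary, and the only behaviour assumed is the one reported in this thread: a rename is refused once the kernel has brought the device up.

```python
def final_name(enslave_delay_ms, udev_rename_at_ms, old="eth0", new="ens4"):
    """Return the VF's name after both events have happened.

    The VF is registered at t=0. The kernel opens (enslaves) it at
    enslave_delay_ms; udev attempts the rename at udev_rename_at_ms.
    """
    # The rename only sticks if udev gets there strictly before the
    # kernel brings the device up; otherwise it is refused (-EBUSY).
    if udev_rename_at_ms < enslave_delay_ms:
        return new
    return old


# Immediate enslavement (delay 0): the rename always loses the race.
print(final_name(enslave_delay_ms=0, udev_rename_at_ms=50))
# A generous delay lets a typical rename through.
print(final_name(enslave_delay_ms=2000, udev_rename_at_ms=50))
# A slow udev (for example during a loaded boot) still misses the window.
print(final_name(enslave_delay_ms=2000, udev_rename_at_ms=5000))
```

The third call is the case a fixed delay cannot cover, which is presumably why keying the state change off the rename itself was proposed instead.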
Michael S. Tsirkin
2019-Feb-26 02:05 UTC
[virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
On Mon, Feb 25, 2019 at 05:39:12PM -0800, Stephen Hemminger wrote:
> > >>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > >>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > >>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > >>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> > >> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > >> model as much transparent to a real NIC as possible, while a hidden netns is
> > >> just the vehicle). However, I recall there was resistance around this
> > >> discussion that even the concept of hiding itself is a taboo for Linux
> > >> netdev. I would like to summon potential alternatives before concluding
> > >> 1-netdev is the only solution too soon.
> > >>
> > >> Thanks,
> > >> -Siwei
> > > Your scripts would not work at all then, right?
> > At this point we don't claim images with such usage as SR-IOV live
> > migrate-able. We would flag it as live migrate-able until this ethtool
> > config issue is fully addressed and a transparent live migration
> > solution emerges in upstream eventually.
>
> The hyper-v netvsc with 1-dev model uses a timeout to allow udev to do its rename.
> I proposed a patch to key state change off of the udev rename, but that patch was
> rejected.

Of course that would mean nothing works without udev - was that the
objection? Could you help me find that discussion pls?

--
MST
Michael S. Tsirkin
2019-Feb-26 02:08 UTC
[virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>
>
> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> > >
> > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> > > >
> > > > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > > > >
> > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > > > cleanly, see:
> > > > > > >
> > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > >
> > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > > not specifically written for such kernel automatic enslavement.
> > > > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > > > control (without getting netdev opened early by the other part of
> > > > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > > > which doesn't provide a solution if users care about consistent naming
> > > > > > > on the slave netdevs specifically.
> > > > > > >
> > > > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > > > Above says:
> > > > > >
> > > > > >     there's no motivation in the systemd/udevd community at
> > > > > >     this point to refactor the rename logic and make it work well with
> > > > > >     3-netdev.
> > > > > >
> > > > > > What would the fix be? Skip slave devices?
> > > > > >
> > > > > There's nothing user can get if just skipping slave devices - the
> > > > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > > > next reboot, while the rest may conform to the naming scheme (ens3
> > > > > and such). There's no way one can fix this in userspace alone - when
> > > > > the failover is created the enslaved netdev was opened by the kernel
> > > > > earlier than the userspace is made aware of, and there's no
> > > > > negotiation protocol for kernel to know when userspace has done
> > > > > initial renaming of the interface. I would expect netdev list should
> > > > > at least provide the direction in general for how this can be
> > > > > solved...
> >
> > I was just wondering what did you mean when you said
> > "refactor the rename logic and make it work well with 3-netdev" -
> > was there a proposal udev rejected?
> No. I never believed this particular issue can be fixed in userspace alone.
> Previously someone had said it could be, but I never see any work or
> relevant discussion ever happened in various userspace communities (for e.g.
> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> of the issue derives from the kernel, it makes more sense to start from
> netdev, work out and decide on a solution: see what can be done in the
> kernel in order to fix it, then after that engage userspace community for
> the feasibility...
>
> > Anyway, can we write a time diagram for what happens in which order that
> > leads to failure? That would help look for triggers that we can tie
> > into, or add new ones.
> >
>
> See attached diagram.
>
> >
> > > > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > > > to only work with the master failover device.
> > > Where does this expectation come from?
> > >
> > > Admin users may have ethtool or tc configurations that need to deal with
> > > predictable interface name. Third-party app which was built upon specifying
> > > certain interface name can't be modified to chase dynamic names.
> > >
> > > Specifically, we have pre-canned image that uses ethtool to fine tune VF
> > > offload settings post boot for specific workload. Those images won't work
> > > well if the name is constantly changing just after couple rounds of live
> > > migration.
> > It should be possible to specify the ethtool configuration on the
> > master and have it automatically propagated to the slave.
> >
> > BTW this is something we should look at IMHO.
> I was elaborating a few examples that the expectation and assumption that
> user/admin scripts only deal with master failover device is incorrect. It
> had never been taken good care of, although I did try to emphasize it from
> the very beginning.
>
> Basically what you said about propagating the ethtool configuration down to
> the slave is the key pursuance of 1-netdev model. However, what I am seeking
> now is any alternative that can also fix the specific udev rename problem,
> before concluding that 1-netdev is the only solution. Generally a 1-netdev
> scheme would take time to implement, while I'm trying to find a way out to
> fix this particular naming problem under 3-netdev.
>
> >
> > > > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > > > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > > > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > > > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> > > Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > > model as much transparent to a real NIC as possible, while a hidden netns is
> > > just the vehicle). However, I recall there was resistance around this
> > > discussion that even the concept of hiding itself is a taboo for Linux
> > > netdev. I would like to summon potential alternatives before concluding
> > > 1-netdev is the only solution too soon.
> > >
> > > Thanks,
> > > -Siwei
> > Your scripts would not work at all then, right?
> At this point we don't claim images with such usage as SR-IOV live
> migrate-able. We would flag it as live migrate-able until this ethtool
> config issue is fully addressed and a transparent live migration solution
> emerges in upstream eventually.
>
>
> Thanks,
> -Siwei
> >
> > > > > -Siwei
>
> net_failover(kernel)                              | network.service (user)       | systemd-udevd (user)
> --------------------------------------------------+------------------------------+--------------------------------------------
> (standby virtio-net and net_failover              |                              |
> devices created and initialized,                  |                              |
> i.e. virtnet_probe()->                            |                              |
> net_failover_create()                             |                              |
> was done.)                                        |                              |
>                                                   |                              |
>                                                   | runs `ifup ens3' ->          |
>                                                   | ip link set dev ens3 up      |
> net_failover_open()                               |                              |
> dev_open(virtnet_dev)                             |                              |
> virtnet_open(virtnet_dev)                         |                              |
> netif_carrier_on(failover_dev)                    |                              |
> ...                                               |                              |
>                                                   |                              |
> (VF hot plugged in)                               |                              |
> ixgbevf_probe()                                   |                              |
> register_netdev(ixgbevf_netdev)                   |                              |
> netdev_register_kobject(ixgbevf_netdev)           |                              |
> kobject_add(ixgbevf_dev)                          |                              |
> device_add(ixgbevf_dev)                           |                              |
> kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD)      |                              |
> netlink_broadcast()                               |                              |
> ...                                               |                              |
> call_netdevice_notifiers(NETDEV_REGISTER)         |                              |
> failover_event(..., NETDEV_REGISTER, ...)         |                              |
> failover_slave_register(ixgbevf_netdev)           |                              |
> net_failover_slave_register(ixgbevf_netdev)       |                              |
> dev_open(ixgbevf_netdev)                          |                              |
>                                                   |                              |
>                                                   |                              |
>                                                   |                              | received ADD uevent from netlink fd
>                                                   |                              | ...
>                                                   |                              | udev-builtin-net_id.c:dev_pci_slot()
>                                                   |                              | (decided to rename 'eth0')
>                                                   |                              | ip link set dev eth0 name ens4
> (dev_change_name() returns -EBUSY as              |                              |
> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>                                                   |                              |
>

Given renaming slaves does not work anyway: would it work if we just
hard-coded slave names instead?

E.g.
1. fail slave renames
2. rename of failover to XX automatically renames standby to XXnsby
   and primary to XXnpry

--
MST
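A minimal sketch of the naming rule proposed above, where the slave names are derived from the failover device's name. This is an illustration in Python, not an implemented kernel behaviour. `slave_names` is a hypothetical helper. The length check is an added assumption: Linux interface names are limited to IFNAMSIZ (16) bytes including the terminating NUL, so the four-character suffix limits how long the failover name can be.

```python
IFNAMSIZ = 16  # kernel limit on interface name length, including the NUL


def slave_names(failover_name):
    """Derive standby/primary names from the failover name (proposed scheme)."""
    names = {
        "standby": failover_name + "nsby",
        "primary": failover_name + "npry",
    }
    for role, name in names.items():
        # At most IFNAMSIZ - 1 visible characters fit in an interface name.
        if len(name) > IFNAMSIZ - 1:
            raise ValueError(f"{role} name {name!r} exceeds {IFNAMSIZ - 1} characters")
    return names


print(slave_names("ens3"))
```

For a failover device called `ens3` this yields `ens3nsby` and `ens3npry`. Under the stated length assumption, a failover name longer than 11 characters would not leave room for the suffix, so the scheme would need a rule for that case.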
Stephen Hemminger
2019-Feb-27 21:57 UTC
[virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
On Tue, 26 Feb 2019 16:17:21 -0800 si-wei liu <si-wei.liu at oracle.com> wrote:
> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
> >>
> >> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> >>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> >>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> >>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
> >>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> >>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> >>>>>>>> Sorry for replying to this ancient thread. There was some remaining
> >>>>>>>> issue that I don't think the initial net_failover patch got addressed
> >>>>>>>> cleanly, see:
> >>>>>>>>
> >>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> >>>>>>>>
> >>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> >>>>>>>> not specifically written for such kernel automatic enslavement.
> >>>>>>>> Specifically, if it is a bond or team, the slave would typically get
> >>>>>>>> renamed *before* virtual device gets created, that's what udev can
> >>>>>>>> control (without getting netdev opened early by the other part of
> >>>>>>>> kernel) and other userspace components for e.g. initramfs,
> >>>>>>>> init-scripts can coordinate well in between. The in-kernel
> >>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
> >>>>>>>> which doesn't provide a solution if users care about consistent naming
> >>>>>>>> on the slave netdevs specifically.
> >>>>>>>>
> >>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
> >>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
> >>>>>>>> problem ever since. Please share your mind how to proceed and solve
> >>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
> >>>>>>> Above says:
> >>>>>>>
> >>>>>>>     there's no motivation in the systemd/udevd community at
> >>>>>>>     this point to refactor the rename logic and make it work well with
> >>>>>>>     3-netdev.
> >>>>>>>
> >>>>>>> What would the fix be? Skip slave devices?
> >>>>>>>
> >>>>>> There's nothing user can get if just skipping slave devices - the
> >>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
> >>>>>> next reboot, while the rest may conform to the naming scheme (ens3
> >>>>>> and such). There's no way one can fix this in userspace alone - when
> >>>>>> the failover is created the enslaved netdev was opened by the kernel
> >>>>>> earlier than the userspace is made aware of, and there's no
> >>>>>> negotiation protocol for kernel to know when userspace has done
> >>>>>> initial renaming of the interface. I would expect netdev list should
> >>>>>> at least provide the direction in general for how this can be
> >>>>>> solved...
> >>> I was just wondering what did you mean when you said
> >>> "refactor the rename logic and make it work well with 3-netdev" -
> >>> was there a proposal udev rejected?
> >> No. I never believed this particular issue can be fixed in userspace alone.
> >> Previously someone had said it could be, but I never see any work or
> >> relevant discussion ever happened in various userspace communities (for e.g.
> >> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> >> of the issue derives from the kernel, it makes more sense to start from
> >> netdev, work out and decide on a solution: see what can be done in the
> >> kernel in order to fix it, then after that engage userspace community for
> >> the feasibility...
> >>
> >>> Anyway, can we write a time diagram for what happens in which order that
> >>> leads to failure? That would help look for triggers that we can tie
> >>> into, or add new ones.
> >>>
> >> See attached diagram.
> >>
> >>>
> >>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> >>>>> to only work with the master failover device.
> >>>> Where does this expectation come from?
> >>>>
> >>>> Admin users may have ethtool or tc configurations that need to deal with
> >>>> predictable interface name. Third-party app which was built upon specifying
> >>>> certain interface name can't be modified to chase dynamic names.
> >>>>
> >>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
> >>>> offload settings post boot for specific workload. Those images won't work
> >>>> well if the name is constantly changing just after couple rounds of live
> >>>> migration.
> >>> It should be possible to specify the ethtool configuration on the
> >>> master and have it automatically propagated to the slave.
> >>>
> >>> BTW this is something we should look at IMHO.
> >> I was elaborating a few examples that the expectation and assumption that
> >> user/admin scripts only deal with master failover device is incorrect. It
> >> had never been taken good care of, although I did try to emphasize it from
> >> the very beginning.
> >>
> >> Basically what you said about propagating the ethtool configuration down to
> >> the slave is the key pursuance of 1-netdev model. However, what I am seeking
> >> now is any alternative that can also fix the specific udev rename problem,
> >> before concluding that 1-netdev is the only solution. Generally a 1-netdev
> >> scheme would take time to implement, while I'm trying to find a way out to
> >> fix this particular naming problem under 3-netdev.
> >>
> >>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> >>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
> >>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> >>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> >>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> >>>> model as much transparent to a real NIC as possible, while a hidden netns is
> >>>> just the vehicle). However, I recall there was resistance around this
> >>>> discussion that even the concept of hiding itself is a taboo for Linux
> >>>> netdev. I would like to summon potential alternatives before concluding
> >>>> 1-netdev is the only solution too soon.
> >>>>
> >>>> Thanks,
> >>>> -Siwei
> >>> Your scripts would not work at all then, right?
> >> At this point we don't claim images with such usage as SR-IOV live
> >> migrate-able. We would flag it as live migrate-able until this ethtool
> >> config issue is fully addressed and a transparent live migration solution
> >> emerges in upstream eventually.
> >>
> >>
> >> Thanks,
> >> -Siwei
> >>>
> >>>>>> -Siwei
> >>
> >> net_failover(kernel)                              | network.service (user)       | systemd-udevd (user)
> >> --------------------------------------------------+------------------------------+--------------------------------------------
> >> (standby virtio-net and net_failover              |                              |
> >> devices created and initialized,                  |                              |
> >> i.e. virtnet_probe()->                            |                              |
> >> net_failover_create()                             |                              |
> >> was done.)                                        |                              |
> >>                                                   |                              |
> >>                                                   | runs `ifup ens3' ->          |
> >>                                                   | ip link set dev ens3 up      |
> >> net_failover_open()                               |                              |
> >> dev_open(virtnet_dev)                             |                              |
> >> virtnet_open(virtnet_dev)                         |                              |
> >> netif_carrier_on(failover_dev)                    |                              |
> >> ...
| | > >> | | > >> (VF hot plugged in) | | > >> ixgbevf_probe() | | > >> register_netdev(ixgbevf_netdev) | | > >> netdev_register_kobject(ixgbevf_netdev) | | > >> kobject_add(ixgbevf_dev) | | > >> device_add(ixgbevf_dev) | | > >> kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) | | > >> netlink_broadcast() | | > >> ... | | > >> call_netdevice_notifiers(NETDEV_REGISTER) | | > >> failover_event(..., NETDEV_REGISTER, ...) | | > >> failover_slave_register(ixgbevf_netdev) | | > >> net_failover_slave_register(ixgbevf_netdev) | | > >> dev_open(ixgbevf_netdev) | | > >> | | > >> | | > >> | | received ADD uevent from netlink fd > >> | | ... > >> | | udev-builtin-net_id.c:dev_pci_slot() > >> | | (decided to renamed 'eth0' ) > >> | | ip link set dev eth0 name ens4 > >> (dev_change_name() returns -EBUSY as | | > >> ixgbevf_netdev->flags has IFF_UP) | | > >> | | > >> > > Given renaming slaves does not work anyway: > I was actually thinking what if we relieve the rename restriction just > for the failover slave? What the impact would be? I think users don't > care about slave being renamed when it's in use, especially the initial > rename. Thoughts? > > > would it work if we just > > hard-coded slave names instead? > > > > E.g. > > 1. fail slave renames > > 2. rename of failover to XX automatically renames standby to XXnsby > > and primary to XXnpry > That wouldn't help. The time when the failover master gets renamed, the > VF may not be present. I don't like the idea to delay exposing failover > master until VF is hot plugged in (probably subject to various failures) > later.What netvsc does now is wait 2 seconds (to allow udev to do rename) before bringing the VF link up. This works, has had no problems even with slow distributions and is widely used. A patch to allow ending the timeout after rename was proposed but rejected. https://lore.kernel.org/netdev/20171220223323.21125-1-sthemmin at microsoft.com/ Allow network devices to change name when up is too risky. 
There are things like netfilter rules and other state in and out of the kernel that may break. Userspace does not like it when the rules change.
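For readers following the ordering argument, here is a minimal Python sketch of the race shown in the time diagram. It is a toy model, not kernel code: ToyNetdev and hotplug_vf are hypothetical names, and the only behaviour borrowed from the thread is that dev_change_name() returns -EBUSY while the device has IFF_UP set. The deferred ordering stands in for the netvsc-style delay described above; the 2 second timer itself is not modelled.

```python
import errno

IFF_UP = 0x1  # same bit value as the kernel's IFF_UP flag


class ToyNetdev:
    """Toy stand-in for a net_device; tracks only the name and IFF_UP."""

    def __init__(self, name):
        self.name = name
        self.flags = 0

    def dev_open(self):
        # Bringing the device up sets IFF_UP.
        self.flags |= IFF_UP

    def dev_change_name(self, new_name):
        # Mirrors the check quoted in the diagram: a device that is up
        # cannot be renamed and the caller gets -EBUSY.
        if self.flags & IFF_UP:
            return -errno.EBUSY
        self.name = new_name
        return 0


def hotplug_vf(open_before_rename):
    """Replay the VF hot plug with one of two orderings (hypothetical helper).

    open_before_rename=True:  the failover code opens the slave at
                              NETDEV_REGISTER time, before udev acts.
    open_before_rename=False: the open is deferred until udev has renamed.
    Returns the final interface name and the rename return code.
    """
    vf = ToyNetdev("eth0")
    if open_before_rename:
        vf.dev_open()
        rc = vf.dev_change_name("ens4")
    else:
        rc = vf.dev_change_name("ens4")
        vf.dev_open()
    return vf.name, rc


if __name__ == "__main__":
    print(hotplug_vf(open_before_rename=True))
    print(hotplug_vf(open_before_rename=False))
```

In this model the first ordering leaves the VF named eth0 with a -EBUSY result, and the second ends with ens4, which is the whole disagreement in miniature: either the kernel waits for userspace, or the rename restriction has to change.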
Michael S. Tsirkin
2019-Feb-27 22:38 UTC
[virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> > Given renaming slaves does not work anyway:
> I was actually thinking what if we relieve the rename restriction just for
> the failover slave? What the impact would be? I think users don't care about
> slave being renamed when it's in use, especially the initial rename.
> Thoughts?
>
> > would it work if we just
> > hard-coded slave names instead?
> >
> > E.g.
> > 1. fail slave renames
> > 2. rename of failover to XX automatically renames standby to XXnsby
> >    and primary to XXnpry
> That wouldn't help. The time when the failover master gets renamed, the VF
> may not be present.

In this scheme if VF is not there it will be renamed immediately after registration.

> I don't like the idea to delay exposing failover master
> until VF is hot plugged in (probably subject to various failures) later.
>
> Thanks,
> -Siwei

I agree, this was not what I meant.
Michael S. Tsirkin
2019-Feb-27 23:50 UTC
[virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
> > On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
> > > On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> > > > would it work if we just
> > > > hard-coded slave names instead?
> > > >
> > > > E.g.
> > > > 1. fail slave renames
> > > > 2. rename of failover to XX automatically renames standby to XXnsby
> > > >    and primary to XXnpry
> > > That wouldn't help. The time when the failover master gets renamed, the VF
> > > may not be present.
> > In this scheme if VF is not there it will be renamed immediately after registration.
> Who will be responsible to rename the slave, the kernel?

That's the idea.

> Note the master's
> name may or may not come from the userspace. If it comes from the userspace,
> should the userspace daemon change their expectation not to name/rename
> _any_ slaves (today there's no distinction)?

Yes, the idea would be to fail renaming slaves.

> How do users know which name to
> trust, depending on which wins the race more often? Say if kernel wants a
> ens3npry name while userspace wants it named as ens4.
>
> -Siwei

With this approach the kernel will deny attempts by userspace to rename slaves.
Slaves will always be named XXnsby and XXnpry. Master renames will rename
both slaves.

It seems pretty solid to me; the only issue is that in theory userspace can
use a name like XXnsby for something else. But this seems unlikely.

> > > I don't like the idea to delay exposing failover master
> > > until VF is hot plugged in (probably subject to various failures) later.
> > >
> > > Thanks,
> > > -Siwei
> >
> > I agree, this was not what I meant.
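The proposed naming scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: ToyFailover and slave_names are hypothetical names, the -EPERM result for a denied slave rename is an assumed choice (the thread does not name an error code), and the length check reflects the kernel's IFNAMSIZ limit of 16 bytes including the terminating NUL, which any appended suffix would have to respect.

```python
import errno

IFNAMSIZ = 16  # Linux limit; a name holds at most IFNAMSIZ - 1 characters

STANDBY_SUFFIX = "nsby"
PRIMARY_SUFFIX = "npry"


def slave_names(master_name):
    """Derive both slave names from the master name (hypothetical helper)."""
    names = {
        "standby": master_name + STANDBY_SUFFIX,
        "primary": master_name + PRIMARY_SUFFIX,
    }
    for name in names.values():
        if len(name) > IFNAMSIZ - 1:
            raise ValueError("derived name too long for IFNAMSIZ: " + name)
    return names


class ToyFailover:
    """Toy model of a failover master that owns the names of its slaves."""

    def __init__(self, name):
        self.slaves = slave_names(name)
        self.name = name

    def rename_master(self, new_name):
        # Validate the derived names first so that a failed rename
        # leaves the old master and slave names untouched.
        self.slaves = slave_names(new_name)
        self.name = new_name

    def rename_slave(self, role, new_name):
        # Step 1 of the proposal: userspace renames of slaves always fail.
        # The specific error code is an assumption made for this sketch.
        return -errno.EPERM


if __name__ == "__main__":
    fo = ToyFailover("ens3")
    print(fo.slaves)
    print(fo.rename_slave("primary", "ens4"))
    fo.rename_master("ens5")
    print(fo.slaves)
```

One consequence the sketch makes visible: a four-character suffix leaves only eleven characters for the master name, so a long predictable name on the master could not be extended this way without truncation or a shorter suffix.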
Liran Alon
2019-Feb-28 00:01 UTC
[virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
> On 28 Feb 2019, at 1:50, Michael S. Tsirkin <mst at redhat.com> wrote: > > On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote: >> >> >> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote: >>> On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote: >>>> >>>> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote: >>>>> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote: >>>>>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote: >>>>>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote: >>>>>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote: >>>>>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote: >>>>>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote: >>>>>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote: >>>>>>>>>>>> Sorry for replying to this ancient thread. There was some remaining >>>>>>>>>>>> issue that I don't think the initial net_failover patch got addressed >>>>>>>>>>>> cleanly, see: >>>>>>>>>>>> >>>>>>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268 >>>>>>>>>>>> >>>>>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was >>>>>>>>>>>> not specifically written for such kernel automatic enslavement. >>>>>>>>>>>> Specifically, if it is a bond or team, the slave would typically get >>>>>>>>>>>> renamed *before* virtual device gets created, that's what udev can >>>>>>>>>>>> control (without getting netdev opened early by the other part of >>>>>>>>>>>> kernel) and other userspace components for e.g. initramfs, >>>>>>>>>>>> init-scripts can coordinate well in between. 
The in-kernel
>>>>>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>>>>>> which don't provides a solution if user care about consistent naming
>>>>>>>>>>>> on the slave netdevs specifically.
>>>>>>>>>>>>
>>>>>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>>>>>> Above says:
>>>>>>>>>>>
>>>>>>>>>>> there's no motivation in the systemd/udevd community at
>>>>>>>>>>> this point to refactor the rename logic and make it work well with
>>>>>>>>>>> 3-netdev.
>>>>>>>>>>>
>>>>>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>>>>>
>>>>>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>>>>>> at least provide the direction in general for how this can be
>>>>>>>>>> solved...
>>>>>>> I was just wondering what did you mean when you said
>>>>>>> "refactor the rename logic and make it work well with 3-netdev" -
>>>>>>> was there a proposal udev rejected?
>>>>>> No. I never believed this particular issue can be fixed in userspace alone.
>>>>>> Previously someone had said it could be, but I never see any work or
>>>>>> relevant discussion ever happened in various userspace communities (for e.g.
>>>>>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>>>>>> of the issue derives from the kernel, it makes more sense to start from
>>>>>> netdev, work out and decide on a solution: see what can be done in the
>>>>>> kernel in order to fix it, then after that engage userspace community for
>>>>>> the feasibility...
>>>>>>
>>>>>>> Anyway, can we write a time diagram for what happens in which order that
>>>>>>> leads to failure? That would help look for triggers that we can tie
>>>>>>> into, or add new ones.
>>>>>>>
>>>>>> See attached diagram.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>>>>>> to only work with the master failover device.
>>>>>>>> Where does this expectation come from?
>>>>>>>>
>>>>>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>>>>>> predictable interface name. Third-party app which was built upon specifying
>>>>>>>> certain interface name can't be modified to chase dynamic names.
>>>>>>>>
>>>>>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>>>>>> offload settings post boot for specific workload. Those images won't work
>>>>>>>> well if the name is constantly changing just after couple rounds of live
>>>>>>>> migration.
>>>>>>> It should be possible to specify the ethtool configuration on the
>>>>>>> master and have it automatically propagated to the slave.
>>>>>>>
>>>>>>> BTW this is something we should look at IMHO.
>>>>>> I was elaborating a few examples that the expectation and assumption that
>>>>>> user/admin scripts only deal with master failover device is incorrect. It
>>>>>> had never been taken good care of, although I did try to emphasize it from
>>>>>> the very beginning.
>>>>>>
>>>>>> Basically what you said about propagating the ethtool configuration down to
>>>>>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>>>>>> now is any alternative that can also fix the specific udev rename problem,
>>>>>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>>>>>> scheme would take time to implement, while I'm trying to find a way out to
>>>>>> fix this particular naming problem under 3-netdev.
>>>>>>
>>>>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>>>>> just the vehicle). However, I recall there was resistance around this
>>>>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>>>>> 1-netdev is the only solution too soon.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>> Your scripts would not work at all then, right?
>>>>>> At this point we don't claim images with such usage as SR-IOV live
>>>>>> migrate-able. We would flag it as live migrate-able until this ethtool
>>>>>> config issue is fully addressed and a transparent live migration solution
>>>>>> emerges in upstream eventually.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>>>>> -Siwei
>>>>>>>>>>
>>>>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: virtio-dev-unsubscribe at lists.oasis-open.org
>>>>>>> For additional commands, e-mail: virtio-dev-help at lists.oasis-open.org
>>>>>>>
>>>>>> net_failover(kernel)                         | network.service (user)  | systemd-udevd (user)
>>>>>> ---------------------------------------------+-------------------------+------------------------------------
>>>>>> (standby virtio-net and net_failover         |                         |
>>>>>> devices created and initialized,             |                         |
>>>>>> i.e. virtnet_probe()->                       |                         |
>>>>>> net_failover_create()                        |                         |
>>>>>> was done.)                                   |                         |
>>>>>>                                              |                         |
>>>>>>                                              | runs `ifup ens3' ->     |
>>>>>>                                              | ip link set dev ens3 up |
>>>>>> net_failover_open()                          |                         |
>>>>>> dev_open(virtnet_dev)                        |                         |
>>>>>> virtnet_open(virtnet_dev)                    |                         |
>>>>>> netif_carrier_on(failover_dev)               |                         |
>>>>>> ...                                          |                         |
>>>>>>                                              |                         |
>>>>>> (VF hot plugged in)                          |                         |
>>>>>> ixgbevf_probe()                              |                         |
>>>>>> register_netdev(ixgbevf_netdev)              |                         |
>>>>>> netdev_register_kobject(ixgbevf_netdev)      |                         |
>>>>>> kobject_add(ixgbevf_dev)                     |                         |
>>>>>> device_add(ixgbevf_dev)                      |                         |
>>>>>> kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                         |
>>>>>> netlink_broadcast()                          |                         |
>>>>>> ...                                          |                         |
>>>>>> call_netdevice_notifiers(NETDEV_REGISTER)    |                         |
>>>>>> failover_event(..., NETDEV_REGISTER, ...)    |                         |
>>>>>> failover_slave_register(ixgbevf_netdev)      |                         |
>>>>>> net_failover_slave_register(ixgbevf_netdev)  |                         |
>>>>>> dev_open(ixgbevf_netdev)                     |                         |
>>>>>>                                              |                         |
>>>>>>                                              |                         |
>>>>>>                                              |                         | received ADD uevent from netlink fd
>>>>>>                                              |                         | ...
>>>>>>                                              |                         | udev-builtin-net_id.c:dev_pci_slot()
>>>>>>                                              |                         | (decided to renamed 'eth0' )
>>>>>>                                              |                         | ip link set dev eth0 name ens4
>>>>>> (dev_change_name() returns -EBUSY as         |                         |
>>>>>> ixgbevf_netdev->flags has IFF_UP)            |                         |
>>>>>>                                              |                         |
>>>>>>
>>>>> Given renaming slaves does not work anyway:
>>>> I was actually thinking what if we relieve the rename restriction just for
>>>> the failover slave? What the impact would be? I think users don't care about
>>>> slave being renamed when it's in use, especially the initial rename.
>>>> Thoughts?
>>>>
>>>>> would it work if we just
>>>>> hard-coded slave names instead?
>>>>>
>>>>> E.g.
>>>>> 1. fail slave renames
>>>>> 2. rename of failover to XX automatically renames standby to XXnsby
>>>>> and primary to XXnpry
>>>> That wouldn't help. The time when the failover master gets renamed, the VF
>>>> may not be present.
>>> In this scheme if VF is not there it will be renamed immediately after registration.
>> Who will be responsible to rename the slave, the kernel?
>
> That's the idea.
>
>> Note the master's
>> name may or may not come from the userspace. If it comes from the userspace,
>> should the userspace daemon change their expectation not to name/rename
>> _any_ slaves (today there's no distinction)?
>
> Yes the idea would be to fail renaming slaves.
>
>> How do users know which name to
>> trust, depending on which wins the race more often? Say if kernel wants a
>> ens3npry name while userspace wants it named as ens4.
>>
>> -Siwei
>
> With this approach kernel will deny attempts by userspace to rename
> slaves. Slaves will always be named XXXnsby and XXnpry. Master renames
> will rename both slaves.
>
> It seems pretty solid to me, the only issue is that in theory userspace
> can use a name like XXXnsby for something else. But this seems unlikely.

I'm fond of this idea and I have similar opinion. I think it simplifies the
issue here. I don't see a real reason for customer to define udev rule to
rename a net-failover slave to have different postfix.

-Liran

>
>
>>>
>>>> I don't like the idea to delay exposing failover master
>>>> until VF is hot plugged in (probably subject to various failures) later.
>>>>
>>>> Thanks,
>>>> -Siwei
>>>
>>> I agree, this was not what I meant.
>>>
>>>>>
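For readers following the naming proposal, here is a minimal sketch of the rule as described in the thread: the standby slave becomes the master name plus "nsby" and the primary becomes the master name plus "npry". It is written in Python only for illustration; nothing like it exists in the kernel as quoted here. The function name `slave_names` is hypothetical. The one outside fact used is that Linux interface names must fit in IFNAMSIZ (16 bytes including the trailing NUL), which means a fixed 4-character suffix limits the master name to 11 characters.

```python
from typing import Tuple

IFNAMSIZ = 16  # Linux interface-name buffer size, including the trailing NUL

STANDBY_SUFFIX = "nsby"  # suffix proposed in the thread for the standby slave
PRIMARY_SUFFIX = "npry"  # suffix proposed in the thread for the primary slave


def slave_names(master: str) -> Tuple[str, str]:
    """Return (standby, primary) names derived from the failover master name.

    Hypothetical helper: it only models the naming rule discussed above.
    """
    names = (master + STANDBY_SUFFIX, master + PRIMARY_SUFFIX)
    for name in names:
        # A name must leave room for the NUL terminator.
        if len(name) >= IFNAMSIZ:
            raise ValueError(f"{name!r} does not fit in IFNAMSIZ")
    return names


if __name__ == "__main__":
    print(slave_names("ens3"))
```

With a master called ens3 this yields ens3nsby and ens3npry. A master name of 12 or more characters cannot be suffixed this way, which is one practical constraint any such scheme would have to handle.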
Stephen Hemminger
2019-Feb-28 00:03 UTC
[virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
On Wed, 27 Feb 2019 18:50:44 -0500 "Michael S. Tsirkin" <mst at redhat.com> wrote:

> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
> > How do users know which name to
> > trust, depending on which wins the race more often? Say if kernel wants a
> > ens3npry name while userspace wants it named as ens4.
> >
> > -Siwei
>
> With this approach kernel will deny attempts by userspace to rename
> slaves. Slaves will always be named XXXnsby and XXnpry. Master renames
> will rename both slaves.
>
> It seems pretty solid to me, the only issue is that in theory userspace
> can use a name like XXXnsby for something else. But this seems unlikely.

Similar schemes (with kernel providing naming) were also previously rejected
upstream. It has been a consistent theme that the kernel should not be in the
renaming business. It will certainly break userspace.
Michael S. Tsirkin
2019-Feb-28 00:41 UTC
[virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
On Wed, Feb 27, 2019 at 04:38:00PM -0800, si-wei liu wrote:
> On 2/27/2019 3:50 PM, Michael S. Tsirkin wrote:
> > On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
> > > Note the master's
> > > name may or may not come from the userspace. If it comes from the userspace,
> > > should the userspace daemon change their expectation not to name/rename
> > > _any_ slaves (today there's no distinction)?
> > Yes the idea would be to fail renaming slaves.
> No I was asking about the userspace expectation: whether it should track and
> detect the lifecycle events of failover slaves and decide what to do. How
> does it get back to the user specified name if VF is not enslaved (say
> someone unloads the virtio-net module)?

When virtio net is removed VF will shortly be removed too.

> As this scheme adds much complexity to the kernel naming convention
> (currently it's just ethX names) that no userspace can understand.

Anything that pokes at slaves needs to be specially designed anyway.
Naming seems like a minor issue.

> Will the
> change break userspace further?
>
> -Siwei

Didn't you show userspace is already broken. You can't "further
break it", rename already fails.
Jakub Kicinski
2019-Feb-28 00:52 UTC
[virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
On Wed, 27 Feb 2019 19:41:32 -0500, Michael S. Tsirkin wrote:
> > As this scheme adds much complexity to the kernel naming convention
> > (currently it's just ethX names) that no userspace can understand.
>
> Anything that pokes at slaves needs to be specially designed anyway.
> Naming seems like a minor issue.

Can the users who care about the naming put net_failover into "user space
will do the bond enslavement" mode, and do the bond creation/management
themselves from user space (in systemd/ Network Manager) based on the
failover flag?
Michael S. Tsirkin
2019-Feb-28 14:26 UTC
[virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
On Thu, Feb 28, 2019 at 01:32:12AM -0800, si-wei liu wrote:
> > > Will the
> > > change break userspace further?
> > >
> > > -Siwei
> > Didn't you show userspace is already broken. You can't "further
> > break it", rename already fails.
> It's a race, userspace tends to give slave a user(space) desired name but
> sometimes may fail due to this race. Today if failover master is not up,
> rename would succeed anyway. While what you proposed prohibits user from
> providing a name in all circumstances if I understand you correctly. That's
> what I meant of breaking userspace further. On the other hand, you seem to
> tighten the kernel default naming to udev predictable names, which is
> derived from only recent systemd-udevd, while there exists many possible
> userspace naming schemes out of that. Users today who deliberately chooses
> to disable predictable naming (net.ifnames=0 biosdevname=0) and fall back to
> kernel provided names would expect the ethX pattern, with this change
> admin/user scripts which matches the ethX pattern could potentially break.

Whatever crashes with a name not matching ethX will crash on the
standby interface *anyway*.

So I think what you are saying is that someone might have already
written scripts and gotten them to work on v4.17 when STANDBY was
included and these scripts rely on ethX. Now these scripts
will break. Maybe it is still early enough (just half a year passed)
that the number of these users would be small.

So how about a kernel config option and maybe a module parameter to rename
the primary? People can then opt in to the old broken behaviour.

--
MST
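The race described above can be modelled in a few lines. This is a toy sketch in Python, not kernel code: the `Netdev` class and `hotplug_vf` function are hypothetical names, and the only behaviour modelled is what the thread states, namely that the slave is opened during registration when the failover master is already up, and that a rename of a device with IFF_UP set is refused with EBUSY.

```python
import errno

IFF_UP = 0x1  # "interface is up" flag bit


class Netdev:
    """Toy stand-in for a net device; only name and flags are modelled."""

    def __init__(self, name):
        self.name = name
        self.flags = 0

    def open(self):
        self.flags |= IFF_UP

    def change_name(self, new_name):
        # Models the check from the time diagram: a running device
        # cannot be renamed, so the request fails with EBUSY.
        if self.flags & IFF_UP:
            raise OSError(errno.EBUSY, "device is up")
        self.name = new_name


def hotplug_vf(master_up, udev_name):
    """Hot plug a VF, then let 'udev' try to rename it.

    Returns (final_name, error_code), where error_code is 0 on success.
    """
    vf = Netdev("eth0")
    # Registration step: the slave is opened right away when the
    # failover master is already up (the case shown in the diagram).
    if master_up:
        vf.open()
    # udev only sees the ADD uevent afterwards and attempts the rename.
    try:
        vf.change_name(udev_name)
    except OSError as exc:
        return vf.name, exc.errno
    return vf.name, 0


if __name__ == "__main__":
    print(hotplug_vf(master_up=True, udev_name="ens4"))
    print(hotplug_vf(master_up=False, udev_name="ens4"))
```

In this model the outcome depends only on whether the master was up when the VF arrived, which is why the same image can end up with either eth0 or ens4 for the VF.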
Michael S. Tsirkin
2019-Mar-01 13:27 UTC
[virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
On Thu, Feb 28, 2019 at 05:30:56PM -0800, si-wei liu wrote:
>
>
> On 2/28/2019 6:26 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 28, 2019 at 01:32:12AM -0800, si-wei liu wrote:
> > > > > Will the
> > > > > change break userspace further?
> > > > >
> > > > > -Siwei
> > > > Didn't you show userspace is already broken. You can't "further
> > > > break it", rename already fails.
> > > It's a race, userspace tends to give slave a user(space) desired name but
> > > sometimes may fail due to this race. Today if failover master is not up,
> > > rename would succeed anyway. While what you proposed prohibits user from
> > > providing a name in all circumstances if I understand you correctly. That's
> > > what I meant of breaking userspace further. On the other hand, you seem to
> > > tighten the kernel default naming to udev predictable names, which is
> > > derived from only recent systemd-udevd, while there exists many possible
> > > userspace naming schemes out of that. Users today who deliberately chooses
> > > to disable predictable naming (net.ifnames=0 biosdevname=0) and fall back to
> > > kernel provided names would expect the ethX pattern, with this change
> > > admin/user scripts which matches the ethX pattern could potentially break.
> > Whatever crashes with a name not matching ethX will crash on the
> > standby interface *anyway*.
> With udev predictable naming disabled they should not. It's not hard for
> user to look for device attribute to persistent the name well, in a
> consistent and reliable way.

Well that's special code for failover already. So far we just taught userspace to skip renaming slave interfaces.

> >
> > So I think what you are saying is that someone might have already
> > written scripts and gotten them to work on v4.17 when STANDBY was
> > included and these scripts rely on ethX. Now these scripts
> > will break.
> The controversial part is the new kernel naming pattern. Initially I thought
> there shouldn't be such crazy scripts relying on the pattern, but when I
> worked on cloud-init it I realized that there's already a lot of software
> taking assumption around the 'eth0' name. In the past I've seen random
> scripts that parses the ethX name assumes (incorrectly) the name ends up
> with digits, or even the digits and name are 1:1 mapped. Of course, you can
> say these are bugs in scripts themselves.

No what I say is that they will crash on rename of standby too.

> Anyway, I'll let others in the netdev to comment on this new scheme, maybe
> that's the concern of merely myself. The good part of your proposal is that
> we can get consistent slave name, which still plays its role until we move
> towards making slave names less relevant, i.e. ideally a 1-netdev model. I
> think we both agree that the master matters more than the slave names.
> >
> > Maybe it is still early enough (just half a year passed) that the
> > number of these users would be small. So how about a kernel config
> > option and maybe a module parameter to rename the primary? People can
> > then opt in to the old broken behaviour.
> Were I could I would ask why a similar opt-in (kernel config or module
> parameter) couldn't be implemented to open up the rename restriction on
> slave, net_failover in particular. What I felt about this rename restriction
> was more because of historical reason than anything else, while net_failover
> is comparatively a new type of link that we are now designing proper use
> case it should support, and can get it shaped to whatever it fits. My
> personal view is that the slave can't be renamed when master is running is
> just implementation details that got incorrectly exposed to userspace apps
> for many years. It's old behavior with historical reason for sure, but I
> don't think this applies to net_failover.
>
> (FWIW as one previous bond maintainer for another OS, we relieved the rename
> restriction slaves 13 year ago, while no single complaint or issue was ever
> raised because of this change over the years, neither from the customers of
> tens of millions of installation base, nor the FOSS software running atop.
> Of course, Linux is different so that experience doesn't count.)
>
> Thanks,
> -Siwei
>
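To illustrate the kind of script fragility discussed above, here is a small sketch contrasting a parser that assumes every interface is named ethN with one that reports a non-matching name explicitly. Both function names are made up for this example and are not taken from cloud-init or any other tool.

```python
import re
from typing import Optional

_ETHX = re.compile(r"eth(\d+)")


def naive_index(name: str) -> int:
    """Brittle: assumes a three-letter prefix followed only by digits."""
    return int(name[3:])


def eth_index(name: str) -> Optional[int]:
    """Return N for a name of the exact form ethN, else None."""
    match = _ETHX.fullmatch(name)
    return int(match.group(1)) if match else None
```

With these definitions `naive_index("ens4")` silently returns 4 as if the device were eth4, and `naive_index("enp0s3")` raises `ValueError`, while `eth_index` returns `None` for both, so the caller can fall back to a device attribute instead of the name.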