> On 21 Mar 2019, at 0:10, Michael S. Tsirkin <mst at redhat.com> wrote:
> 
> On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
>> 
>>> On 20 Mar 2019, at 16:09, Michael S. Tsirkin <mst at redhat.com> wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
>>>> 
>>>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin <mst at redhat.com> wrote:
>>>>> 
>>>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>>>>>> 
>>>>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst at redhat.com> wrote:
>>>>>>> 
>>>>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>>>>>> Liran Alon <liran.alon at oracle.com> wrote:
>>>>>>>> 
>>>>>>>>> b.3) cloud-init: If configured to perform network configuration, it attempts to configure all available netdevs. It should however avoid doing so on net-failover slaves.
>>>>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist the Mellanox VF driver. However, this technique doesn't work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver.)
>>>>>>>> 
>>>>>>>> Cloud-init should really just ignore all devices that have a master device.
>>>>>>>> That would have been more general, and safer for other use cases.
>>>>>>> 
>>>>>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>>>>>> safer to just somehow pretend to userspace that the slave links are
>>>>>>> down? And add a special attribute for the actual link state.
>>>>>> 
>>>>>> I think this may be problematic as it would also break the legit use case
>>>>>> of userspace attempting to set various config on the VF slave.
>>>>>> In general, lying to userspace usually leads to problems.
>>>>> 
>>>>> I hear you on this. So how about instead of lying,
>>>>> we basically just fail some accesses to slaves
>>>>> unless a flag is set e.g. in ethtool.
>>>>> 
>>>>> Some userspace will need to change to set it but in a minor way.
>>>>> Arguably/hopefully failure to set config would generally be a safer
>>>>> failure.
>>>> 
>>>> Once userspace sets this new flag via ethtool, all operations done by other userspace components will still work.
>>> 
>>> Sorry about being unclear, the idea would be to require the flag on each ethtool operation.
>> 
>> Oh. I had indeed misunderstood your previous email then. :)
>> Thanks for clarifying.
>> 
>>>> E.g. running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.
>>> 
>>> I think sending/receiving should probably just fail unconditionally.
>> 
>> You mean that you wish the kernel to somehow prevent Tx on a net-failover slave netdev
>> unless the skb is marked with some flag indicating it has been sent via the net-failover master?
> 
> We can maybe avoid binding a protocol socket to the device?

That is indeed another possibility that would work to avoid the DHCP issues,
and it will still allow checking connectivity. So it is better.
However, I still think it provides a non-intuitive customer experience.
In addition, I also want to take into account that most customers expect a 1:1 mapping between a vNIC and a netdev.
I.e. a cloud instance should show one netdev if it has one vNIC attached to it.
Customers usually don't care how they get accelerated networking. They just care that they do.
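(For illustration only, not part of the original thread: a minimal Python sketch of what "avoid binding a protocol socket to the device" would mean for a dhclient-like tool. SO_BINDTODEVICE is the real Linux socket option such tools use to tie their socket to one interface; the interface name below is made up, and the idea that the bind is refused for a net-failover slave is an assumption about the proposal, not current kernel behaviour.)

import socket

SO_BINDTODEVICE = 25  # Linux socket option number, from <asm-generic/socket.h>

def try_bind_to_device(ifname):
    """Bind a UDP socket to one netdev, the way DHCP clients do.
    Needs CAP_NET_RAW. Under the proposal above, doing this on a
    net-failover slave would fail unless done via the failover master."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.setsockopt(socket.SOL_SOCKET, SO_BINDTODEVICE,
                     ifname.encode() + b"\0")
        return True
    except OSError as err:
        print(f"bind to {ifname} refused: {err}")
        return False
    finally:
        s.close()

try_bind_to_device("ens4")  # 'ens4' is a hypothetical failover slave name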
> 
>> This indeed resolves the group of userspace issues around performing DHCP on net-failover slaves directly (by dracut/initramfs, dhclient, etc.).
>> 
>> However, I see a couple of down-sides to it:
>> 1) It doesn't resolve all userspace issues listed in this email thread. For example, cloud-init will still attempt to perform network config on net-failover slaves.
>> It also doesn't help with regard to Ubuntu's netplan issue that creates udev rules that match only by MAC.
> 
> How about we fail to retrieve mac from the slave?

That would work, but I think it is cleaner to just not pair the PV and VF based on having the same MAC.

> 
>> 2) It brings a non-intuitive customer experience. For example, a customer may attempt to analyse a connectivity issue by checking the connectivity
>> on a net-failover slave (e.g. the VF) but will see no connectivity, when in fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>> 
>> The set of changes I envision to fix our issues is:
>> 1) Hide net-failover slaves in a different netns created and managed by the kernel, but one the user can enter in order to manage the netdevs there if they wish to do so explicitly
>> (e.g. configure the net-failover VF slave in some special way).
>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC (similar to what is done in NetVSC). E.g. provide a virtio-net interface to get the PCI slot where the matching VF will be hot-plugged by the hypervisor.
>> 3) Have an explicit virtio-net control message to command the hypervisor to switch the data-path from virtio-net to VF and vice-versa, instead of relying on intercepting the PCI master enable-bit
>> as an indicator of when the VF is about to be set up (similar to what is done in NetVSC).
>> 
>> Is there any clear issue we see regarding the above suggestion?
>> 
>> -Liran
> 
> The issue would be this: how do we avoid conflicting with namespaces
> created by users?

This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
To reference a hidden netns, you need to do it explicitly.
Hidden and normal netns names can collide, as they will be maintained in different namespaces (yes, I'm overloading the term namespace here...).

Does this seem reasonable?

-Liran

> 
>>>> Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
>>>> But what we actually want is to never allow a net-failover slave to be operated on by userspace unless it is explicitly stated
>>>> by userspace that it wishes to perform a set of actions on the net-failover slave.
>>>> 
>>>> Something that would be achieved if, for example, the net-failover slaves were in a different netns than the default netns.
>>>> This also aligns with the expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
>>>> But of course maybe there are other ideas that can achieve similar behaviour.
>>>> 
>>>> -Liran
>>>> 
>>>>> Which things to fail? Probably sending/receiving packets? Getting MAC?
>>>>> More?
>>>>> 
>>>>>> If we reach
>>>>>> a scenario where we try to avoid userspace issues generically and
>>>>>> not on a per-userspace-component basis, I believe the right path should be
>>>>>> to hide the net-failover slaves such that explicit action is required
>>>>>> to actually manipulate them (as described in the blog post). E.g.
>>>>>> automatically move net-failover slaves by the kernel to a different netns.
>>>>>> 
>>>>>> -Liran
>>>>>> 
>>>>>>> --
>>>>>>> MST
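(As an aside: Stephen's suggestion quoted above, that cloud-init "should really just ignore all devices that have a master device", can be approximated from userspace with the 'master' link the kernel exposes in sysfs for enslaved netdevs. A rough Python sketch, assuming net-failover slaves expose that link the same way bonding/team slaves do; this is not cloud-init's actual code.)

import os

SYSFS_NET = "/sys/class/net"

def has_master(ifname):
    """True if the netdev is enslaved to another device (bond, team,
    net-failover, ...); the kernel then exposes a 'master' symlink."""
    return os.path.islink(os.path.join(SYSFS_NET, ifname, "master"))

def configurable_netdevs():
    """Netdevs a cloud-init-like tool could configure: skip loopback and
    anything with a master, instead of blacklisting by PCI driver."""
    for ifname in sorted(os.listdir(SYSFS_NET)):
        if ifname != "lo" and not has_master(ifname):
            yield ifname

print(list(configurable_netdevs()))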
On Thu, Mar 21, 2019 at 12:19:22AM +0200, Liran Alon wrote:
> 
> > On 21 Mar 2019, at 0:10, Michael S. Tsirkin <mst at redhat.com> wrote:
> > 
> > On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:

[...]

> >> It also doesn't help with regard to Ubuntu's netplan issue that creates udev rules that match only by MAC.
> > 
> > How about we fail to retrieve mac from the slave?
> 
> That would work, but I think it is cleaner to just not pair the PV and VF based on having the same MAC.

There's a reference to that under "Non-MAC based pairing".

I'll look into making it more explicit.

[...]

> > The issue would be this: how do we avoid conflicting with namespaces
> > created by users?
> 
> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
> To reference a hidden netns, you need to do it explicitly.
> Hidden and normal netns names can collide, as they will be maintained in different namespaces (yes, I'm overloading the term namespace here...).

Maybe it's an unnamed namespace. Hidden until userspace gives it a name?

> Does this seem reasonable?
> 
> -Liran

Reasonable I'd say yes, easy to implement probably no. But maybe I
missed a trick or two.
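(To make the "Non-MAC based pairing" direction referenced above a bit more concrete, a rough Python sketch of pairing a PV netdev with its VF by PCI address instead of by MAC. The /sys/class/net/<ifname>/device symlink is existing sysfs ABI; the idea that the PV device advertises the PCI address at which the hypervisor hot-plugs the VF is exactly the proposal under discussion, so expected_pci_addr below is a placeholder for that yet-to-be-defined hint.)

import os

SYSFS_NET = "/sys/class/net"

def pci_address(ifname):
    """PCI address (e.g. '0000:00:05.0') of the device backing a netdev,
    taken from its /sys/class/net/<ifname>/device symlink; None for
    purely software netdevs."""
    try:
        dev = os.readlink(os.path.join(SYSFS_NET, ifname, "device"))
        return os.path.basename(dev)
    except OSError:
        return None

def find_vf_for(pv_ifname, expected_pci_addr):
    """Pick the slave to enslave by the PCI address the hypervisor promised
    (the assumed PV-side hint), rather than by matching MAC addresses."""
    for ifname in os.listdir(SYSFS_NET):
        if ifname != pv_ifname and pci_address(ifname) == expected_pci_addr:
            return ifname
    return None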
> On 21 Mar 2019, at 10:58, Michael S. Tsirkin <mst at redhat.com> wrote:
> 
> On Thu, Mar 21, 2019 at 12:19:22AM +0200, Liran Alon wrote:

[...]

>> That would work, but I think it is cleaner to just not pair the PV and VF based on having the same MAC.
> 
> There's a reference to that under "Non-MAC based pairing".
> 
> I'll look into making it more explicit.

Yes, I know. I was referring to what you described in that section.

[...]

>> Hidden and normal netns names can collide, as they will be maintained in different namespaces (yes, I'm overloading the term namespace here...).
> 
> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?

This is also a good idea that would solve the issue. Yes.

>> Does this seem reasonable?
>> 
>> -Liran
> 
> Reasonable I'd say yes, easy to implement probably no. But maybe I
> missed a trick or two.

BTW, from a practical point of view, I think that even until we figure out how to implement this, it would be better to create a kernel auto-generated netns name (e.g. "kernel_net_failover_slaves") that would break only those userspace workloads that, by a very rare chance, have a netns name that collides with it, rather than the breakage we have today across the various userspace components.

-Liran
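(To make the intended end state concrete: what the kernel would automate under this proposal can already be approximated by hand with iproute2, by moving the failover slaves into a dedicated netns and only touching them there explicitly. A Python sketch; the slave interface names are hypothetical, the netns name is the one suggested above, and this models the desired behaviour rather than an existing kernel feature. Requires root.)

import subprocess

NETNS = "kernel_net_failover_slaves"   # name suggested in the message above
SLAVES = ["ens5", "ens6"]              # hypothetical net-failover slave netdevs

def sh(*args):
    print("+", " ".join(args))
    subprocess.run(args, check=True)

# What the kernel would do automatically under the proposal:
sh("ip", "netns", "add", NETNS)
for dev in SLAVES:
    sh("ip", "link", "set", "dev", dev, "netns", NETNS)

# The slaves disappear from the default namespace, so dhclient, cloud-init
# and netplan only ever see the failover master.  A user who explicitly
# wants to manage a slave enters the namespace:
sh("ip", "-n", NETNS, "link", "show")
sh("ip", "netns", "exec", NETNS, "ethtool", "-i", SLAVES[0])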