Alexander Duyck
2018-Feb-27 21:16 UTC
[RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Tue, Feb 27, 2018 at 12:49 AM, Jiri Pirko <jiri at resnulli.us> wrote:
> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.duyck at gmail.com wrote:
>> On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <jiri at resnulli.us> wrote:
>>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala at intel.com wrote:
>>>> Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>>> used by the hypervisor to indicate that the virtio_net interface should
>>>> act as a backup for another device with the same MAC address.
>>>>
>>>> Patch 2 is in response to the community request for a 3-netdev
>>>> solution. However, it creates some issues we'll get into in a moment.
>>>> It extends virtio_net to use an alternate datapath when one is available
>>>> and registered. When the BACKUP feature is enabled, the virtio_net driver
>>>> creates an additional 'bypass' netdev that acts as a master device and
>>>> controls 2 slave devices. The original virtio_net netdev is registered
>>>> as the 'backup' netdev and a passthru/VF device with the same MAC gets
>>>> registered as the 'active' netdev. Both 'bypass' and 'backup' netdevs
>>>> are associated with the same 'pci' device. The user accesses the network
>>>> interface via the 'bypass' netdev. The 'bypass' netdev chooses the
>>>> 'active' netdev as the default for transmits when it is available with
>>>> link up and running.
>>>
>>> Sorry, but this is ridiculous. You are apparently re-implementing part
>>> of the bonding driver as part of a NIC driver. The bond and team drivers
>>> are mature solutions, well tested, broadly used, with lots of issues
>>> resolved in the past. What you are trying to introduce is a weird
>>> shortcut that already has a couple of issues, as you mentioned, and will
>>> certainly have many more. Also, I'm pretty sure that in the future
>>> someone will come up with ideas like multiple VFs, LACP and similar
>>> bonding things.
>>
>> The problem with the bond and team drivers is that they are too large
>> and have too many interfaces available for configuration, so as a
>> result they can really screw this interface up.
>>
>> Essentially this is meant to be a bond that is more or less managed by
>> the host, not the guest. We want the host to be able to configure it
>> and have it automatically kick in on the guest. For now we want to
>> avoid adding too much complexity, as this is meant to be just the
>> first step. Trying to go in and implement the whole solution right
>> from the start based on existing drivers is going to be a massive time
>> sink and will likely never get completed, because there is always
>> going to be some other thing that will interfere.
>>
>> My personal hope is that we can look at doing a virtio-bond sort of
>> device that will handle all this as well as providing a communication
>> channel, but that is much further down the road. For now we only have
>> a single bit, so the goal is to keep this as simple as possible.
>
> I have another use case that would require the solution to be different
> than what you suggest. Consider the following scenario:
> - the baremetal has 2 SR-IOV NICs
> - there is a VM that has 1 VF from each NIC: vf0, vf1. No virtio_net
> - the baremetal would like to somehow tell the VM to bond vf0 and vf1
>   together, and how this bonding should be configured, according to how
>   the VF representors are configured on the baremetal (LACP for example)
>
> The baremetal could decide to remove any VF during the VM runtime; it
> can add another VF there. For migration, it can add virtio_net. The VM
> should be instructed to bond all interfaces together according to what
> the baremetal decided - as it knows better.
>
> For this we need a separate communication channel from the baremetal to
> the VM (perhaps something re-usable already exists), we need something
> to listen to the events coming from this channel (kernel/userspace) and
> to react accordingly (create bond/team, enslave, etc).
>
> Now the question is: is it possible to merge the demands you have and
> the generic needs I described into a single solution? From what I see,
> that would be quite hard/impossible. So in the end, I think that we
> have to end up with 2 solutions:
> 1) virtio_net, netvsc in-driver bonding - a very limited, stupid,
>    0config solution that works for all (no matter what OS you use in
>    the VM)
> 2) a team/bond solution with the assistance of, preferably, a userspace
>    daemon getting info from the baremetal. This is not 0config, but
>    minimal config - the user just has to define that this "magic
>    bonding" should be on. This covers all possible use cases, including
>    multiple VFs, RDMA, etc.
>
> Thoughts?

So that is about what I had in mind. We end up having to do something
completely different to support this more complex solution. I think we
might have referred to it as v2/v3 in a different thread, and virt-bond
in this thread.

Basically we need some sort of PCI or PCIe topology mapping for the
devices that can be translated into something we can communicate over
the communication channel. After that we also have the added complexity
of how we figure out which Tx path we want to choose. This is one of the
reasons why I was thinking of something like an eBPF blob that is handed
up from the host side into the guest to select the Tx queue. That way,
when we add some new approach such as a NUMA/CPU based netdev selection,
we just provide an eBPF blob that does that. Most of this is just
theoretical at this point, though, since I haven't had a chance to look
into it too deeply yet. If you want to take something like this on, the
help would always be welcome. :)

The other thing I am looking at is trying to find a good way to do dirty
page tracking in the hypervisor using something like a para-virtual
IOMMU. However, I don't have any ETA on that as I am just starting out
and have limited development time. If we get that in place, we can leave
the VF in the guest until the very last moments instead of having to
remove it before we start the live migration.

- Alex
Michael S. Tsirkin
2018-Feb-27 21:23 UTC
[RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Tue, Feb 27, 2018 at 01:16:21PM -0800, Alexander Duyck wrote:
> The other thing I am looking at is trying to find a good way to do
> dirty page tracking in the hypervisor using something like a
> para-virtual IOMMU. However I don't have any ETA on that as I am just
> starting out and have limited development time. If we get that in
> place we can leave the VF in the guest until the very last moments
> instead of having to remove it before we start the live migration.
>
> - Alex

I actually think your old RFC would be a good starting point:
https://lkml.org/lkml/2016/1/5/104

What is missing, I think, is enabling/disabling it dynamically. That
seems easier than tracking by the hypervisor.

-- MST
Jakub Kicinski
2018-Feb-27 21:41 UTC
[RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote:
> Basically we need some sort of PCI or PCIe topology mapping for the
> devices that can be translated into something we can communicate over
> the communication channel.

Hm. This is probably a completely stupid idea, but if we need to start
marshalling configuration requests/hints, maybe the entire problem could
be solved by opening a netlink socket from the hypervisor? Even make
teamd run on the hypervisor side...
Jiri Pirko
2018-Feb-28 07:08 UTC
[RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Tue, Feb 27, 2018 at 10:41:49PM CET, kubakici at wp.pl wrote:
> On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote:
>> Basically we need some sort of PCI or PCIe topology mapping for the
>> devices that can be translated into something we can communicate over
>> the communication channel.
>
> Hm. This is probably a completely stupid idea, but if we need to
> start marshalling configuration requests/hints maybe the entire problem
> could be solved by opening a netlink socket from hypervisor? Even make
> teamd run on the hypervisor side...

Interesting. That would be trickier than just forwarding 1 genetlink
socket to the hypervisor. Also, I think the solution should handle
multiple guest OSes. What I'm thinking about is some generic bonding
description passed over some communication channel into the VM. The VM
either uses it for configuration, or ignores it if it is not smart
enough/updated enough.