thr3ads.net - Linux Virtualization - [summary] virtio network device failover writeup [Mar 2019]

Liran Alon

2019-Mar-21 14:16 UTC

[summary] virtio network device failover writeup

> On 21 Mar 2019, at 15:51, Michael S. Tsirkin <mst at redhat.com> wrote:
> 
> On Thu, Mar 21, 2019 at 03:24:39PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 15:12, Michael S. Tsirkin <mst at redhat.com> wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 21 Mar 2019, at 14:57, Michael S. Tsirkin <mst at redhat.com> wrote:
>>>>> 
>>>>> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
>>>>>> 
>>>>>> 
>>>>>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst at redhat.com> wrote:
>>>>>>> 
>>>>>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>>>>>>>>>>>> 2) It brings a non-intuitive customer experience. For example, a customer may attempt to analyse a connectivity issue by checking the connectivity
>>>>>>>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity, when in fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>>>>>>>>>>>> 
>>>>>>>>>>>> The set of changes I envision to fix our issues are:
>>>>>>>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. The user can still enter it and manage the netdevs there if they wish to do so explicitly
>>>>>>>>>>>> (e.g. to configure the net-failover VF slave in some special way).
>>>>>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC (similar to what is done in NetVSC). E.g. provide a virtio-net interface to get the PCI slot where the matching VF will be hot-plugged by the hypervisor.
>>>>>>>>>>>> 3) Have an explicit virtio-net control message to command the hypervisor to switch the data path from virtio-net to VF and vice versa, instead of relying on intercepting the PCI master enable-bit
>>>>>>>>>>>> as an indicator of when the VF is about to be set up (similar to what is done in NetVSC).
>>>>>>>>>>>> 
>>>>>>>>>>>> Is there any clear issue we see regarding the above suggestion?
>>>>>>>>>>>> 
>>>>>>>>>>>> -Liran
>>>>>>>>>>> 
>>>>>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
>>>>>>>>>>> created by users?
>>>>>>>>>> 
>>>>>>>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
>>>>>>>>>> To reference a hidden netns, you need to do it explicitly.
>>>>>>>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (yes, I'm overloading the term namespace here...).
>>>>>>>>> 
>>>>>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>>>>>>>> 
>>>>>>>> This is also a good idea that will solve the issue. Yes.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Does this seem reasonable?
>>>>>>>>>> 
>>>>>>>>>> -Liran
>>>>>>>>> 
>>>>>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>>>>>>>> missed a trick or two.
>>>>>>>> 
>>>>>>>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
>>>>>>>> it would be better to create a kernel auto-generated name (e.g. "kernel_net_failover_slaves")
>>>>>>>> that will break only userspace workloads that by a very rare chance have a netns that collides with this, than
>>>>>>>> the breakage we have today for the various userspace components.
>>>>>>>> 
>>>>>>>> -Liran
>>>>>>> 
>>>>>>> It seems quite easy to supply that as a module parameter. Do we need two
>>>>>>> namespaces though? Won't some userspace still be confused by the two
>>>>>>> slaves sharing the MAC address?
>>>>>> 
>>>>>> That's one reasonable option.
>>>>>> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device,
>>>>>> i.e. expose a new virtio-net property that specifies the PCI slot of the VF to be bonded with.
>>>>>> 
>>>>>> The second seems cleaner but I don't have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from the current state of things.
>>>>>> 
>>>>>> -Liran
>>>>> 
>>>>> OK. Now what happens if master is moved to another namespace? Do we need
>>>>> to move the slaves too?
>>>> 
>>>> No. Why would we move the slaves?
>>> 
>>> 
>>> The reason we have a 3-device model at all is so users can fine tune the
>>> slaves.
>> 
>> I agree.
>> 
>>> I don't see why this applies to the root namespace but not
>>> a container. If it has access to failover it should have access
>>> to slaves.
>> 
>> Oh, now I see your point. I hadn't thought about container usage.
>> My thinking was that the customer can always just enter the "hidden" netns and configure whatever they want there.
>> 
>> Do you have a suggestion how to handle this?
>> 
>> One option can be that every "visible" netns on the system will have a "hidden" unnamed netns where the net-failover slaves reside.
>> If a customer wishes to be able to enter that netns and manage the net-failover slaves explicitly, they will need an updated iproute2
>> that knows how to enter that hidden netns. Most customers won't ever need to enter that netns, and thus it is OK that they don't
>> have this updated iproute2.
> 
> Right so slaves need to be moved whenever master is moved.
> 
> Given the amount of mess involved, should we just teach
> userspace to create the hidden netns and move slaves there?

That's a good question.

However, I believe it is easier and more suitable for this to happen in the kernel. This is because:
1) The implementation is generic across all distros.
2) We seem to discover more and more issues with userspace as we keep testing this on various distros, configurations and workloads.
3) It seems weird that the kernel does some things automagically and not others. I.e. the kernel automatically binds the virtio-net and VF to the net-failover master
and automatically opens the net-failover slave when the net-failover master is opened, but it doesn't care about the consequences these actions have on userspace.
Therefore, I propose we go "all in": the kernel should also be responsible for hiding its artefacts unless customer userspace explicitly wants to view and manipulate them.
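
As a rough illustration of the behaviour under discussion (not an existing kernel interface), the Python sketch below models each visible netns as having a hidden companion that holds the net-failover slaves, with the slaves following the master when it is moved. The class `Host`, its methods and all device names are hypothetical and exist only for this example.

```python
class Host:
    """Toy model: every visible netns has a hidden companion netns for slaves."""

    def __init__(self):
        self.visible = {}    # netns name -> set of netdevs listed by default
        self.hidden = {}     # netns name -> set of slaves in its hidden companion
        self.slaves_of = {}  # failover master -> list of its slave netdevs

    def add_netns(self, name):
        self.visible[name] = set()
        self.hidden[name] = set()

    def register_failover(self, netns, master, slaves):
        # The master stays visible; the slaves are hidden automatically.
        self.visible[netns].add(master)
        self.hidden[netns].update(slaves)
        self.slaves_of[master] = list(slaves)

    def move_master(self, master, src, dst):
        # Slaves follow the master into the destination's hidden companion.
        self.visible[src].remove(master)
        self.visible[dst].add(master)
        for slave in self.slaves_of[master]:
            self.hidden[src].remove(slave)
            self.hidden[dst].add(slave)

    def list_devices(self, netns, include_hidden=False):
        # Slaves only show up when the caller asks for them explicitly.
        devices = set(self.visible[netns])
        if include_hidden:
            devices |= self.hidden[netns]
        return sorted(devices)


host = Host()
host.add_netns("root")
host.add_netns("container")
host.register_failover("root", "net0", ["net0_virtio", "net0_vf"])
host.move_master("net0", "root", "container")
print(host.list_devices("container"))
print(host.list_devices("container", include_hidden=True))
```

In this model a default listing in the container shows only the master, while an explicit request also reveals the two slaves, which is the "hidden unless asked for" property argued for above.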
> 
>>> 
>>>> The whole point is to let most customers ignore the net-failover slaves and have them remain "hidden" in their dedicated netns.
>>> 
>>> So that makes the common case easy. That is good. My worry is it might
>>> make some uncommon cases impossible.
>>> 
>>>> We won't prevent a customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
>>>> 
>>>>> 
>>>>> Also siwei's patch is then kind of extraneous right?
>>>>> Attempts to rename a slave will now fail as it's in a namespace?
>>>> 
>>>> I'm not sure actually. Isn't udev/systemd netns-aware?
>>>> I would expect it to be able to provide names also to netdevs in a netns other than the default netns.
>>> 
>>> I think most people move devices after they are renamed.
>> 
>> So?
>> Si-Wei's patch handles the issue that results from the fact that the net-failover master will be opened before the rename of the net-failover slaves occurs.
>> This should happen (to my understanding) regardless of network namespaces.
>> 
>> -Liran
> 
> My point was that any tool that moves devices after they
> are renamed will be broken by kernel automatically putting
> them in a namespace.

I'm not sure I follow. How is this related to Si-Wei's patch?
Si-Wei's patch (and the root cause that leads to the issue it fixes) has nothing to do with network namespaces.

What do you mean by a tool that moves devices after they are renamed being broken by the kernel?
Care to give an example to clarify?

-Liran
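
A minimal sketch of the race that Si-Wei's patch addresses, assuming the kernel rejects a rename of a netdev that is already up (modelled here as an `OSError` carrying `EBUSY`). `NetDev`, `open_master`, the `allow_live_rename` switch and the interface names are hypothetical stand-ins, not kernel or udev APIs.

```python
import errno


class NetDev:
    def __init__(self, name):
        self.name = name
        self.up = False

    def rename(self, new_name, allow_live_rename=False):
        # Assumption: renaming an open (up) netdev fails with EBUSY unless
        # renaming while open is permitted, which is what the patch is for.
        if self.up and not allow_live_rename:
            raise OSError(errno.EBUSY, "device is up")
        self.name = new_name


def open_master(master, slaves):
    # Opening the failover master automatically opens its slaves.
    master.up = True
    for slave in slaves:
        slave.up = True


master, vf = NetDev("eth0"), NetDev("eth1")
open_master(master, [vf])      # master is opened before the VF gets renamed

try:
    vf.rename("ens4")          # the late rename loses the race
    renamed = True
except OSError as exc:
    renamed = False
    failure = exc.errno

vf.rename("ens4", allow_live_rename=True)  # works once open slaves may be renamed
```

Nothing in this model depends on which netns the devices live in, which matches the point that the race exists regardless of network namespaces.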
> 
>>> 
>>>> If that's the case, Si-Wei's patch to be able to rename a net-failover slave when it is already open is still required, as the race condition still exists.
>>>> 
>>>> -Liran
>>>> 
>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> MST
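
The second change proposed in the quoted text above, matching the VF to its virtio-net device by a hypervisor-provided PCI slot rather than by MAC address, can be sketched as follows. The `failover_slot` field, the dictionary layout, the addresses and the device names are all hypothetical; the thread only proposes such a property and does not define it.

```python
def match_by_mac(virtio, candidates):
    # MAC-based pairing: any netdev sharing the MAC is treated as the VF.
    return [dev["name"] for dev in candidates if dev["mac"] == virtio["mac"]]


def match_by_slot(virtio, candidates):
    # Proposed pairing: the virtio-net device names the PCI slot where its VF
    # will be hot-plugged, so the MAC is not consulted at all.
    return [dev["name"] for dev in candidates
            if dev["pci_slot"] == virtio["failover_slot"]]


virtio = {"name": "virtio0", "mac": "52:54:00:12:34:56",
          "failover_slot": "0000:00:05.0"}
candidates = [
    {"name": "vf0", "mac": "52:54:00:12:34:56", "pci_slot": "0000:00:05.0"},
    {"name": "other0", "mac": "52:54:00:12:34:56", "pci_slot": "0000:00:06.0"},
]

print(match_by_mac(virtio, candidates))
print(match_by_slot(virtio, candidates))
```

With two devices sharing a MAC, the MAC-based match is ambiguous while the slot-based match selects a single device.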

Michael S. Tsirkin

2019-Mar-21 15:15 UTC

[summary] virtio network device failover writeup

On Thu, Mar 21, 2019 at 04:16:14PM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 15:51, Michael S. Tsirkin <mst at redhat.com> wrote:
> > 
> > On Thu, Mar 21, 2019 at 03:24:39PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 21 Mar 2019, at 15:12, Michael S. Tsirkin <mst at redhat.com> wrote:
> >>> 
> >>> On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
> >>>> 
> >>>> 
> >>>>> On 21 Mar 2019, at 14:57, Michael S. Tsirkin <mst at redhat.com> wrote:
> >>>>> 
> >>>>> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
> >>>>>> 
> >>>>>> 
> >>>>>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin <mst at redhat.com> wrote:
> >>>>>>> 
> >>>>>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
> >>>>>>>>>>>> 2) It brings a non-intuitive customer experience. For example, a customer may attempt to analyse a connectivity issue by checking the connectivity
> >>>>>>>>>>>> on a net-failover slave (e.g. the VF) but will see no connectivity, when in fact checking the connectivity on the net-failover master netdev shows correct connectivity.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> The set of changes I envision to fix our issues are:
> >>>>>>>>>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel. The user can still enter it and manage the netdevs there if they wish to do so explicitly
> >>>>>>>>>>>> (e.g. to configure the net-failover VF slave in some special way).
> >>>>>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC (similar to what is done in NetVSC). E.g. provide a virtio-net interface to get the PCI slot where the matching VF will be hot-plugged by the hypervisor.
> >>>>>>>>>>>> 3) Have an explicit virtio-net control message to command the hypervisor to switch the data path from virtio-net to VF and vice versa, instead of relying on intercepting the PCI master enable-bit
> >>>>>>>>>>>> as an indicator of when the VF is about to be set up (similar to what is done in NetVSC).
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Is there any clear issue we see regarding the above suggestion?
> >>>>>>>>>>>> 
> >>>>>>>>>>>> -Liran
> >>>>>>>>>>> 
> >>>>>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
> >>>>>>>>>>> created by users?
> >>>>>>>>>> 
> >>>>>>>>>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
> >>>>>>>>>> To reference a hidden netns, you need to do it explicitly.
> >>>>>>>>>> Hidden and normal netns names can collide as they will be maintained in different namespaces (yes, I'm overloading the term namespace here...).
> >>>>>>>>> 
> >>>>>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
> >>>>>>>> 
> >>>>>>>> This is also a good idea that will solve the issue. Yes.
> >>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>>> Does this seem reasonable?
> >>>>>>>>>> 
> >>>>>>>>>> -Liran
> >>>>>>>>> 
> >>>>>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
> >>>>>>>>> missed a trick or two.
> >>>>>>>> 
> >>>>>>>> BTW, from a practical point of view, I think that even until we figure out a solution on how to implement this,
> >>>>>>>> it would be better to create a kernel auto-generated name (e.g. "kernel_net_failover_slaves")
> >>>>>>>> that will break only userspace workloads that by a very rare chance have a netns that collides with this, than
> >>>>>>>> the breakage we have today for the various userspace components.
> >>>>>>>> 
> >>>>>>>> -Liran
> >>>>>>> 
> >>>>>>> It seems quite easy to supply that as a module parameter. Do we need two
> >>>>>>> namespaces though? Won't some userspace still be confused by the two
> >>>>>>> slaves sharing the MAC address?
> >>>>>> 
> >>>>>> That's one reasonable option.
> >>>>>> Another one is that we will indeed change the mechanism by which we determine a VF should be bonded with a virtio-net device,
> >>>>>> i.e. expose a new virtio-net property that specifies the PCI slot of the VF to be bonded with.
> >>>>>> 
> >>>>>> The second seems cleaner but I don't have a strong opinion on this. Both seem reasonable to me and your suggestion is faster to implement from the current state of things.
> >>>>>> 
> >>>>>> -Liran
> >>>>> 
> >>>>> OK. Now what happens if master is moved to another namespace? Do we need
> >>>>> to move the slaves too?
> >>>> 
> >>>> No. Why would we move the slaves?
> >>> 
> >>> 
> >>> The reason we have a 3-device model at all is so users can fine tune the
> >>> slaves.
> >> 
> >> I agree.
> >> 
> >>> I don't see why this applies to the root namespace but not
> >>> a container. If it has access to failover it should have access
> >>> to slaves.
> >> 
> >> Oh, now I see your point. I hadn't thought about container usage.
> >> My thinking was that the customer can always just enter the "hidden" netns and configure whatever they want there.
> >> 
> >> Do you have a suggestion how to handle this?
> >> 
> >> One option can be that every "visible" netns on the system will have a "hidden" unnamed netns where the net-failover slaves reside.
> >> If a customer wishes to be able to enter that netns and manage the net-failover slaves explicitly, they will need an updated iproute2
> >> that knows how to enter that hidden netns. Most customers won't ever need to enter that netns, and thus it is OK that they don't
> >> have this updated iproute2.
> > 
> > Right so slaves need to be moved whenever master is moved.
> > 
> > Given the amount of mess involved, should we just teach
> > userspace to create the hidden netns and move slaves there?
> 
> That's a good question.
> 
> However, I believe it is easier and more suitable for this to happen in the kernel. This is because:
> 1) The implementation is generic across all distros.
> 2) We seem to discover more and more issues with userspace as we keep testing this on various distros, configurations and workloads.
> 3) It seems weird that the kernel does some things automagically and not others. I.e. the kernel automatically binds the virtio-net and VF to the net-failover master
> and automatically opens the net-failover slave when the net-failover master is opened, but it doesn't care about the consequences these actions have on userspace.
> Therefore, I propose we go "all in": the kernel should also be responsible for hiding its artefacts unless customer userspace explicitly wants to view and manipulate them.

Just a minor point: the failover device is an artefact of the kernel. The standby and
primary devices are created by the hypervisor.

> > 
> >>> 
> >>>> The whole point is to let most customers ignore the net-failover slaves and have them remain "hidden" in their dedicated netns.
> >>> 
> >>> So that makes the common case easy. That is good. My worry is it might
> >>> make some uncommon cases impossible.
> >>> 
> >>>> We won't prevent a customer from explicitly moving the net-failover slaves out of this netns, but we will not move them out of there automatically.
> >>>> 
> >>>>> 
> >>>>> Also siwei's patch is then kind of extraneous right?
> >>>>> Attempts to rename a slave will now fail as it's in a namespace?
> >>>> 
> >>>> I'm not sure actually. Isn't udev/systemd netns-aware?
> >>>> I would expect it to be able to provide names also to netdevs in a netns other than the default netns.
> >>> 
> >>> I think most people move devices after they are renamed.
> >> 
> >> So?
> >> Si-Wei's patch handles the issue that results from the fact that the net-failover master will be opened before the rename of the net-failover slaves occurs.
> >> This should happen (to my understanding) regardless of network namespaces.
> >> 
> >> -Liran
> > 
> > My point was that any tool that moves devices after they
> > are renamed will be broken by kernel automatically putting
> > them in a namespace.
> 
> I'm not sure I follow. How is this related to Si-Wei's patch?
> Si-Wei's patch (and the root cause that leads to the issue it fixes) has nothing to do with network namespaces.
> 
> What do you mean by a tool that moves devices after they are renamed being broken by the kernel?
> Care to give an example to clarify?
> 
> -Liran

I'll have to get back to you next week when I'm less jetlagged and more
lucid.

> > 
> >>> 
> >>>> If that's the case, Si-Wei's patch to be able to rename a net-failover slave when it is already open is still required, as the race condition still exists.
> >>>> 
> >>>> -Liran
> >>>> 
> >>>>> 
> >>>>>>> 
> >>>>>>> -- 
> >>>>>>> MST
