Sridhar Samudrala
2018-Feb-16 18:11 UTC
[RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
used by the hypervisor to indicate that the virtio_net interface should
act as a backup for another device with the same MAC address.

Patch 2 is in response to the community request for a 3 netdev
solution. However, it creates some issues we'll get into in a moment.
It extends virtio_net to use an alternate datapath when one is
available and registered. When the BACKUP feature is enabled, the
virtio_net driver creates an additional 'bypass' netdev that acts as a
master device and controls 2 slave devices. The original virtio_net
netdev is registered as the 'backup' netdev and a passthru/VF device
with the same MAC gets registered as the 'active' netdev. Both 'bypass'
and 'backup' netdevs are associated with the same 'pci' device. The
user accesses the network interface via the 'bypass' netdev, which
chooses the 'active' netdev as the default for transmits when it is
available with link up and running.

We noticed a couple of issues with this approach during testing:
- As both 'bypass' and 'backup' netdevs are associated with the same
  virtio pci device, udev tries to rename both of them with the same
  name and the 2nd rename will fail. This would be OK as long as the
  first netdev to be renamed is the 'bypass' netdev, but the order in
  which udev gets to rename the 2 netdevs is not reliable.
- When the 'active' netdev is unplugged OR not present on a destination
  system after live migration, the user will see 2 virtio_net netdevs.

Patch 3 refactors much of the changes made in patch 2; this was done on
purpose just to show the solution we recommend as part of one patch
set. If we submit a final version of this, we would combine patches 2
and 3. This patch removes the creation of an additional netdev.
Instead, it uses a new virtnet_bypass_info struct added to the original
'backup' netdev to track the 'bypass' information, and introduces an
additional set of ndo and ethtool ops that are used when the BACKUP
feature is enabled.

One difference between the 3 netdev model and the 2 netdev model is
that the 'bypass' netdev is created with a 'noqueue' qdisc and marked
NETIF_F_LLTX. This avoids going through an additional qdisc and
acquiring an additional qdisc and tx lock during transmits. If we can
replace the qdisc of the virtio netdev dynamically, it should be
possible to get these optimizations even with the 2 netdev model when
the BACKUP feature is enabled.

As this patch series initially focuses on usecases where the hypervisor
fully controls the VM networking and the guest is not expected to
directly configure any hardware settings, it doesn't expose all the
ndo/ethtool ops that are supported by virtio_net at this time. To
support additional usecases, it should be possible to enable additional
ops later by caching the state in the virtio netdev and replaying it
when the 'active' netdev gets registered.

The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a
VF is plugged, the virtio datapath link state can be marked as down. At
the time of live migration, the hypervisor needs to unplug the VF
device from the guest on the source host and reset the MAC filter of
the VF to initiate failover of the datapath to virtio before starting
the migration. After the migration is completed, the destination
hypervisor sets the MAC filter on the VF and plugs it back into the
guest to switch over to the VF datapath.

This patch series is based on the discussion initiated by Jesse on this
thread:
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Sridhar Samudrala (3):
  virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
  virtio_net: Extend virtio to use VF datapath when available
  virtio_net: Enable alternate datapath without creating an additional
    netdev

 drivers/net/virtio_net.c        | 564 +++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/virtio_net.h |   3 +
 2 files changed, 563 insertions(+), 4 deletions(-)

--
2.14.3
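[Editorial note: for readers following the failover sequence described
in the cover letter, the hypervisor-side steps might look roughly like
the following with QEMU's monitor. This is only an illustrative sketch;
the device ids and addresses are hypothetical and the exact flow
depends on the management stack.]

  # source host: unplug the VF so traffic fails over to virtio
  (qemu) device_del hostdev0
  # after the guest acks the unplug, clear the VF's MAC filter on the
  # host, then start the migration
  (qemu) migrate -d tcp:dst-host:4444
  # destination host, once migration completes: program the VF MAC
  # filter and plug the VF back in
  (qemu) device_add vfio-pci,host=42:00.1,id=hostdev0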
Sridhar Samudrala
2018-Feb-16 18:11 UTC
[RFC PATCH v3 1/3] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
This feature bit can be used by the hypervisor to indicate that the
virtio_net device should act as a backup for another device with the
same MAC address.

VIRTIO_NET_F_BACKUP is defined as bit 62 as it is a device feature bit.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala at intel.com>
---
 drivers/net/virtio_net.c        | 2 +-
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 626c27352ae2..bcd13fe906ca 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2920,7 +2920,7 @@ static struct virtio_device_id id_table[] = {
     VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
     VIRTIO_NET_F_CTRL_MAC_ADDR, \
     VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-    VIRTIO_NET_F_SPEED_DUPLEX
+    VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP

 static unsigned int features[] = {
     VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..c7c35fd1a5ed 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
                      * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23   /* Set MAC address */
+#define VIRTIO_NET_F_BACKUP    62   /* Act as backup for another device
+                     * with the same MAC.
+                     */
 #define VIRTIO_NET_F_SPEED_DUPLEX 63    /* Device set linkspeed and duplex */

 #ifndef VIRTIO_NET_NO_LEGACY
--
2.14.3
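[Editorial note: as a usage illustration (not part of the patch), a
consumer in the guest driver would gate the new behaviour on feature
negotiation via the existing virtio_has_feature() helper;
virtnet_setup_bypass() below is a hypothetical name.]

  /* Only set up the bypass machinery when the device offered
   * VIRTIO_NET_F_BACKUP and the bit was negotiated.
   */
  if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
      virtnet_setup_bypass(vi);   /* hypothetical helper */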
Sridhar Samudrala
2018-Feb-16 18:11 UTC
[RFC PATCH v3 2/3] virtio_net: Extend virtio to use VF datapath when available
This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration
of a VM with a direct attached VF without the need to setup a bond/team
between a VF and virtio net device in the guest.

The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a
VF is plugged, the virtio datapath link state can be marked as down.
The hypervisor needs to unplug the VF device from the guest on the
source host and reset the MAC filter of the VF to initiate failover of
datapath to virtio before starting the migration. After the migration
is completed, the destination hypervisor sets the MAC filter on the VF
and plugs it back to the guest to switch over to VF datapath.

When BACKUP feature is enabled, an additional netdev (bypass netdev) is
created that acts as a master device and tracks the state of the 2
lower netdevs. The original virtio_net netdev is marked as 'backup'
netdev and a passthru device with the same MAC is registered as
'active' netdev.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala <sridhar.samudrala at intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck at intel.com>
---
 drivers/net/virtio_net.c | 639 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 638 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bcd13fe906ca..14679806c1b1 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,6 +30,7 @@
 #include <linux/cpu.h>
 #include <linux/average.h>
 #include <linux/filter.h>
+#include <linux/netdevice.h>
 #include <net/route.h>
 #include <net/xdp.h>

@@ -147,6 +148,27 @@ struct receive_queue {
     struct xdp_rxq_info xdp_rxq;
 };

+/* bypass state maintained when BACKUP feature is enabled */
+struct virtnet_bypass_info {
+    /* passthru netdev with same MAC */
+    struct net_device __rcu *active_netdev;
+
+    /* virtio_net netdev */
+    struct net_device __rcu *backup_netdev;
+
+    /* active netdev stats */
+    struct rtnl_link_stats64 active_stats;
+
+    /* backup netdev stats */
+    struct rtnl_link_stats64 backup_stats;
+
+    /* aggregated stats */
+    struct rtnl_link_stats64 bypass_stats;
+
+    /* spinlock while updating stats */
+    spinlock_t stats_lock;
+};
+
 struct virtnet_info {
     struct virtio_device *vdev;
     struct virtqueue *cvq;
@@ -206,6 +228,9 @@ struct virtnet_info {
     u32 speed;

     unsigned long guest_offloads;
+
+    /* upper netdev created when BACKUP feature enabled */
+    struct net_device *bypass_netdev;
 };

 struct padded_vnet_hdr {
@@ -2255,6 +2280,11 @@ static const struct net_device_ops virtnet_netdev = {
     .ndo_features_check = passthru_features_check,
 };

+static bool virtnet_bypass_xmit_ready(struct net_device *dev)
+{
+    return netif_running(dev) && netif_carrier_ok(dev);
+}
+
 static void virtnet_config_changed_work(struct work_struct *work)
 {
     struct virtnet_info *vi
@@ -2647,6 +2677,601 @@ static int virtnet_validate(struct virtio_device *vdev)
     return 0;
 }

+static void
+virtnet_bypass_child_open(struct net_device *dev,
+              struct net_device *child_netdev)
+{
+    int err = dev_open(child_netdev);
+
+    if (err)
+        netdev_warn(dev, "unable to open slave: %s: %d\n",
+                child_netdev->name, err);
+}
+
+static int virtnet_bypass_open(struct net_device *dev)
+{
+    struct virtnet_bypass_info *vbi = netdev_priv(dev);
+    struct net_device *child_netdev;
+
+    netif_carrier_off(dev);
+    netif_tx_wake_all_queues(dev);
+
+    child_netdev = rtnl_dereference(vbi->active_netdev);
+    if (child_netdev)
+        virtnet_bypass_child_open(dev, child_netdev);
+
+    child_netdev = rtnl_dereference(vbi->backup_netdev);
+    if (child_netdev)
+        virtnet_bypass_child_open(dev, child_netdev);
+
+    return 0;
+}
+
+static int virtnet_bypass_close(struct net_device *dev)
+{
+    struct virtnet_bypass_info *vi = netdev_priv(dev);
+    struct net_device *child_netdev;
+
+    netif_tx_disable(dev);
+
+    child_netdev = rtnl_dereference(vi->active_netdev);
+    if (child_netdev)
+        dev_close(child_netdev);
+
+    child_netdev = rtnl_dereference(vi->backup_netdev);
+    if (child_netdev)
+        dev_close(child_netdev);
+
+    return 0;
+}
+
+static netdev_tx_t
+virtnet_bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+    atomic_long_inc(&dev->tx_dropped);
+    dev_kfree_skb_any(skb);
+    return NETDEV_TX_OK;
+}
+
+static netdev_tx_t
+virtnet_bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+    struct virtnet_bypass_info *vbi = netdev_priv(dev);
+    struct net_device *xmit_dev;
+
+    /* Try xmit via active netdev followed by backup netdev */
+    xmit_dev = rcu_dereference_bh(vbi->active_netdev);
+    if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
+        xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
+        if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
+            return virtnet_bypass_drop_xmit(skb, dev);
+    }
+
+    skb->dev = xmit_dev;
+    skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
+
+    return dev_queue_xmit(skb);
+}
+
+static u16
+virtnet_bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
+                void *accel_priv, select_queue_fallback_t fallback)
+{
+    /* This helper function exists to help dev_pick_tx get the correct
+     * destination queue. Using a helper function skips a call to
+     * skb_tx_hash and will put the skbs in the queue we expect on their
+     * way down to the bonding driver.
+     */
+    u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+    /* Save the original txq to restore before passing to the driver */
+    qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+    if (unlikely(txq >= dev->real_num_tx_queues)) {
+        do {
+            txq -= dev->real_num_tx_queues;
+        } while (txq >= dev->real_num_tx_queues);
+    }
+
+    return txq;
+}
+
+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
+ * that some drivers can provide 32bit values only.
+ */
+static void
+virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res,
+              const struct rtnl_link_stats64 *_new,
+              const struct rtnl_link_stats64 *_old)
+{
+    const u64 *new = (const u64 *)_new;
+    const u64 *old = (const u64 *)_old;
+    u64 *res = (u64 *)_res;
+    int i;
+
+    for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
+        u64 nv = new[i];
+        u64 ov = old[i];
+        s64 delta = nv - ov;
+
+        /* detects if this particular field is 32bit only */
+        if (((nv | ov) >> 32) == 0)
+            delta = (s64)(s32)((u32)nv - (u32)ov);
+
+        /* filter anomalies, some drivers reset their stats
+         * at down/up events.
+         */
+        if (delta > 0)
+            res[i] += delta;
+    }
+}
+
+static void
+virtnet_bypass_get_stats(struct net_device *dev,
+             struct rtnl_link_stats64 *stats)
+{
+    struct virtnet_bypass_info *vbi = netdev_priv(dev);
+    const struct rtnl_link_stats64 *new;
+    struct rtnl_link_stats64 temp;
+    struct net_device *child_netdev;
+
+    spin_lock(&vbi->stats_lock);
+    memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
+
+    rcu_read_lock();
+
+    child_netdev = rcu_dereference(vbi->active_netdev);
+    if (child_netdev) {
+        new = dev_get_stats(child_netdev, &temp);
+        virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
+        memcpy(&vbi->active_stats, new, sizeof(*new));
+    }
+
+    child_netdev = rcu_dereference(vbi->backup_netdev);
+    if (child_netdev) {
+        new = dev_get_stats(child_netdev, &temp);
+        virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
+        memcpy(&vbi->backup_stats, new, sizeof(*new));
+    }
+
+    rcu_read_unlock();
+
+    memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
+    spin_unlock(&vbi->stats_lock);
+}
+
+static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
+{
+    struct virtnet_bypass_info *vbi = netdev_priv(dev);
+    struct net_device *child_netdev;
+    int ret = 0;
+
+    child_netdev = rcu_dereference(vbi->active_netdev);
+    if (child_netdev) {
+        ret = dev_set_mtu(child_netdev, new_mtu);
+        if (ret)
+            return ret;
+    }
+
+    child_netdev = rcu_dereference(vbi->backup_netdev);
+    if (child_netdev) {
+        ret = dev_set_mtu(child_netdev, new_mtu);
+        if (ret)
+            netdev_err(child_netdev,
+                   "Unexpected failure to set mtu to %d\n",
+                   new_mtu);
+    }
+
+    dev->mtu = new_mtu;
+    return 0;
+}
+
+static const struct net_device_ops virtnet_bypass_netdev_ops = {
+    .ndo_open       = virtnet_bypass_open,
+    .ndo_stop       = virtnet_bypass_close,
+    .ndo_start_xmit     = virtnet_bypass_start_xmit,
+    .ndo_select_queue   = virtnet_bypass_select_queue,
+    .ndo_get_stats64    = virtnet_bypass_get_stats,
+    .ndo_change_mtu     = virtnet_bypass_change_mtu,
+    .ndo_validate_addr  = eth_validate_addr,
+    .ndo_features_check = passthru_features_check,
+};
+
+static int
+virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev,
+                      struct ethtool_link_ksettings *cmd)
+{
+    struct virtnet_bypass_info *vbi = netdev_priv(dev);
+    struct net_device *child_netdev;
+
+    child_netdev = rtnl_dereference(vbi->active_netdev);
+    if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
+        child_netdev = rtnl_dereference(vbi->backup_netdev);
+        if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
+            cmd->base.duplex = DUPLEX_UNKNOWN;
+            cmd->base.port = PORT_OTHER;
+            cmd->base.speed = SPEED_UNKNOWN;
+
+            return 0;
+        }
+    }
+
+    return __ethtool_get_link_ksettings(child_netdev, cmd);
+}
+
+#define BYPASS_DRV_NAME "virtnet_bypass"
+#define BYPASS_DRV_VERSION "0.1"
+
+static void
+virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev,
+                   struct ethtool_drvinfo *drvinfo)
+{
+    strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
+    strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
+}
+
+static const struct ethtool_ops virtnet_bypass_ethtool_ops = {
+    .get_drvinfo        = virtnet_bypass_ethtool_get_drvinfo,
+    .get_link       = ethtool_op_get_link,
+    .get_link_ksettings = virtnet_bypass_ethtool_get_link_ksettings,
+};
+
+static struct net_device *
+get_virtnet_bypass_bymac(struct net *net, const u8 *mac)
+{
+    struct net_device *dev;
+
+    ASSERT_RTNL();
+
+    for_each_netdev(net, dev) {
+        if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
+            continue;   /* not a virtnet_bypass device */
+
+        if (ether_addr_equal(mac, dev->perm_addr))
+            return dev;
+    }
+
+    return NULL;
+}
+
+static struct net_device *
+get_virtnet_bypass_byref(struct net_device *child_netdev)
+{
+    struct net *net = dev_net(child_netdev);
+    struct net_device *dev;
+
+    ASSERT_RTNL();
+
+    for_each_netdev(net, dev) {
+        struct virtnet_bypass_info *vbi;
+
+        if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
+            continue;   /* not a virtnet_bypass device */
+
+        vbi = netdev_priv(dev);
+
+        if ((rtnl_dereference(vbi->active_netdev) == child_netdev) ||
+            (rtnl_dereference(vbi->backup_netdev) == child_netdev))
+            return dev; /* a match */
+    }
+
+    return NULL;
+}
+
+/* Called when child dev is injecting data into network stack.
+ * Change the associated network device from lower dev to virtio.
+ * note: already called with rcu_read_lock
+ */
+static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb)
+{
+    struct sk_buff *skb = *pskb;
+    struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
+
+    skb->dev = ndev;
+
+    return RX_HANDLER_ANOTHER;
+}
+
+static int virtnet_bypass_register_child(struct net_device *child_netdev)
+{
+    struct virtnet_bypass_info *vbi;
+    struct net_device *dev;
+    bool backup;
+    int ret;
+
+    if (child_netdev->addr_len != ETH_ALEN)
+        return NOTIFY_DONE;
+
+    /* We will use the MAC address to locate the virtnet_bypass netdev
+     * to associate with the child netdev. If we don't find a matching
+     * bypass netdev, move on.
+     */
+    dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
+                       child_netdev->perm_addr);
+    if (!dev)
+        return NOTIFY_DONE;
+
+    vbi = netdev_priv(dev);
+    backup = (child_netdev->dev.parent == dev->dev.parent);
+    if (backup ? rtnl_dereference(vbi->backup_netdev) :
+             rtnl_dereference(vbi->active_netdev)) {
+        netdev_info(dev,
+                "%s attempting to join bypass dev when %s already present\n",
+                child_netdev->name,
+                backup ? "backup" : "active");
+        return NOTIFY_DONE;
+    }
+
+    ret = netdev_rx_handler_register(child_netdev,
+                     virtnet_bypass_handle_frame, dev);
+    if (ret != 0) {
+        netdev_err(child_netdev,
+               "can not register bypass receive handler (err = %d)\n",
+               ret);
+        goto rx_handler_failed;
+    }
+
+    ret = netdev_upper_dev_link(child_netdev, dev, NULL);
+    if (ret != 0) {
+        netdev_err(child_netdev,
+               "can not set master device %s (err = %d)\n",
+               dev->name, ret);
+        goto upper_link_failed;
+    }
+
+    child_netdev->flags |= IFF_SLAVE;
+
+    if (netif_running(dev)) {
+        ret = dev_open(child_netdev);
+        if (ret && (ret != -EBUSY)) {
+            netdev_err(dev, "Opening child %s failed ret:%d\n",
+                   child_netdev->name, ret);
+            goto err_interface_up;
+        }
+    }
+
+    /* Align MTU of child with master */
+    ret = dev_set_mtu(child_netdev, dev->mtu);
+    if (ret) {
+        netdev_err(dev,
+               "unable to change mtu of %s to %u register failed\n",
+               child_netdev->name, dev->mtu);
+        goto err_set_mtu;
+    }
+
+    call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
+
+    netdev_info(dev, "registering %s\n", child_netdev->name);
+
+    dev_hold(child_netdev);
+    if (backup) {
+        rcu_assign_pointer(vbi->backup_netdev, child_netdev);
+        dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
+    } else {
+        rcu_assign_pointer(vbi->active_netdev, child_netdev);
+        dev_get_stats(vbi->active_netdev, &vbi->active_stats);
+        dev->min_mtu = child_netdev->min_mtu;
+        dev->max_mtu = child_netdev->max_mtu;
+    }
+
+    return NOTIFY_OK;
+
+err_set_mtu:
+    dev_close(child_netdev);
+err_interface_up:
+    netdev_upper_dev_unlink(child_netdev, dev);
+    child_netdev->flags &= ~IFF_SLAVE;
+upper_link_failed:
+    netdev_rx_handler_unregister(child_netdev);
+rx_handler_failed:
+    return NOTIFY_DONE;
+}
+
+static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
+{
+    struct virtnet_bypass_info *vbi;
+    struct net_device *dev, *backup;
+
+    dev = get_virtnet_bypass_byref(child_netdev);
+    if (!dev)
+        return NOTIFY_DONE;
+
+    vbi = netdev_priv(dev);
+
+    netdev_info(dev, "unregistering %s\n", child_netdev->name);
+
+    netdev_rx_handler_unregister(child_netdev);
+    netdev_upper_dev_unlink(child_netdev, dev);
+    child_netdev->flags &= ~IFF_SLAVE;
+
+    if (child_netdev->dev.parent == dev->dev.parent) {
+        RCU_INIT_POINTER(vbi->backup_netdev, NULL);
+    } else {
+        RCU_INIT_POINTER(vbi->active_netdev, NULL);
+        backup = rtnl_dereference(vbi->backup_netdev);
+        if (backup) {
+            dev->min_mtu = backup->min_mtu;
+            dev->max_mtu = backup->max_mtu;
+        }
+    }
+
+    dev_put(child_netdev);
+
+    return NOTIFY_OK;
+}
+
+static int virtnet_bypass_update_link(struct net_device *child_netdev)
+{
+    struct net_device *dev, *active, *backup;
+    struct virtnet_bypass_info *vbi;
+
+    dev = get_virtnet_bypass_byref(child_netdev);
+    if (!dev || !netif_running(dev))
+        return NOTIFY_DONE;
+
+    vbi = netdev_priv(dev);
+
+    active = rtnl_dereference(vbi->active_netdev);
+    backup = rtnl_dereference(vbi->backup_netdev);
+
+    if ((active && virtnet_bypass_xmit_ready(active)) ||
+        (backup && virtnet_bypass_xmit_ready(backup))) {
+        netif_carrier_on(dev);
+        netif_tx_wake_all_queues(dev);
+    } else {
+        netif_carrier_off(dev);
+        netif_tx_stop_all_queues(dev);
+    }
+
+    return NOTIFY_OK;
+}
+
+static int
+virtnet_bypass_event(struct notifier_block *this, unsigned long event,
+             void *ptr)
+{
+    struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+
+    /* Skip our own events */
+    if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
+        return NOTIFY_DONE;
+
+    /* Avoid non-Ethernet type devices */
+    if (event_dev->type != ARPHRD_ETHER)
+        return NOTIFY_DONE;
+
+    /* Avoid Vlan dev with same MAC registering as child dev */
+    if (is_vlan_dev(event_dev))
+        return NOTIFY_DONE;
+
+    /* Avoid Bonding master dev with same MAC registering as child dev */
+    if ((event_dev->priv_flags & IFF_BONDING) &&
+        (event_dev->flags & IFF_MASTER))
+        return NOTIFY_DONE;
+
+    switch (event) {
+    case NETDEV_REGISTER:
+        return virtnet_bypass_register_child(event_dev);
+    case NETDEV_UNREGISTER:
+        return virtnet_bypass_unregister_child(event_dev);
+    case NETDEV_UP:
+    case NETDEV_DOWN:
+    case NETDEV_CHANGE:
+        return virtnet_bypass_update_link(event_dev);
+    default:
+        return NOTIFY_DONE;
+    }
+}
+
+static struct notifier_block virtnet_bypass_notifier = {
+    .notifier_call = virtnet_bypass_event,
+};
+
+static int virtnet_bypass_create(struct virtnet_info *vi)
+{
+    struct net_device *backup_netdev = vi->dev;
+    struct device *dev = &vi->vdev->dev;
+    struct net_device *bypass_netdev;
+    int res;
+
+    /* Alloc at least 2 queues, for now we are going with 16 assuming
+     * that most devices being bonded won't have too many queues.
+     */
+    bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
+                      16);
+    if (!bypass_netdev) {
+        dev_err(dev, "Unable to allocate bypass_netdev!\n");
+        return -ENOMEM;
+    }
+
+    dev_net_set(bypass_netdev, dev_net(backup_netdev));
+    SET_NETDEV_DEV(bypass_netdev, dev);
+
+    bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
+    bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
+
+    /* Initialize the device options */
+    bypass_netdev->flags |= IFF_MASTER;
+    bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
+                     IFF_NO_QUEUE;
+    bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
+                       IFF_TX_SKB_SHARING);
+
+    /* don't acquire bypass netdev's netif_tx_lock when transmitting */
+    bypass_netdev->features |= NETIF_F_LLTX;
+
+    /* Don't allow bypass devices to change network namespaces. */
+    bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
+
+    bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
+                     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
+                     NETIF_F_HIGHDMA | NETIF_F_LRO;
+
+    bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+    bypass_netdev->features |= bypass_netdev->hw_features;
+
+    /* For now treat bypass netdev as VLAN challenged since we
+     * cannot assume VLAN functionality with a VF
+     */
+    bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED;
+
+    memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
+           bypass_netdev->addr_len);
+
+    bypass_netdev->min_mtu = backup_netdev->min_mtu;
+    bypass_netdev->max_mtu = backup_netdev->max_mtu;
+
+    res = register_netdev(bypass_netdev);
+    if (res < 0) {
+        dev_err(dev, "Unable to register bypass_netdev!\n");
+        free_netdev(bypass_netdev);
+        return res;
+    }
+
+    netif_carrier_off(bypass_netdev);
+
+    vi->bypass_netdev = bypass_netdev;
+
+    /* Change the name of the backup interface to vbkup0
+     * we may need to revisit naming later but this gets it out
+     * of the way for now.
+     */
+    strcpy(backup_netdev->name, "vbkup%d");
+
+    return 0;
+}
+
+static void virtnet_bypass_destroy(struct virtnet_info *vi)
+{
+    struct net_device *bypass_netdev = vi->bypass_netdev;
+    struct virtnet_bypass_info *vbi;
+    struct net_device *child_netdev;
+
+    /* no device found, nothing to free */
+    if (!bypass_netdev)
+        return;
+
+    vbi = netdev_priv(bypass_netdev);
+
+    netif_device_detach(bypass_netdev);
+
+    rtnl_lock();
+
+    child_netdev = rtnl_dereference(vbi->active_netdev);
+    if (child_netdev)
+        virtnet_bypass_unregister_child(child_netdev);
+
+    child_netdev = rtnl_dereference(vbi->backup_netdev);
+    if (child_netdev)
+        virtnet_bypass_unregister_child(child_netdev);
+
+    unregister_netdevice(bypass_netdev);
+
+    rtnl_unlock();
+
+    free_netdev(bypass_netdev);
+}
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
     int i, err = -ENOMEM;
@@ -2797,10 +3422,15 @@ static int virtnet_probe(struct virtio_device *vdev)

     virtnet_init_settings(dev);

+    if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
+        if (virtnet_bypass_create(vi) != 0)
+            goto free_vqs;
+    }
+
     err = register_netdev(dev);
     if (err) {
         pr_debug("virtio_net: registering device failed\n");
-        goto free_vqs;
+        goto free_bypass;
     }

     virtio_device_ready(vdev);
@@ -2837,6 +3467,8 @@ static int virtnet_probe(struct virtio_device *vdev)
     vi->vdev->config->reset(vdev);

     unregister_netdev(dev);
+free_bypass:
+    virtnet_bypass_destroy(vi);
 free_vqs:
     cancel_delayed_work_sync(&vi->refill);
     free_receive_page_frags(vi);
@@ -2871,6 +3503,8 @@ static void virtnet_remove(struct virtio_device *vdev)

     unregister_netdev(vi->dev);

+    virtnet_bypass_destroy(vi);
+
     remove_vq_common(vi);

     free_netdev(vi->dev);
@@ -2968,6 +3602,8 @@ static __init int virtio_net_driver_init(void)
     ret = register_virtio_driver(&virtio_net_driver);
     if (ret)
         goto err_virtio;
+
+    register_netdevice_notifier(&virtnet_bypass_notifier);
     return 0;
 err_virtio:
     cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
@@ -2980,6 +3616,7 @@ module_init(virtio_net_driver_init);

 static __exit void virtio_net_driver_exit(void)
 {
+    unregister_netdevice_notifier(&virtnet_bypass_notifier);
     unregister_virtio_driver(&virtio_net_driver);
     cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
     cpuhp_remove_multi_state(virtionet_online);
--
2.14.3
Sridhar Samudrala
2018-Feb-16 18:11 UTC
[RFC PATCH v3 3/3] virtio_net: Enable alternate datapath without creating an additional netdev
This patch addresses the issues that were seen with the 3 netdev model
by avoiding the creation of an additional netdev. Instead the bypass
state information is tracked in the original netdev and a different set
of ndo_ops and ethtool_ops are used when BACKUP feature is enabled.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala at intel.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck at intel.com>
---
 drivers/net/virtio_net.c | 283 +++++++++++++++++------------------------------
 1 file changed, 101 insertions(+), 182 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 14679806c1b1..c85b2949f151 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -154,7 +154,7 @@ struct virtnet_bypass_info {
     struct net_device __rcu *active_netdev;

     /* virtio_net netdev */
-    struct net_device __rcu *backup_netdev;
+    struct net_device *backup_netdev;

     /* active netdev stats */
     struct rtnl_link_stats64 active_stats;
@@ -229,8 +229,8 @@ struct virtnet_info {

     unsigned long guest_offloads;

-    /* upper netdev created when BACKUP feature enabled */
-    struct net_device *bypass_netdev;
+    /* bypass state maintained when BACKUP feature is enabled */
+    struct virtnet_bypass_info *vbi;
 };

 struct padded_vnet_hdr {
@@ -2285,6 +2285,22 @@ static bool virtnet_bypass_xmit_ready(struct net_device *dev)
     return netif_running(dev) && netif_carrier_ok(dev);
 }

+static bool virtnet_bypass_active_ready(struct net_device *dev)
+{
+    struct virtnet_info *vi = netdev_priv(dev);
+    struct virtnet_bypass_info *vbi = vi->vbi;
+    struct net_device *active;
+
+    if (!vbi)
+        return false;
+
+    active = rcu_dereference(vbi->active_netdev);
+    if (!active || !virtnet_bypass_xmit_ready(active))
+        return false;
+
+    return true;
+}
+
 static void virtnet_config_changed_work(struct work_struct *work)
 {
     struct virtnet_info *vi
@@ -2312,7 +2328,7 @@ static void virtnet_config_changed_work(struct work_struct *work)
         virtnet_update_settings(vi);
         netif_carrier_on(vi->dev);
         netif_tx_wake_all_queues(vi->dev);
-    } else {
+    } else if (!virtnet_bypass_active_ready(vi->dev)) {
         netif_carrier_off(vi->dev);
         netif_tx_stop_all_queues(vi->dev);
     }
@@ -2501,7 +2517,8 @@ static int virtnet_find_vqs(struct virtnet_info *vi)

     if (vi->has_cvq) {
         vi->cvq = vqs[total_vqs - 1];
-        if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
+        if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN) &&
+            !virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
             vi->dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
     }

@@ -2690,62 +2707,54 @@ virtnet_bypass_child_open(struct net_device *dev,

 static int virtnet_bypass_open(struct net_device *dev)
 {
-    struct virtnet_bypass_info *vbi = netdev_priv(dev);
+    struct virtnet_info *vi = netdev_priv(dev);
+    struct virtnet_bypass_info *vbi = vi->vbi;
     struct net_device *child_netdev;
-
-    netif_carrier_off(dev);
-    netif_tx_wake_all_queues(dev);
+    int err;

     child_netdev = rtnl_dereference(vbi->active_netdev);
     if (child_netdev)
         virtnet_bypass_child_open(dev, child_netdev);

-    child_netdev = rtnl_dereference(vbi->backup_netdev);
-    if (child_netdev)
-        virtnet_bypass_child_open(dev, child_netdev);
+    err = virtnet_open(dev);
+    if (err < 0) {
+        dev_close(child_netdev);
+        return err;
+    }

     return 0;
 }

 static int virtnet_bypass_close(struct net_device *dev)
 {
-    struct virtnet_bypass_info *vi = netdev_priv(dev);
+    struct virtnet_info *vi = netdev_priv(dev);
+    struct virtnet_bypass_info *vbi = vi->vbi;
     struct net_device *child_netdev;

-    netif_tx_disable(dev);
+    virtnet_close(dev);

-    child_netdev = rtnl_dereference(vi->active_netdev);
-    if (child_netdev)
-        dev_close(child_netdev);
+    if (!vbi)
+        goto done;

-    child_netdev = rtnl_dereference(vi->backup_netdev);
+    child_netdev = rtnl_dereference(vbi->active_netdev);
     if (child_netdev)
         dev_close(child_netdev);

+done:
     return 0;
 }

-static netdev_tx_t
-virtnet_bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
-{
-    atomic_long_inc(&dev->tx_dropped);
-    dev_kfree_skb_any(skb);
-    return NETDEV_TX_OK;
-}
-
 static netdev_tx_t
 virtnet_bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
-    struct virtnet_bypass_info *vbi = netdev_priv(dev);
+    struct virtnet_info *vi = netdev_priv(dev);
+    struct virtnet_bypass_info *vbi = vi->vbi;
     struct net_device *xmit_dev;

     /* Try xmit via active netdev followed by backup netdev */
     xmit_dev = rcu_dereference_bh(vbi->active_netdev);
-    if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) {
-        xmit_dev = rcu_dereference_bh(vbi->backup_netdev);
-        if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
-            return virtnet_bypass_drop_xmit(skb, dev);
-    }
+    if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev))
+        return start_xmit(skb, dev);

     skb->dev = xmit_dev;
     skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
@@ -2810,7 +2819,8 @@ static void
 virtnet_bypass_get_stats(struct net_device *dev,
              struct rtnl_link_stats64 *stats)
 {
-    struct virtnet_bypass_info *vbi = netdev_priv(dev);
+    struct virtnet_info *vi = netdev_priv(dev);
+    struct virtnet_bypass_info *vbi = vi->vbi;
     const struct rtnl_link_stats64 *new;
     struct rtnl_link_stats64 temp;
     struct net_device *child_netdev;
@@ -2827,12 +2837,10 @@ virtnet_bypass_get_stats(struct net_device *dev,
         memcpy(&vbi->active_stats, new, sizeof(*new));
     }

-    child_netdev = rcu_dereference(vbi->backup_netdev);
-    if (child_netdev) {
-        new = dev_get_stats(child_netdev, &temp);
-        virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
-        memcpy(&vbi->backup_stats, new, sizeof(*new));
-    }
+    memset(&temp, 0, sizeof(temp));
+    virtnet_stats(vbi->backup_netdev, &temp);
+    virtnet_bypass_fold_stats(stats, &temp, &vbi->backup_stats);
+    memcpy(&vbi->backup_stats, &temp, sizeof(temp));

     rcu_read_unlock();

@@ -2842,7 +2850,8 @@ virtnet_bypass_get_stats(struct net_device *dev,

 static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
 {
-    struct virtnet_bypass_info *vbi = netdev_priv(dev);
+    struct virtnet_info *vi = netdev_priv(dev);
+    struct virtnet_bypass_info *vbi = vi->vbi;
     struct net_device *child_netdev;
     int ret = 0;

@@ -2853,15 +2862,6 @@ static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
             return ret;
     }

-    child_netdev = rcu_dereference(vbi->backup_netdev);
-    if (child_netdev) {
-        ret = dev_set_mtu(child_netdev, new_mtu);
-        if (ret)
-            netdev_err(child_netdev,
-                   "Unexpected failure to set mtu to %d\n",
-                   new_mtu);
-    }
-
     dev->mtu = new_mtu;
     return 0;
 }
@@ -2881,20 +2881,13 @@ static int
 virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev,
                       struct ethtool_link_ksettings *cmd)
 {
-    struct virtnet_bypass_info *vbi = netdev_priv(dev);
+    struct virtnet_info *vi = netdev_priv(dev);
+    struct virtnet_bypass_info *vbi = vi->vbi;
     struct net_device *child_netdev;

     child_netdev = rtnl_dereference(vbi->active_netdev);
-    if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
-        child_netdev = rtnl_dereference(vbi->backup_netdev);
-        if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
-            cmd->base.duplex = DUPLEX_UNKNOWN;
-            cmd->base.port = PORT_OTHER;
-            cmd->base.speed = SPEED_UNKNOWN;
-
-            return 0;
-        }
-    }
+    if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev))
+        return virtnet_get_link_ksettings(dev, cmd);

     return __ethtool_get_link_ksettings(child_netdev, cmd);
 }
@@ -2944,14 +2937,15 @@ get_virtnet_bypass_byref(struct net_device *child_netdev)

     for_each_netdev(net, dev) {
         struct virtnet_bypass_info *vbi;
+        struct virtnet_info *vi;

         if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
             continue;   /* not a virtnet_bypass device */

-        vbi = netdev_priv(dev);
+        vi = netdev_priv(dev);
+        vbi = vi->vbi;

-        if ((rtnl_dereference(vbi->active_netdev) == child_netdev) ||
-            (rtnl_dereference(vbi->backup_netdev) == child_netdev))
+        if (rtnl_dereference(vbi->active_netdev) == child_netdev)
             return dev; /* a match */
     }

@@ -2974,9 +2968,9 @@ static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb)

 static int virtnet_bypass_register_child(struct net_device *child_netdev)
 {
+    struct net_device *dev, *active;
     struct virtnet_bypass_info *vbi;
-    struct net_device *dev;
-    bool backup;
+    struct virtnet_info *vi;
     int ret;

     if (child_netdev->addr_len != ETH_ALEN)
@@ -2991,14 +2985,14 @@ static int virtnet_bypass_register_child(struct net_device *child_netdev)
     if (!dev)
         return NOTIFY_DONE;

-    vbi = netdev_priv(dev);
-    backup = (child_netdev->dev.parent == dev->dev.parent);
-    if (backup ? rtnl_dereference(vbi->backup_netdev) :
-             rtnl_dereference(vbi->active_netdev)) {
+    vi = netdev_priv(dev);
+    vbi = vi->vbi;
+
+    active = rtnl_dereference(vbi->active_netdev);
+    if (active) {
         netdev_info(dev,
                 "%s attempting to join bypass dev when %s already present\n",
-                child_netdev->name,
-                backup ? "backup" : "active");
+                child_netdev->name, active->name);
         return NOTIFY_DONE;
     }

@@ -3030,7 +3024,7 @@ static int virtnet_bypass_register_child(struct net_device *child_netdev)
         }
     }

-    /* Align MTU of child with master */
+    /* Align MTU of child with virtio */
     ret = dev_set_mtu(child_netdev, dev->mtu);
     if (ret) {
         netdev_err(dev,
@@ -3044,15 +3038,10 @@ static int virtnet_bypass_register_child(struct net_device *child_netdev)
     netdev_info(dev, "registering %s\n", child_netdev->name);

     dev_hold(child_netdev);
-    if (backup) {
-        rcu_assign_pointer(vbi->backup_netdev, child_netdev);
-        dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
-    } else {
-        rcu_assign_pointer(vbi->active_netdev, child_netdev);
-        dev_get_stats(vbi->active_netdev, &vbi->active_stats);
-        dev->min_mtu = child_netdev->min_mtu;
-        dev->max_mtu = child_netdev->max_mtu;
-    }
+    rcu_assign_pointer(vbi->active_netdev, child_netdev);
+    dev_get_stats(vbi->active_netdev, &vbi->active_stats);
+    dev->min_mtu = child_netdev->min_mtu;
+    dev->max_mtu = child_netdev->max_mtu;

     return NOTIFY_OK;

@@ -3070,13 +3059,15 @@ static int virtnet_bypass_register_child(struct net_device *child_netdev)
 static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
 {
     struct virtnet_bypass_info *vbi;
-    struct net_device *dev, *backup;
+    struct virtnet_info *vi;
+    struct net_device *dev;

     dev = get_virtnet_bypass_byref(child_netdev);
     if (!dev)
         return NOTIFY_DONE;

-    vbi = netdev_priv(dev);
+    vi = netdev_priv(dev);
+    vbi = vi->vbi;

     netdev_info(dev, "unregistering %s\n", child_netdev->name);

@@ -3084,41 +3075,35 @@ static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
     netdev_upper_dev_unlink(child_netdev, dev);
     child_netdev->flags &= ~IFF_SLAVE;

-    if (child_netdev->dev.parent == dev->dev.parent) {
-        RCU_INIT_POINTER(vbi->backup_netdev, NULL);
-    } else {
-        RCU_INIT_POINTER(vbi->active_netdev, NULL);
-        backup = rtnl_dereference(vbi->backup_netdev);
-        if (backup) {
-            dev->min_mtu = backup->min_mtu;
-            dev->max_mtu = backup->max_mtu;
-        }
-    }
+    RCU_INIT_POINTER(vbi->active_netdev, NULL);
+    dev->min_mtu = MIN_MTU;
+    dev->max_mtu = MAX_MTU;

     dev_put(child_netdev);

+    if (!(vi->status & VIRTIO_NET_S_LINK_UP)) {
+        netif_carrier_off(dev);
+        netif_tx_stop_all_queues(dev);
+    }
+
     return NOTIFY_OK;
 }

 static int virtnet_bypass_update_link(struct net_device *child_netdev)
 {
-    struct net_device *dev, *active, *backup;
-    struct virtnet_bypass_info *vbi;
+    struct virtnet_info *vi;
+    struct net_device *dev;

     dev = get_virtnet_bypass_byref(child_netdev);
-    if (!dev || !netif_running(dev))
+    if (!dev)
         return NOTIFY_DONE;

-    vbi = netdev_priv(dev);
-
-    active = rtnl_dereference(vbi->active_netdev);
-    backup = rtnl_dereference(vbi->backup_netdev);
+    vi = netdev_priv(dev);

-    if ((active && virtnet_bypass_xmit_ready(active)) ||
-        (backup && virtnet_bypass_xmit_ready(backup))) {
+    if (virtnet_bypass_xmit_ready(child_netdev)) {
         netif_carrier_on(dev);
         netif_tx_wake_all_queues(dev);
-    } else {
+    } else if (!(vi->status & VIRTIO_NET_S_LINK_UP)) {
         netif_carrier_off(dev);
         netif_tx_stop_all_queues(dev);
     }
@@ -3169,107 +3154,41 @@ static struct notifier_block virtnet_bypass_notifier = {

 static int virtnet_bypass_create(struct virtnet_info *vi)
 {
-    struct net_device *backup_netdev = vi->dev;
-    struct device *dev = &vi->vdev->dev;
-    struct net_device *bypass_netdev;
-    int res;
+    struct net_device *dev = vi->dev;
+    struct virtnet_bypass_info *vbi;

-    /* Alloc at least 2 queues, for now we are going with 16 assuming
-     * that most devices being bonded won't have too many queues.
-     */
-    bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
-                      16);
-    if (!bypass_netdev) {
-        dev_err(dev, "Unable to allocate bypass_netdev!\n");
+    vbi = kzalloc(sizeof(*vbi), GFP_KERNEL);
+    if (!vbi)
         return -ENOMEM;
-    }
-
-    dev_net_set(bypass_netdev, dev_net(backup_netdev));
-    SET_NETDEV_DEV(bypass_netdev, dev);
-
-    bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
-    bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
-
-    /* Initialize the device options */
-    bypass_netdev->flags |= IFF_MASTER;
-    bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
-                     IFF_NO_QUEUE;
-    bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
-                       IFF_TX_SKB_SHARING);
-
-    /* don't acquire bypass netdev's netif_tx_lock when transmitting */
-    bypass_netdev->features |= NETIF_F_LLTX;
-
-    /* Don't allow bypass devices to change network namespaces. */
-    bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
-
-    bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
-                     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
-                     NETIF_F_HIGHDMA | NETIF_F_LRO;
-
-    bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
-    bypass_netdev->features |= bypass_netdev->hw_features;
-
-    /* For now treat bypass netdev as VLAN challenged since we
-     * cannot assume VLAN functionality with a VF
-     */
-    bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED;
-
-    memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
-           bypass_netdev->addr_len);
-
-    bypass_netdev->min_mtu = backup_netdev->min_mtu;
-    bypass_netdev->max_mtu = backup_netdev->max_mtu;
+    dev->netdev_ops = &virtnet_bypass_netdev_ops;
+    dev->ethtool_ops = &virtnet_bypass_ethtool_ops;

-    res = register_netdev(bypass_netdev);
-    if (res < 0) {
-        dev_err(dev, "Unable to register bypass_netdev!\n");
-        free_netdev(bypass_netdev);
-        return res;
-    }
-
-    netif_carrier_off(bypass_netdev);
-
-    vi->bypass_netdev = bypass_netdev;
-
-    /* Change the name of the backup interface to vbkup0
-     * we may need to revisit naming later but this gets it out
-     * of the way for now.
-     */
-    strcpy(backup_netdev->name, "vbkup%d");
+    vbi->backup_netdev = dev;
+    virtnet_stats(vbi->backup_netdev, &vbi->backup_stats);
+    vi->vbi = vbi;

     return 0;
 }

 static void virtnet_bypass_destroy(struct virtnet_info *vi)
 {
-    struct net_device *bypass_netdev = vi->bypass_netdev;
-    struct virtnet_bypass_info *vbi;
+    struct virtnet_bypass_info *vbi = vi->vbi;
     struct net_device *child_netdev;

-    /* no device found, nothing to free */
-    if (!bypass_netdev)
+    if (!vbi)
         return;

-    vbi = netdev_priv(bypass_netdev);
-
-    netif_device_detach(bypass_netdev);
-
     rtnl_lock();

     child_netdev = rtnl_dereference(vbi->active_netdev);
     if (child_netdev)
         virtnet_bypass_unregister_child(child_netdev);

-    child_netdev = rtnl_dereference(vbi->backup_netdev);
-    if (child_netdev)
-        virtnet_bypass_unregister_child(child_netdev);
-
-    unregister_netdevice(bypass_netdev);
-
     rtnl_unlock();

-    free_netdev(bypass_netdev);
+    kfree(vbi);
+    vi->vbi = NULL;
 }

 static int virtnet_probe(struct virtio_device *vdev)
--
2.14.3
Jakub Kicinski
2018-Feb-17 02:38 UTC
[RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
> Patch 2 is in response to the community request for a 3 netdev
> solution. However, it creates some issues we'll get into in a moment.
> It extends virtio_net to use alternate datapath when available and
> registered. When BACKUP feature is enabled, virtio_net driver creates
> an additional 'bypass' netdev that acts as a master device and controls
> 2 slave devices. The original virtio_net netdev is registered as
> 'backup' netdev and a passthru/vf device with the same MAC gets
> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
> associated with the same 'pci' device. The user accesses the network
> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active'
> netdev as default for transmits when it is available with link up and
> running.

Thank you for doing this.

> We noticed a couple of issues with this approach during testing.
> - As both 'bypass' and 'backup' netdevs are associated with the same
>   virtio pci device, udev tries to rename both of them with the same
>   name and the 2nd rename will fail. This would be OK as long as the
>   first netdev to be renamed is the 'bypass' netdev, but the order in
>   which udev gets to rename the 2 netdevs is not reliable.

Out of curiosity - why do you link the master netdev to the virtio
struct device?

FWIW two solutions that immediately come to mind are to export "backup"
as phys_port_name of the backup virtio link and/or assign a name to the
master like you are doing already. I think team uses team%d and bond
uses bond%d, so soft naming of master devices seems quite natural in
this case.

IMHO phys_port_name == "backup" if the BACKUP bit is set on the slave
virtio link is quite neat.

> - When the 'active' netdev is unplugged OR not present on a destination
>   system after live migration, the user will see 2 virtio_net netdevs.

That's necessary and expected, all configuration applies to the master
so the master must exist.
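[Editorial note: a minimal sketch of the phys_port_name idea suggested
here could look like the following. This is hypothetical and not part
of the posted series; the helper would be wired up as
.ndo_get_phys_port_name on the backup virtio link.]

  static int virtnet_get_phys_port_name(struct net_device *dev,
                                        char *buf, size_t len)
  {
      struct virtnet_info *vi = netdev_priv(dev);

      /* preserve legacy behavior for non-backup virtio devices */
      if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
          return -EOPNOTSUPP;

      if (snprintf(buf, len, "backup") >= len)
          return -EOPNOTSUPP;

      return 0;
  }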
Jakub Kicinski
2018-Feb-17 03:04 UTC
[RFC PATCH v3 2/3] virtio_net: Extend virtio to use VF datapath when available
On Fri, 16 Feb 2018 10:11:21 -0800, Sridhar Samudrala wrote:
> This patch enables virtio_net to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. It allows live migration
> of a VM with a direct attached VF without the need to setup a bond/team
> between a VF and virtio net device in the guest.
>
> The hypervisor needs to enable only one datapath at any time so that
> packets don't get looped back to the VM over the other datapath. When a
> VF is plugged, the virtio datapath link state can be marked as down.
> The hypervisor needs to unplug the VF device from the guest on the
> source host and reset the MAC filter of the VF to initiate failover of
> datapath to virtio before starting the migration. After the migration
> is completed, the destination hypervisor sets the MAC filter on the VF
> and plugs it back to the guest to switch over to VF datapath.
>
> When BACKUP feature is enabled, an additional netdev (bypass netdev) is
> created that acts as a master device and tracks the state of the 2
> lower netdevs. The original virtio_net netdev is marked as 'backup'
> netdev and a passthru device with the same MAC is registered as
> 'active' netdev.
>
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala at intel.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck at intel.com>

> +static void
> +virtnet_bypass_get_stats(struct net_device *dev,
> +             struct rtnl_link_stats64 *stats)
> +{
> +    struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +    const struct rtnl_link_stats64 *new;
> +    struct rtnl_link_stats64 temp;
> +    struct net_device *child_netdev;
> +
> +    spin_lock(&vbi->stats_lock);
> +    memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
> +
> +    rcu_read_lock();
> +
> +    child_netdev = rcu_dereference(vbi->active_netdev);
> +    if (child_netdev) {
> +        new = dev_get_stats(child_netdev, &temp);
> +        virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
> +        memcpy(&vbi->active_stats, new, sizeof(*new));
> +    }
> +
> +    child_netdev = rcu_dereference(vbi->backup_netdev);
> +    if (child_netdev) {
> +        new = dev_get_stats(child_netdev, &temp);
> +        virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
> +        memcpy(&vbi->backup_stats, new, sizeof(*new));
> +    }
> +
> +    rcu_read_unlock();
> +
> +    memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
> +    spin_unlock(&vbi->stats_lock);
> +}
> +
> +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
> +{
> +    struct virtnet_bypass_info *vbi = netdev_priv(dev);
> +    struct net_device *child_netdev;
> +    int ret = 0;
> +
> +    child_netdev = rcu_dereference(vbi->active_netdev);
> +    if (child_netdev) {
> +        ret = dev_set_mtu(child_netdev, new_mtu);
> +        if (ret)
> +            return ret;
> +    }
> +
> +    child_netdev = rcu_dereference(vbi->backup_netdev);
> +    if (child_netdev) {
> +        ret = dev_set_mtu(child_netdev, new_mtu);
> +        if (ret)
> +            netdev_err(child_netdev,
> +                   "Unexpected failure to set mtu to %d\n",
> +                   new_mtu);

You should probably unwind if set fails on one of the legs.

> +    }
> +
> +    dev->mtu = new_mtu;
> +    return 0;
> +}

nit: stats, mtu, all those mundane things are implemented in team
already. If we had this as a kernel-internal team mode we wouldn't have
to reimplement them... You probably did investigate that option; for my
edification, would you mind saying what the challenges/downsides were?

> +static struct net_device *
> +get_virtnet_bypass_bymac(struct net *net, const u8 *mac)
> +{
> +    struct net_device *dev;
> +
> +    ASSERT_RTNL();
> +
> +    for_each_netdev(net, dev) {
> +        if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
> +            continue;   /* not a virtnet_bypass device */

Is there anything inherently wrong with enslaving another virtio dev
now? I was expecting something like a hash map to map MAC addr ->
master and then one can check if dev is already enslaved to that
master. Just a random thought, I'm probably missing something...

> +        if (ether_addr_equal(mac, dev->perm_addr))
> +            return dev;
> +    }
> +
> +    return NULL;
> +}
> +
> +static struct net_device *
> +get_virtnet_bypass_byref(struct net_device *child_netdev)
> +{
> +    struct net *net = dev_net(child_netdev);
> +    struct net_device *dev;
> +
> +    ASSERT_RTNL();
> +
> +    for_each_netdev(net, dev) {
> +        struct virtnet_bypass_info *vbi;
> +
> +        if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
> +            continue;   /* not a virtnet_bypass device */
> +
> +        vbi = netdev_priv(dev);
> +
> +        if ((rtnl_dereference(vbi->active_netdev) == child_netdev) ||
> +            (rtnl_dereference(vbi->backup_netdev) == child_netdev))

nit: parens not needed

> +            return dev; /* a match */
> +    }
> +
> +    return NULL;
> +}

> +static int virtnet_bypass_create(struct virtnet_info *vi)
> +{
> +    struct net_device *backup_netdev = vi->dev;
> +    struct device *dev = &vi->vdev->dev;
> +    struct net_device *bypass_netdev;
> +    int res;
> +
> +    /* Alloc at least 2 queues, for now we are going with 16 assuming
> +     * that most devices being bonded won't have too many queues.
> +     */
> +    bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
> +                      16);
> +    if (!bypass_netdev) {
> +        dev_err(dev, "Unable to allocate bypass_netdev!\n");
> +        return -ENOMEM;
> +    }

Maybe it's just me but referring to the master as bypass seems slightly
confusing. I know you don't like team and bond, but perhaps we can come
up with a better name? For me the bypass device is the other leg, i.e.
the VF, not the master. Perhaps others disagree.

> +    dev_net_set(bypass_netdev, dev_net(backup_netdev));
> +    SET_NETDEV_DEV(bypass_netdev, dev);
> +
> +    bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
> +    bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;

Thanks!
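[Editorial note: one possible shape for the unwind Jakub asks for above
(a hypothetical sketch against the patch, not a posted revision): if
setting the MTU on the second leg fails, restore the first leg to the
master's old MTU before returning the error.]

  static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
  {
      struct virtnet_bypass_info *vbi = netdev_priv(dev);
      struct net_device *active, *backup;
      int ret;

      active = rtnl_dereference(vbi->active_netdev);
      if (active) {
          ret = dev_set_mtu(active, new_mtu);
          if (ret)
              return ret;
      }

      backup = rtnl_dereference(vbi->backup_netdev);
      if (backup) {
          ret = dev_set_mtu(backup, new_mtu);
          if (ret) {
              /* unwind the active leg to the old MTU */
              if (active)
                  dev_set_mtu(active, dev->mtu);
              return ret;
          }
      }

      dev->mtu = new_mtu;
      return 0;
  }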
Alexander Duyck
2018-Feb-17 17:12 UTC
[RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski <kubakici at wp.pl> wrote:
> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>> Patch 2 is in response to the community request for a 3 netdev
>> solution. However, it creates some issues we'll get into in a moment.
>> It extends virtio_net to use alternate datapath when available and
>> registered. When BACKUP feature is enabled, virtio_net driver creates
>> an additional 'bypass' netdev that acts as a master device and controls
>> 2 slave devices. The original virtio_net netdev is registered as
>> 'backup' netdev and a passthru/vf device with the same MAC gets
>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>> associated with the same 'pci' device. The user accesses the network
>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active'
>> netdev as default for transmits when it is available with link up and
>> running.
>
> Thank you for doing this.
>
>> We noticed a couple of issues with this approach during testing.
>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>   virtio pci device, udev tries to rename both of them with the same
>>   name and the 2nd rename will fail. This would be OK as long as the
>>   first netdev to be renamed is the 'bypass' netdev, but the order in
>>   which udev gets to rename the 2 netdevs is not reliable.
>
> Out of curiosity - why do you link the master netdev to the virtio
> struct device?

The basic idea of all this is that we wanted this to work with an
existing VM image that was using virtio. As such we were trying to make
it so that the bypass interface takes the place of the original virtio
and get udev to rename the bypass to what the original virtio_net was.

> FWIW two solutions that immediately come to mind are to export "backup"
> as phys_port_name of the backup virtio link and/or assign a name to the
> master like you are doing already. I think team uses team%d and bond
> uses bond%d, so soft naming of master devices seems quite natural in
> this case.

I figured I had overlooked something like that.. Thanks for pointing
this out.

Okay, so I think the phys_port_name approach might resolve the original
issue. If I am reading things correctly, what we end up with is the
master showing up as "ens1" for example and the backup showing up as
"ens1nbackup". Am I understanding that right?

The problem with the team/bond%d approach is that it creates a new
netdevice and so it would require guest configuration changes.

> IMHO phys_port_name == "backup" if the BACKUP bit is set on the slave
> virtio link is quite neat.

I agree. For non-"backup" virtio_net devices would it be okay for us to
just return -EOPNOTSUPP? I assume it would be, and that way the legacy
behavior could be maintained although the function still exists.

>> - When the 'active' netdev is unplugged OR not present on a destination
>>   system after live migration, the user will see 2 virtio_net netdevs.
>
> That's necessary and expected, all configuration applies to the master
> so the master must exist.

With the naming issue resolved this is the only item left outstanding.
This becomes a matter of form vs function.

The main complaint about the "3 netdev" solution is that it is a bit
confusing to have the 2 netdevs present if the VF isn't there. The idea
is that having the extra "master" netdev there if there isn't really a
bond is a bit ugly.

The downside of the "2 netdev" solution is that you have to deal with
an extra layer of locking/queueing to get to the VF, and you lose some
functionality since things like in-driver XDP have to be disabled in
order to maintain the same behavior whether the VF is present or not.
However it looks more like classic virtio_net when the VF is not
present.
Jiri Pirko
2018-Feb-20 10:42 UTC
[RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala at intel.com wrote:
> Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
> used by hypervisor to indicate that virtio_net interface should act as
> a backup for another device with the same MAC address.
>
> Patch 2 is in response to the community request for a 3 netdev
> solution. However, it creates some issues we'll get into in a moment.
> It extends virtio_net to use alternate datapath when available and
> registered. When BACKUP feature is enabled, virtio_net driver creates
> an additional 'bypass' netdev that acts as a master device and controls
> 2 slave devices. The original virtio_net netdev is registered as
> 'backup' netdev and a passthru/vf device with the same MAC gets
> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
> associated with the same 'pci' device. The user accesses the network
> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active'
> netdev as default for transmits when it is available with link up and
> running.

Sorry, but this is ridiculous. You are apparently re-implementing part
of the bonding driver as a part of the NIC driver. The bond and team
drivers are mature solutions, well tested, broadly used, with lots of
issues resolved in the past. What you try to introduce is a weird
shortcut that already has a couple of issues as you mentioned and will
certainly have many more. Also, I'm pretty sure that in future, someone
will come up with ideas like multiple VFs, LACP and similar bonding
things.

What is the reason for this abomination? According to:
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
The reason is quite weak. The user in the VM sees 2 (or more)
netdevices, he puts them in a bond/team and that's it. This works now!
If the VM lacks some userspace features, let's fix it there! For
example the MAC changes are something that could be easily handled in
the teamd userspace daemon.
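[Editorial note: for reference, the in-guest configuration Jiri is
pointing at is the standard active-backup bond; with iproute2 it is a
few commands (interface names are examples):]

  ip link add bond0 type bond mode active-backup miimon 100
  ip link set eth0 master bond0    # virtio_net leg
  ip link set eth1 master bond0    # VF leg
  ip link set bond0 up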
Alexander Duyck
2018-Feb-20 16:04 UTC
[RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <jiri at resnulli.us> wrote:
> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudrala at intel.com wrote:
>> Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>> used by hypervisor to indicate that virtio_net interface should act as
>> a backup for another device with the same MAC address.
>>
>> Patch 2 is in response to the community request for a 3 netdev
>> solution. However, it creates some issues we'll get into in a moment.
>> It extends virtio_net to use alternate datapath when available and
>> registered. When BACKUP feature is enabled, virtio_net driver creates
>> an additional 'bypass' netdev that acts as a master device and controls
>> 2 slave devices. The original virtio_net netdev is registered as
>> 'backup' netdev and a passthru/vf device with the same MAC gets
>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>> associated with the same 'pci' device. The user accesses the network
>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active'
>> netdev as default for transmits when it is available with link up and
>> running.
>
> Sorry, but this is ridiculous. You are apparently re-implementing part
> of the bonding driver as a part of the NIC driver. The bond and team
> drivers are mature solutions, well tested, broadly used, with lots of
> issues resolved in the past. What you try to introduce is a weird
> shortcut that already has a couple of issues as you mentioned and will
> certainly have many more. Also, I'm pretty sure that in future, someone
> will come up with ideas like multiple VFs, LACP and similar bonding
> things.

The problem with the bond and team drivers is they are too large and
have too many interfaces available for configuration, so as a result
they can really screw this interface up. Essentially this is meant to
be a bond that is more or less managed by the host, not the guest. We
want the host to be able to configure it and have it automatically kick
in on the guest. For now we want to avoid adding too much complexity as
this is meant to be just the first step. Trying to go in and implement
the whole solution right from the start based on existing drivers is
going to be a massive time sink and will likely never get completed due
to the fact that there is always going to be some other thing that will
interfere.

My personal hope is that we can look at doing a virtio-bond sort of
device that will handle all this as well as providing a communication
channel, but that is much further down the road. For now we only have a
single bit, so the goal for now is trying to keep this as simple as
possible.

> What is the reason for this abomination? According to:
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
> The reason is quite weak. The user in the VM sees 2 (or more)
> netdevices, he puts them in a bond/team and that's it. This works now!
> If the VM lacks some userspace features, let's fix it there! For
> example the MAC changes are something that could be easily handled in
> the teamd userspace daemon.

I think you might have missed the point of this. This is meant to be a
simple interface, so the guest should not be able to change the MAC
address, and it shouldn't require any userspace daemon to set up or
tear down.

Ideally with this solution the virtio bypass will come up and be
assigned the name of the original virtio, and the "backup" interface
will come up and be assigned the name of the original virtio with an
additional "nbackup" tacked on via the phys_port_name. Then, whenever a
VF is added, it will automatically be enslaved by the bypass interface,
and it will be released when the VF is hotplugged out.

In my mind the difference between this and bond or team is where the
configuration interface lies. In the case of bond it is in the kernel.
If my understanding is correct, team is mostly in user space. With
this, the configuration interface is really down in the hypervisor and
requests are communicated up to the guest. I would prefer not to make
virtio_net dependent on the bonding or team drivers, or worse yet a
userspace daemon in the guest. For now I would argue we should keep
this as simple as possible just to support basic live migration. There
have already been discussions of refactoring this after it is in so
that we can start to combine the functionality here with what is there
in bonding/team, but the differences in configuration interface and the
size of the code bases will make it challenging to outright merge this
into something like that.