thr3ads.net - Linux Virtualization - [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature [Oct 2023]

If this information is useful, please help other people find it:
Share via:
Jason Wang
2023-Oct-10 06:00 UTC
[RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On Tue, Oct 10, 2023 at 1:51?PM Akihiko Odaki <akihiko.odaki at
daynix.com> wrote:>
> On 2023/10/10 14:45, Jason Wang wrote:
> > On Tue, Oct 10, 2023 at 9:52?AM Akihiko Odaki <akihiko.odaki at
daynix.com> wrote:
> >>
> >> On 2023/10/09 19:44, Willem de Bruijn wrote:
> >>> On Mon, Oct 9, 2023 at 3:12?AM Akihiko Odaki <akihiko.odaki
at daynix.com> wrote:
> >>>>
> >>>> On 2023/10/09 19:06, Willem de Bruijn wrote:
> >>>>> On Mon, Oct 9, 2023 at 3:02?AM Akihiko Odaki
<akihiko.odaki at daynix.com> wrote:
> >>>>>>
> >>>>>> On 2023/10/09 18:57, Willem de Bruijn wrote:
> >>>>>>> On Mon, Oct 9, 2023 at 3:57?AM Akihiko Odaki
<akihiko.odaki at daynix.com> wrote:
> >>>>>>>>
> >>>>>>>> On 2023/10/09 17:04, Willem de Bruijn
wrote:
> >>>>>>>>> On Sun, Oct 8, 2023 at 3:46?PM Akihiko
Odaki <akihiko.odaki at daynix.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 2023/10/09 5:08, Willem de
Bruijn wrote:
> >>>>>>>>>>> On Sun, Oct 8, 2023 at
10:04?PM Akihiko Odaki <akihiko.odaki at daynix.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 2023/10/09 4:07, Willem
de Bruijn wrote:
> >>>>>>>>>>>>> On Sun, Oct 8, 2023 at
7:22?AM Akihiko Odaki <akihiko.odaki at daynix.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> virtio-net have
two usage of hashes: one is RSS and another is hash
> >>>>>>>>>>>>>> reporting.
Conventionally the hash calculation was done by the VMM.
> >>>>>>>>>>>>>> However, computing
the hash after the queue was chosen defeats the
> >>>>>>>>>>>>>> purpose of RSS.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Another approach
is to use eBPF steering program. This approach has
> >>>>>>>>>>>>>> another downside:
it cannot report the calculated hash due to the
> >>>>>>>>>>>>>> restrictive nature
of eBPF.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Introduce the code
to compute hashes to the kernel in order to overcome
> >>>>>>>>>>>>>> thse challenges.
An alternative solution is to extend the eBPF steering
> >>>>>>>>>>>>>> program so that it
will be able to report to the userspace, but it makes
> >>>>>>>>>>>>>> little sense to
allow to implement different hashing algorithms with
> >>>>>>>>>>>>>> eBPF since the
hash value reported by virtio-net is strictly defined by
> >>>>>>>>>>>>>> the specification.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The hash value
already stored in sk_buff is not used and computed
> >>>>>>>>>>>>>> independently
since it may have been computed in a way not conformant
> >>>>>>>>>>>>>> with the
specification.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Signed-off-by:
Akihiko Odaki <akihiko.odaki at daynix.com>
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> +static const
struct tun_vnet_hash_cap tun_vnet_hash_cap = {
> >>>>>>>>>>>>>> +      
.max_indirection_table_length >
>>>>>>>>>>>>>> +              
TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +       .types =
VIRTIO_NET_SUPPORTED_HASH_TYPES
> >>>>>>>>>>>>>> +};
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> No need to have
explicit capabilities exchange like this? Tun either
> >>>>>>>>>>>>> supports all or none.
> >>>>>>>>>>>>
> >>>>>>>>>>>> tun does not support
VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
> >>>>>>>>>>>>
VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It is because the flow
dissector does not support IPv6 extensions. The
> >>>>>>>>>>>> specification is also
vague, and does not tell how many TLVs should be
> >>>>>>>>>>>> consumed at most when
interpreting destination option header so I chose
> >>>>>>>>>>>> to avoid adding code for
these hash types to the flow dissector. I doubt
> >>>>>>>>>>>> anyone will complain about
it since nobody complains for Linux.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm also adding this
so that we can extend it later.
> >>>>>>>>>>>>
max_indirection_table_length may grow for systems with 128+ CPUs, or
> >>>>>>>>>>>> types may have other bits
for new protocols in the future.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>               
case TUNSETSTEERINGEBPF:
> >>>>>>>>>>>>>> -              
ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >>>>>>>>>>>>>> +              
bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >>>>>>>>>>>>>> +               if
(IS_ERR(bpf_ret))
> >>>>>>>>>>>>>> +                 
ret = PTR_ERR(bpf_ret);
> >>>>>>>>>>>>>> +              
else if (bpf_ret)
> >>>>>>>>>>>>>> +                 
tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Don't make one
feature disable another.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> TUNSETSTEERINGEBPF and
TUNSETVNETHASH are mutually exclusive
> >>>>>>>>>>>>> functions. If one is
enabled the other call should fail, with EBUSY
> >>>>>>>>>>>>> for instance.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> +       case
TUNSETVNETHASH:
> >>>>>>>>>>>>>> +              
len = sizeof(vnet_hash);
> >>>>>>>>>>>>>> +               if
(copy_from_user(&vnet_hash, argp, len)) {
> >>>>>>>>>>>>>> +                 
ret = -EFAULT;
> >>>>>>>>>>>>>> +                 
break;
> >>>>>>>>>>>>>> +               }
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +               if
(((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
> >>>>>>>>>>>>>> +                 
(tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
> >>>>>>>>>>>>>> +                 
!tun_is_little_endian(tun))) ||
> >>>>>>>>>>>>>> +                 
vnet_hash.indirection_table_mask >>
>>>>>>>>>>>>>> +                   
TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
> >>>>>>>>>>>>>> +                 
ret = -EINVAL;
> >>>>>>>>>>>>>> +                 
break;
> >>>>>>>>>>>>>> +               }
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +              
argp = (u8 __user *)argp + len;
> >>>>>>>>>>>>>> +              
len = (vnet_hash.indirection_table_mask + 1) * 2;
> >>>>>>>>>>>>>> +               if
(copy_from_user(vnet_hash_indirection_table, argp, len)) {
> >>>>>>>>>>>>>> +                 
ret = -EFAULT;
> >>>>>>>>>>>>>> +                 
break;
> >>>>>>>>>>>>>> +               }
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +              
argp = (u8 __user *)argp + len;
> >>>>>>>>>>>>>> +              
len = virtio_net_hash_key_length(vnet_hash.types);
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +               if
(copy_from_user(vnet_hash_key, argp, len)) {
> >>>>>>>>>>>>>> +                 
ret = -EFAULT;
> >>>>>>>>>>>>>> +                 
break;
> >>>>>>>>>>>>>> +               }
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Probably easier and
less error-prone to define a fixed size control
> >>>>>>>>>>>>> struct with the max
indirection table size.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I made its size variable
because the indirection table and key may grow
> >>>>>>>>>>>> in the future as I wrote
above.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Btw: please trim the
CC: list considerably on future patches.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'll do so in the next
version with the TUNSETSTEERINGEBPF change you
> >>>>>>>>>>>> proposed.
> >>>>>>>>>>>
> >>>>>>>>>>> To be clear: please don't
just resubmit with that one change.
> >>>>>>>>>>>
> >>>>>>>>>>> The skb and cb issues are
quite fundamental issues that need to be resolved.
> >>>>>>>>>>>
> >>>>>>>>>>> I'd like to understand why
adjusting the existing BPF feature for this
> >>>>>>>>>>> exact purpose cannot be
amended to return the key it produced.
> >>>>>>>>>>
> >>>>>>>>>> eBPF steering program is not
designed for this particular problem in my
> >>>>>>>>>> understanding. It was introduced
to derive hash values with an
> >>>>>>>>>> understanding of
application-specific semantics of packets instead of
> >>>>>>>>>> generic IP/TCP/UDP semantics.
> >>>>>>>>>>
> >>>>>>>>>> This problem is rather different
in terms that the hash derivation is
> >>>>>>>>>> strictly defined by virtio-net. I
don't think it makes sense to
> >>>>>>>>>> introduce the complexity of BPF
when you always run the same code.
> >>>>>>>>>>
> >>>>>>>>>> It can utilize the existing flow
dissector and also make it easier to
> >>>>>>>>>> use for the userspace by
implementing this in the kernel.
> >>>>>>>>>
> >>>>>>>>> Ok. There does appear to be overlap in
functionality. But it might be
> >>>>>>>>> easier to deploy to just have standard
Toeplitz available without
> >>>>>>>>> having to compile and load an eBPF
program.
> >>>>>>>>>
> >>>>>>>>> As for the sk_buff and cb[] changes.
The first is really not needed.
> >>>>>>>>> sk_buff simply would not scale if
every edge case needs a few bits.
> >>>>>>>>
> >>>>>>>> An alternative is to move the bit to cb[]
and clear it for every code
> >>>>>>>> paths that lead to ndo_start_xmit(), but
I'm worried that it is error-prone.
> >>>>>>>>
> >>>>>>>> I think we can put the bit in sk_buff for
now. We can implement the
> >>>>>>>> alternative when we are short of bits.
> >>>>>>>
> >>>>>>> I disagree. sk_buff fields add a cost to every
code path. They cannot
> >>>>>>> be added for every edge case.
> >>>>>>
> >>>>>> It only takes an unused bit and does not grow the
sk_buff size so I
> >>>>>> think it has practically no cost for now.
> >>>>>
> >>>>> The problem is that that thinking leads to death by a
thousand cuts.
> >>>>>
> >>>>> "for now" forces the cost of having to think
hard how to avoid growing
> >>>>> sk_buff onto the next person. Let's do it right
from the start.
> >>>>
> >>>> I see. I described an alternative to move the bit to cb[]
and clear it
> >>>> in all code paths that leads to ndo_start_xmit() earlier.
Does that
> >>>> sound good to you?
> >>>
> >>> If you use the control block to pass information between
> >>> __dev_queue_xmit on the tun device and tun_net_xmit, using
gso_skb_cb,
> >>> the field can be left undefined in all non-tun paths.
tun_select_queue
> >>> can initialize.
> >>
> >> The problem is that tun_select_queue() is not always called.
> >> netdev_core_pick_tx() ensures dev->real_num_tx_queues != 1
before
> >> calling it, but this variable may change later and result in a
race
> >> condition. Another case is that XDP with predefined queue.
> >>
> >>>
> >>> I would still use skb->hash to encode the hash. That hash
type of that
> >>> field is not strictly defined. It can be siphash from
___skb_get_hash
> >>> or a device hash, which most likely also uses Toeplitz. Then
you also
> >>> don't run into the problem of growing the struct size.
> >>
> >> I'm concerned exactly because it's not strictly defined.
Someone may
> >> decide to overwrite it later if we are not cautious enough.
qdisc_skb_cb
> >> also has sufficient space to contain both of the hash value and
type.
> >
> > How about using skb extensions?
>
> I think it will work. I'll try it in the next version.
Btw, I still think using eBPF for hash might be better.

Though the hashing rule is defined in the spec, it may be extended in
the future. For example, several extensions has been proposed:

1) RSS context
2) encapsulated packet hashing

Thanks
>
Linux Virtualization - Oct 2023 - [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

[RFC PATCH 5/7] tun: Introduce virtio-net hashing feature