On 2019/06/16 4:19, Johannes Berg wrote:> Hi Alexei,
>
> Sorry for messing up your address btw, not sure where I dug that one
> up...
>
>>> 1) Make the bridge code use skb->mac_len instead of ETH_HLEN.
This
>>> works for this particular case, but breaks some other cases;
>>> evidently some places exist where skb->mac_len isn't
even set to
>>> ETH_HLEN when a packet gets to the bridge. I don't know
right now
>>> what that was, I think probably somebody who's CC'ed
reported that.
>>>
>>> 2) Let tc_act_vlan() just pull ETH_HLEN instead of
skb->mac_len, but
>>> this is rather asymmetric and strange, and while it works for
this
>>> case it may cause confusion elsewhere.
>>>
>>> 2b) Toshiaki said it might be better to make that code *remember*
>>> mac_len and then use it to push and pull (so not caring about
the
>>> change made by skb_vlan_push()), but that ultimately leads to
>>> confusion and if you have TC push/pop combinations things
just get
>>> completely out of sync and die
>>>
>>> 3) Make skb_vlan_push()/_pop() just not change mac_len at all. So
far
>>> this also addresses the issue, but it's likely that this
will break
>>> OVS, and I don't know how it'd affect BPF. Quite
possibly like TC
>>> does and is broken, but perhaps not.
>>>
>>>
>>> But now we're stuck. Depending on how you look at it, all of
these seem
>>> sort of reasonable, or not.
>>>
>>> Ultimately, the issue seems to be that we couldn't really
decide whether
>>> VLAN tags (and probably MPLS tags, for that matter) are covered by
>>> mac_len or not. At least not consistently on ingress and egress.
>>> eth_type_trans() doesn't take them into account, so of course
on simple
>>> ingress mac_len will only cover the ETH_HLEN.
>>>
>>> If you have an accelerated tag and then push it into the SKB, it
will
>>> *not* be taken into account in mac_len. OTOH, if you have a new tag
and
>>> use skb_vlan_push() then it *will* be taken into account.
>>>
>>>
>>> I'm trending towards solution (3), because if we consider other
>>> combinations of VLAN push/pop in TC, I think we can end up in a
very
>>> messy situation today. For example, POP/PUSH seems like it should
be a
>>> no-op, but it isn't due to the mac_len, *unless* it can use the
HW accel
>>> only (i.e. only a single tag).
>>>
>>> I think then to propose such a patch though we'd have to figure
out
>>> where the BPF case is, and to keep OVS working probably either add
an
>>> argument ("bool adjust_mac_len") to the function
signatures, or just do
>>> the adjustments in OVS code after calling them?
>>>
>>>
>>> Any other thoughts?
>>
>> imo skb_vlan_push() should still change mac_len.
>> tc, ovs, bpf use it and expect vlan to be part of L2.
>
> I'm not sure tc really cares, but it *is* a reasonable argument to
make.
>
> Like I said, whichever way I look at the problem, a different solution
> looks more reasonable ;-)
>
>> There is nothing between L2 and L3 :)
>> Hence we cannot say that vlan is not part of L2.
>> Hence push/pop vlan must change mac_len, since skb->mac_len
>> is kernel's definition of the length of L2 header.
>
> I think we're getting to something here now. I actually thought about
> this some more last night, and basically asked myself how I would design
> it without all the legacy baggage.
>
> I'm certainly not suggesting we should change anything here, but to me
> it was a bit of a clarification to do this and then see where we differ
> in our handling today.
>
> Thinking along those lines, I sort of ended up with the following scheme
> (just for the skb head, not the frags/fraglist):
>
> +------------------+----------------+---------------+
> headroom | eth | vlan | ... | IP | TCP | payload | tailroom
> +------------------+----------------+---------------+
> ^ skb->head_ptr
> ^ skb->l2_ptr
> ^ skb->l3_ptr == skb->l2_ptr +
skb->l2_len
> ...
> ^ skb->payload_ptr
> ^
skb->tail
>
> Now, I deliberately didn't put any "skb->data" here,
because what we do
> today is sort of confusing.
>
> By getting rid of the "multi-use" skb->data in this scheme I
think it
> becomes clearer to think about.
>
>
>
> On *egress*, all we really care about is this:
>
> +------------------+----------------+---------------+
> headroom | eth | vlan | ... | IP / TCP | payload | tailroom
> +------------------+----------------+---------------+
>
> ^ skb->data
> skb->data + skb->len
> ^
>
> On *ingress*, however, we hide some of the data (temporarily):
>
> |--------- headroom ---------|
> +------------------+----------------+---------------+
> | eth | vlan | ... | IP / TCP | payload | tailroom
> +------------------+----------------+---------------+
> ^ skb->data - skb->mac_len
> ^ skb->data
> skb->data + skb->len ^
>
> which is somewhat confusing to me, and sort of causes all these
> problems.
>
> (It also makes it harder to reason about what data is actually valid in
> the skb, although if mac_len is non-zero then it must be, but it means
> you actually have less headroom and all).
>
> If instead we just made it like the hypothetical scheme I outlined
> above, then on traversing a layer we'd set the next layer pointer
> appropriately, and then each layer would just use the right pointer:
>
> * bridge/ethernet driver/... would use l2_ptr
> * IP would use the l3_ptr
> * TCP would use the l4_ptr (didn't put that into the picture)
> * ...
>
> Now we wouldn't have a problem with the VLAN tags, because we'd
just
> appropriate set/keep all the pointers - bridge doesn't even care where
> l3_ptr is pointing, but for IP it would of course point to beyond the
> VLAN tags.
>
> (Now, if you wanted to implement this, you probably wouldn't have
l2_ptr
> but l2_offset etc. but that's an implementation detail.)
>
> Now, why am I writing all this? Because I think it points out that
> you're absolutely right - we should treat mac_len as part of the frame
> if we're in anything that cares about L2 like bridge.
>
>> Now as far as bridge... I think it's unfortunate that linux
>> adopted 'vlan' as a netdevice model and that's where I
think
>> the problem is.
>
> I'm not sure. I don't exactly know where the problem is if we fix
bridge
> according to the patch (1) above, which, btw, was discussed before:
> https://lore.kernel.org/netdev/20190113135939.8970-1-zahari.doychev at
linux.com/
>
> Back then, Nikolay (whom I forgot to CC, fixed now) said:
>
>> It breaks connectivity between bridge and
>> members when vlans are used. The host generated packets going out of
the bridge
>> have mac_len = 0.
>
> Which probably indicates that we're not even consistent with the egress
> scheme I pointed out above, probably because we *also* have
> hard_header_len?
>
> Maybe somewhere early in the egress path we should set skb->mac_len to
> dev->hard_header_len, and then use skb->mac_len consistently, and
> consider that part of the skb (and not arbitrarily consider ETH_HLEN to
> be part of the skb in bridge).
>
> (This almost tempts me to actually try to implement the hypothetical SKB
> scheme I described above, just so it's easier to understand what part
> does what ... and to find where the issues like this occur)
>
>> Typical bridge in the networking industry is a device that
>> does forwarding based on L2. Which includes vlans.
>> And imo that's the most appropriate way of configuring and thinking
>> about bridge functionality.
>> Whereas in the kernel there is a 'vlan' netdevice that
'eats'
>> vlan tag and pretends that the rest is the same.
>> So linux bridge kinda doesn't need to be vlan aware.
>> CONFIG_BRIDGE_VLAN_FILTERING was the right step, but I haven't
>> seen it being used and I'm not sure about state of things there.
>
> I think that ends up being a question of semantics. You can consider an
> "industry bridge" that you describe to be a combination of VLAN
and
> bridge netdevs, and so it's just a question of what exactly you
consider
> a "bridge" - does it have to be a single netdev or not.
>
>> So your option 1 above is imo the best. The bridge needs to deal
>> with skb->mac_len and full L2 header.
>
> Yeah, I guess. We're back to square 1 ;-)
I'll try to explain the problem I see, which cannot be fixed by option 1...
The bug is in tcf_vlan_act(), and mainly in skb->data, not in mac_len.
Consider about vlan packets from NIC, but non-hw-accelerated, where
vlan devices are configured to receive them.
When __netif_receive_skb_core() is called, skb is like this.
+-----+------+--------
| eth | vlan | TCP/IP
+-----+------+--------
^
data
skb->data is at the beginning of the vlan header.
This is reasonable because we did not process the vlan tag at this point.
Then after vlan_do_receive() (receive the skb on a vlan device), the skb is like
this.
+-----+--------
| eth | TCP/IP
+-----+--------
^
data
Or if reorder_hdr is off (which does not remove vlan tags when receiving on vlan
devices),
+-----+------+--------
| eth | vlan | TCP/IP
+-----+------+--------
^
data
Relying on this mechanism, we are currently able to handle multiple vlan tags.
For example if we have 2 tags,
- On __netif_receive_skb_core() invocation
+-----+------+------+--------
| eth | vlan | vlan | TCP/IP
+-----+------+------+--------
^
data
- After first vlan_do_receive()
+-----+------+--------
| eth | vlan | TCP/IP
+-----+------+--------
^
data
Or if reorder_hdr is off,
+-----+------+------+--------
| eth | vlan | vlan | TCP/IP
+-----+------+------+--------
^
data
When we process one tag, the data goes forward by one tag.
Now looking at TC vlan case...
After it inserts two tags, the skb looks like:
(The first tag is in vlan_tci)
+-----+------+--------
| eth | vlan | TCP/IP
+-----+------+--------
^
data
The data pointer went forward before we process it.
This is apparently wrong. I think we don't want to (or cannot?) handle cases
like this
after tcf_vlan_act(). This is why I said we should remember mac_len there.
So, my opinion is:
On ingress, data pointer can be at the end of vlan header and mac_len probably
should
include vlan tag length, but only after the vlan tag is processed.
Bridge may need to handle mac_len that is not equal to ETH_HLEN but to me
it's a
different problem.
Toshiaki Makita