PATCH 1
======

On my machine tc does not parse filter "sample" for the u32 filter. Eg:

tc filter add dev eth2 parent 1:0 protocol ip prio 1 u32 ht 801: \
    classid 1:3 \
    sample ip protocol 1 0xff match ip protocol 1 0xff
Illegal "sample"

The reason is a missing memset. This patch fixes it.

diff -Nur iproute-20051007.keep/tc/f_u32.c iproute-20051007/tc/f_u32.c
--- iproute-20051007.keep/tc/f_u32.c	2005-01-19 08:11:58.000000000 +1000
+++ iproute-20051007/tc/f_u32.c	2006-01-12 17:12:43.000000000 +1000
@@ -878,6 +878,7 @@
 			struct tc_u32_sel sel;
 			struct tc_u32_key keys[4];
 		} sel2;
+		memset(&sel2, 0, sizeof(sel2));
 		NEXT_ARG();
 		if (parse_selector(&argc, &argv, &sel2.sel, n)) {
 			fprintf(stderr, "Illegal \"sample\"\n");

PATCH 2
======

In tc, the u32 sample clause uses the 2.4 hashing algorithm. The hashing algorithm used by the kernel changed in 2.6; consequently "sample" hasn't worked since then. This patch makes the sample clause work for both 2.4 and 2.6:

diff -Nur iproute-20051007.keep/tc/f_u32.c iproute-20051007/tc/f_u32.c
--- iproute-20051007.keep/tc/f_u32.c	2006-01-12 17:34:37.000000000 +1000
+++ iproute-20051007/tc/f_u32.c	2006-02-07 17:10:29.000000000 +1000
@@ -17,6 +17,7 @@
 #include <syslog.h>
 #include <fcntl.h>
 #include <sys/socket.h>
+#include <sys/utsname.h>
 #include <netinet/in.h>
 #include <arpa/inet.h>
 #include <string.h>
@@ -874,6 +875,7 @@
 		htid = (handle&0xFFFFF000);
 	} else if (strcmp(*argv, "sample") == 0) {
 		__u32 hash;
+		struct utsname utsname;
 		struct {
 			struct tc_u32_sel sel;
 			struct tc_u32_key keys[4];
@@ -889,8 +891,19 @@
 			return -1;
 		}
 		hash = sel2.sel.keys[0].val&sel2.sel.keys[0].mask;
-		hash ^= hash>>16;
-		hash ^= hash>>8;
+		uname(&utsname);
+		if (strncmp(utsname.release, "2.4.", 4) == 0) {
+			hash ^= hash>>16;
+			hash ^= hash>>8;
+		}
+		else {
+			__u32 mask = sel2.sel.keys[0].mask;
+			while (mask && !(mask & 1)) {
+				mask >>= 1;
+				hash >>= 1;
+			}
+			hash &= 0xFF;
+		}
 		htid = ((hash<<12)&0xFF000)|(htid&0xFFF00000);
 		sample_ok = 1;
 		continue;

PATCH 3
======

"tc" does not allow you to specify the divisor for the "sample" clause; it always assumes a divisor of 256. If the divisor isn't 256 (ie it is something less), the kernel will usually whinge because the bucket given to it by "tc" is typically too big.
This patch adds a "divisor" option to tc's "sample" clause:

diff -Nur iproute-20051007.keep/tc/f_u32.c iproute-20051007/tc/f_u32.c
--- iproute-20051007.keep/tc/f_u32.c	2006-02-10 11:40:16.000000000 +1000
+++ iproute-20051007/tc/f_u32.c	2006-02-10 11:47:14.000000000 +1000
@@ -35,7 +35,7 @@
 	fprintf(stderr, "or u32 divisor DIVISOR\n");
 	fprintf(stderr, "\n");
 	fprintf(stderr, "Where: SELECTOR := SAMPLE SAMPLE ...\n");
-	fprintf(stderr, "       SAMPLE := { ip | ip6 | udp | tcp | icmp | u{32|16|8} | mark } SAMPLE_ARGS\n");
+	fprintf(stderr, "       SAMPLE := { ip | ip6 | udp | tcp | icmp | u{32|16|8} | mark } SAMPLE_ARGS [divisor DIVISOR]\n");
 	fprintf(stderr, "       FILTERID := X:Y:Z\n");
 }
@@ -835,7 +835,7 @@
 		unsigned divisor;
 		NEXT_ARG();
 		if (get_unsigned(&divisor, *argv, 0) || divisor == 0 ||
-		    divisor > 0x100) {
+		    divisor > 0x100 || (divisor - 1 & divisor)) {
 			fprintf(stderr, "Illegal \"divisor\"\n");
 			return -1;
 		}
@@ -875,6 +875,7 @@
 		htid = (handle&0xFFFFF000);
 	} else if (strcmp(*argv, "sample") == 0) {
 		__u32 hash;
+		unsigned divisor = 0x100;
 		struct utsname utsname;
 		struct {
 			struct tc_u32_sel sel;
@@ -890,6 +891,15 @@
 			fprintf(stderr, "\"sample\" must contain exactly ONE key.\n");
 			return -1;
 		}
+		if (*argv != 0 && strcmp(*argv, "divisor") == 0) {
+			NEXT_ARG();
+			if (get_unsigned(&divisor, *argv, 0) || divisor == 0 ||
+			    divisor > 0x100 || (divisor - 1 & divisor)) {
+				fprintf(stderr, "Illegal sample \"divisor\"\n");
+				return -1;
+			}
+			NEXT_ARG();
+		}
 		hash = sel2.sel.keys[0].val&sel2.sel.keys[0].mask;
 		uname(&utsname);
 		if (strncmp(utsname.release, "2.4.", 4) == 0) {
@@ -904,7 +913,7 @@
 			}
 			hash &= 0xFF;
 		}
-		htid = ((hash<<12)&0xFF000)|(htid&0xFFF00000);
+		htid = ((hash%divisor)<<12)|(htid&0xFFF00000);
 		sample_ok = 1;
 		continue;
 	} else if (strcmp(*argv, "indev") == 0) {
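The "(divisor - 1 & divisor)" test added above rejects any divisor that is not a power of two: a power of two has exactly one bit set, so subtracting 1 and ANDing with the original value clears it to zero, while any other value leaves bits behind. A minimal standalone sketch of that check (not part of the patch; the helper name is made up for illustration):

#include <stdio.h>

/* Mirrors the validation in the patch: non-zero, at most 256,
 * and a power of two. */
static int valid_sample_divisor(unsigned divisor)
{
	return divisor != 0 && divisor <= 0x100 &&
	       ((divisor - 1) & divisor) == 0;
}

int main(void)
{
	unsigned tests[] = { 0, 1, 16, 100, 256, 257 };
	unsigned i;

	for (i = 0; i < sizeof(tests)/sizeof(tests[0]); i++)
		printf("%3u -> %s\n", tests[i],
		       valid_sample_divisor(tests[i]) ? "ok" : "rejected");
	return 0;	/* 1, 16 and 256 pass; 0, 100 and 257 are rejected */
}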
On Sat, 2006-03-11 at 08:11 -0500, jamal wrote:
> > On my machine tc does not parse filter "sample" for the u32
> > filter. Eg:
> >
> > tc filter add dev eth2 parent 1:0 protocol ip prio 1 u32 ht 801: \
> >     classid 1:3 \
> >     sample ip protocol 1 0xff match ip protocol 1 0xff
> > Illegal "sample"
>
> The syntax is correct but ht 801: must exist - that is why it is being
> rejected.. So there is absolutely categorically _no way in hell_ your
> memset would have fixed it ;-> Apologies for being overdramatic ;->

You are wrong on both counts.

Firstly, the error message came from tc when it parsed the line. If tc gets an error talking to the kernel it says, among other things: "We have an error talking to the kernel". Ergo, it hadn't given the command to the kernel yet. This is significant, because the only place that knows whether ht 801: has been created or not is the kernel, so the error message can't depend on whether the table had been created. As it happens I did create the ht before issuing the command shown when I first struck the bug - but I didn't include that in my example because it was not relevant.

As for _no way in hell_ - there are apparently more ways in hell than you are aware of. If you look at the function pack_key() in tc/f_u32.c, you will see it assumes the "sel" parameter it is passed is initialised. Without the added memset() it isn't - it just contains random crap. Of course on some machines (perhaps yours?) that random crap might be 0, and then it would work. That is why I said at the start of my patch "On my machine tc does not ...". On other machines the bug may not appear.

> sample never worked 100% of the time with that hash. It works _most_ of
> the time with 256 buckets. In fact it will work some of the time as it is
> right now with 2.6.x. Can you post the output of tc -s filter ls on 2.6
> with and without your user space change?

Re: "sample never worked 100% of the time with that hash". Can you give an example? As far as I know it always worked.

Re: "it will work some of the time as it is right now with 2.6.x". Yes - it will work when you are sampling one byte. I am sampling port numbers, which are two bytes. It will not work in any case where there are two non-zero bytes.

Re: "Can you post the output of tc -s filter ls on 2.6 with and without your user space change?". Here it is.

With my change:

tc qdisc add dev eth0 root handle 1: htb
tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 801: divisor 256
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample tcp src 0x369 0xffff match tcp src 0x0369 0xffff
tc -s filter show dev eth0
filter parent 1: protocol ip pref 1 u32
filter parent 1: protocol ip pref 1 u32 fh 801: ht divisor 256
filter parent 1: protocol ip pref 1 u32 fh 801:3:800 order 2048 key ht 801 bkt 3 flowid 1:
  match 03690000/ffff0000 at nexthdr+0
filter parent 1: protocol ip pref 1 u32 fh 800: ht divisor 1

On the original "tc" shipped with debian "sarge":

tc qdisc add dev eth0 root handle 1: htb
tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 801: divisor 256
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample tcp src 0x369 0xffff match tcp src 0x0369 0xffff
Illegal "sample"

Ooops.
Looks like I can't get out of this without patching and compiling:

tc qdisc add dev eth0 root handle 1: htb
tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 801: divisor 256
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample tcp src 0x369 0xffff match tcp src 0x0369 0xffff
tc -s filter show dev eth0
filter parent 1: protocol ip pref 1 u32
filter parent 1: protocol ip pref 1 u32 fh 801: ht divisor 256
filter parent 1: protocol ip pref 1 u32 fh 801:6a:800 order 2048 key ht 801 bkt 6a flowid 1:
  match 03690000/ffff0000 at nexthdr+0
filter parent 1: protocol ip pref 1 u32 fh 800: ht divisor 1

Note: this also answers your request in your other email re: "can you give me an example that doesn't work".

> Heres how you should fix this:
> Patch 1) fix kernel 2.4.x to be like 2.6.x
> Patch 2) fix iproute2 to have the same syntax as for 2.6 and put a big
> bold note in the code that if anyone changes that part of the code to
> look at the kernel u32_hash_fold() routine.
> No need for the utsname check.

Why do it that way? If you want to put the 2.6 hashing algorithm in 2.4 then go ahead - but that is a separate decision which is not related to the issue of making tc backwards compatible. Making tc work with all versions of the kernel is what I am doing here. As an example of why this is a good idea, Debian ships 2.4 and 2.6 kernels, and one version of tc. That tc should work with both kernels.

Finally, regarding which hashing algorithm is better, your results differ from mine. First a bit of background: I am trying to make VOIP work over some 256/64 and 512/128 links that carry all sorts of other traffic. The patches you see here are a result of me trying to make that work over a range of sites (companies and home setups). The end result is that it did work, better than I expected it to. On a 512/128 ADSL link, I can:

a. Saturate the incoming direction with a large wget,
b. Saturate the outgoing direction with a large wget,
c. Hit the incoming direction with a "while :; do wget http://www.google.com/ -O /dev/null; done".
d. Hit the outgoing direction with an external box doing the same while loop.

While all that is going on, I can sustain 2 VOIP calls with perfect clarity on that link. I wasn't expecting to be able to pull that off. To pull it off I created the u32 filter from hell. This long winded explanation is to forestall the inevitable "why the hell would you want to do that" flame when you see the script that follows.

You can find the shell script that creates the filter here:

  http://www.stuart.id.au/russell/files/tc/setup-traffic-control.sh

The script itself is not that important. What is important is that it is a real-life use of u32, with approx 1200 hashed u32 filter lines. So, how to measure how good a job the hash is doing? The easiest would seem to be a least squares fit, ie:

  Number of elements to be hashed       = E
  Number of buckets                     = B (== 256)
  Optimal number of elements per bucket = E/B
  Hash function "goodness" metric       = M
    M = sqrt(sum over each bucket i of (NrOfElementsInBucket(i) - E/B)^2 / B)

For a good hash function M < 1. The bigger M is, the worse the hash function is. I wrote a python program to compute M for my u32 filter.
The python program can be found here:

  http://www.stuart.id.au/russell/files/tc/m.py

The dataset it operates on can be found here:

  http://www.stuart.id.au/russell/files/tc/tc-ports

Results:

  tcp 2.6: E=534 M=2.35
  tcp 2.4: E=534 M=0.82
  udp 2.6: E=711 M=2.91
  udp 2.4: E=711 M=0.92

As you can probably tell, I see the new hash function in 2.6 as a backward step - not an improvement.
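For anyone who wants to play with the metric without the real dataset, here is a minimal C sketch of M as defined above (the toy distribution is made up purely to exercise the function; it is not my port data):

#include <math.h>
#include <stdio.h>

#define NBUCKETS 256

/* M = sqrt(sum over buckets of (count[i] - E/B)^2 / B) */
static double goodness(const unsigned count[NBUCKETS], unsigned nelems)
{
	double optimal = (double)nelems / NBUCKETS;	/* E/B */
	double sum = 0.0;
	int i;

	for (i = 0; i < NBUCKETS; i++) {
		double d = (double)count[i] - optimal;
		sum += d * d;
	}
	return sqrt(sum / NBUCKETS);
}

int main(void)
{
	unsigned count[NBUCKETS] = { 0 };
	int i;

	/* Toy distribution: 512 elements crammed into 16 of the 256
	 * buckets - the kind of clustering a bad fold produces. */
	for (i = 0; i < 16; i++)
		count[i * 16] = 32;
	printf("M = %.2f\n", goodness(count, 512));	/* prints 7.75 */
	return 0;
}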
jamal wrote:
> On Mon, 2006-13-03 at 14:44 +1000, Russell Stuart wrote:
>
>> You are wrong on both counts.
>
> I am wrong on why it is being rejected - but what you are seeing is
> worse than I thought initially.
>
> Lets put it this way:
> The only way you will _ever_ get that message is if you had made a syntax
> error (which you did not). Please look at the code on where that message
> appears and:
>
> a) tell me how you would have got that message to begin with using
> perfectly legal syntax.
> b) tell me how a memset would have fixed that.

He already told you: pack_key expects the selector to be initialized, otherwise nkeys might contain a value >= 128, which would cause exactly this error if a matching key is not found within the uninitialized memory by accident.

> Just send the memset fix to Stephen with a different reason. Your
> current reason is _wrong_ and I really dont have much time to have this
> kind of discussion.
> If you had said "I added that memset there because it looks like the
> right thing to do" then we would not have had this discussion.
>
> You made claims you fixed a bug. It cant possibly be the bug you fixed.
> Was it some other bug perhaps and you mixed up the two?

The patch as well as the description are perfectly fine.

BTW, running valgrind on tc shows lots of uses of uninitialized values; it seems like a good idea if someone would go over these and fix them up.
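To make the failure mode concrete, here is a simplified standalone illustration (the struct below is a stand-in, not tc's real tc_u32_sel; reading the uninitialized copy is deliberately the same undefined, machine-dependent behaviour being reported):

#include <stdio.h>
#include <string.h>

struct sel { int nkeys; };		/* stand-in for tc_u32_sel */

static int add_key(struct sel *s)
{
	if (s->nkeys >= 4)		/* key array looks "full" */
		return -1;		/* -> the Illegal "sample" path */
	s->nkeys++;
	return 0;
}

int main(void)
{
	struct sel garbage;		/* uninitialized, like sel2 before the fix */
	struct sel clean;

	memset(&clean, 0, sizeof(clean));	/* the PATCH 1 fix */
	printf("clean:   %s\n", add_key(&clean) ? "rejected" : "ok");
	/* Whether this succeeds depends on whatever happened to be on
	 * the stack - it works on some machines and fails on others. */
	printf("garbage: %s\n", add_key(&garbage) ? "rejected" : "ok");
	return 0;
}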
On Mon, 13 Mar 2006 18:55:35 +0100
Patrick McHardy <kaber@trash.net> wrote:

> jamal wrote:
> > On Mon, 2006-13-03 at 14:44 +1000, Russell Stuart wrote:
> >
> >> You are wrong on both counts.
> >
> > I am wrong on why it is being rejected - but what you are seeing is
> > worse than I thought initially.
> >
> > Lets put it this way:
> > The only way you will _ever_ get that message is if you had made a syntax
> > error (which you did not). Please look at the code on where that message
> > appears and:
> >
> > a) tell me how you would have got that message to begin with using
> > perfectly legal syntax.
> > b) tell me how a memset would have fixed that.
>
> He already told you: pack_key expects the selector to be initialized,
> otherwise nkeys might contain a value >= 128, which would cause
> exactly this error if a matching key is not found within the
> uninitialized memory by accident.
>
> > Just send the memset fix to Stephen with a different reason. Your
> > current reason is _wrong_ and I really dont have much time to have this
> > kind of discussion.
> > If you had said "I added that memset there because it looks like the
> > right thing to do" then we would not have had this discussion.
> >
> > You made claims you fixed a bug. It cant possibly be the bug you fixed.
> > Was it some other bug perhaps and you mixed up the two?

The memset fix is in current CVS. I just wasn't going to take the patch that looked at utsname to decide what hash to use.

> The patch as well as the description are perfectly fine.
>
> BTW, running valgrind on tc shows lots of uses of uninitialized values,
> it seems like a good idea if someone would go over these and fix them
> up.

If we had a test script of commands (code coverage), that would help.
Stephen Hemminger wrote:
> The memset fix is in current CVS. I just wasn't going to take the
> patch that looked at utsname to decide what hash to use.

Yes, that sucks. We have another incompatibility: old iproute versions (like the one shipped by Debian) show garbage statistics for HFSC. I think it still works properly, but I wasn't able to track down the cause. I have a feeling it's somehow related to the gen_stats changes.

>> BTW, running valgrind on tc shows lots of uses of uninitialized values,
>> it seems like a good idea if someone would go over these and fix them
>> up.
>
> If we had a test script of commands (code coverage), that would help.

Getting good coverage looks like a lot of work, but I'll put it on my TODO list for a boring day :)
Stephen Hemminger wrote:
>> BTW, running valgrind on tc shows lots of uses of uninitialized values,
>> it seems like a good idea if someone would go over these and fix them
>> up.
>
> If we had a test script of commands (code coverage), that would help.

Actually the Coverity scanner is quite good at spotting these problems, so I've requested iproute to be added to the list of regularly scanned projects.
On Mon, 2006-03-13 at 10:04 -0800, Stephen Hemminger wrote:
> The memset fix is in current CVS. I just wasn't going to take the
> patch that looked at utsname to decide what hash to use.

Stephen, could you describe your objections to it please?
On Tue, 14 Mar 2006 07:43:57 +1000
Russell Stuart <russell-lartc@stuart.id.au> wrote:

> On Mon, 2006-03-13 at 10:04 -0800, Stephen Hemminger wrote:
> > The memset fix is in current CVS. I just wasn't going to take the
> > patch that looked at utsname to decide what hash to use.
>
> Stephen, could you describe your objections to it please?

Because it means that the API for netlink isn't being used properly.
On Mon, 2006-03-13 at 13:50 -0800, Stephen Hemminger wrote:
> On Tue, 14 Mar 2006 07:43:57 +1000
> Russell Stuart <russell-lartc@stuart.id.au> wrote:
>
> > On Mon, 2006-03-13 at 10:04 -0800, Stephen Hemminger wrote:
> > > The memset fix is in current CVS. I just wasn't going to take the
> > > patch that looked at utsname to decide what hash to use.
> >
> > Stephen, could you describe your objections to it please?
>
> Because it means that the API for netlink isn't being used properly.

Do you mean the binary API between the kernel and the applications has been broken; this is bad as it transgresses some unwritten law; and the patch is bad because it hides this problem rather than addressing it? Does the fact that the binary API changed during a major kernel release (between 2.4 and 2.6) give us some wriggle room here?

Anyway, jokes aside, the situation we have now is that the current "tc" doesn't work with the current kernel. This situation can't be allowed to continue, I presume. Ergo, here is a list of things we could do to fix it. All you (plural) have to do is choose one, or some combination:

1. Change the hashing algorithm back to the 2.4 way. (IMHO, the 2.4
   algorithm was better.) Disadvantage: anybody who had a u32 filter
   that hashed on more than one non-zero byte will have their u32
   filters suddenly break. However, since the hashing algorithm was
   never documented beyond HOWTOs that showed how to hash on one byte,
   and every example of hashing I have seen has been a copy & paste of
   said HOWTOs, my guess is there are stuff-all people who will be
   affected. Another disadvantage: people who are using older 2.6
   kernels (eg me on my production machines) won't have the problem
   fixed.

2. Change the hashing algorithm in "tc" to match the current kernel.
   Disadvantage: "tc" will no longer work with 2.4 kernels.

3. Change the hashing algorithm in "tc" to match the current kernel,
   and change the 2.4 kernel's hashing algorithm to match the 2.6
   kernel. This is Jamal's proposed solution. Disadvantage: 2.4 is a
   "stable" kernel, produced when we made a real effort to distinguish
   between stable and development kernels. This change would break API
   compatibility in a stable series. Yuk!

4. Make "tc" look at the kernel version, and choose the appropriate
   hashing function. This was my solution, and everybody seems to hate
   it bar me :(. Disadvantages: none, other than it hides a violation
   of an "unwritten law".

5. A combination of 1 & 4. Change the 2.6 hashing algorithm back to
   what it was in 2.4, and hide the change by making "tc" check the
   kernel version and choose the matching hash. Disadvantages: none,
   other than now we have wilfully broken the unwritten law twice.

My personal preference is to have a "tc" in CVS that works with _all_ kernel versions. Yes - the netlink interface has been broken. But it was done, and now can't be undone. No matter what we do, there will forever more be kernel versions with incompatible netlink interfaces. We can't fix it. We just have to limit the damage.
On Sat, 2006-03-11 at 08:11 -0500, jamal wrote:
> On Fri, 2006-10-02 at 12:33 +1000, Russell Stuart wrote:
> > This patch adds a "divisor" option to tc's "sample"
> > clause:
>
> While this looks right - can we have more test data with tc filter ls
> both before and after your fix? Specify divisor of 256 and 16 for
> example. Show that for the 256 it is the same as before and for 16 it
> does the right thing.

With patch, divisor 256:

tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: htb
tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 801: divisor 256
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x01 0xff divisor 256 match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x0f 0xff divisor 256 match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x10 0xff divisor 256 match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0xef 0xff divisor 256 match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0xf0 0xff divisor 256 match u32 0 0
tc -s filter show dev eth0
filter parent 1: protocol ip pref 1 u32
filter parent 1: protocol ip pref 1 u32 fh 801: ht divisor 256
filter parent 1: protocol ip pref 1 u32 fh 801:1:800 order 2048 key ht 801 bkt 1 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:f:800 order 2048 key ht 801 bkt f flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:10:800 order 2048 key ht 801 bkt 10 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:ef:800 order 2048 key ht 801 bkt ef flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:f0:800 order 2048 key ht 801 bkt f0 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 800: ht divisor 1

Without patch, divisor 256:

tc qdisc add dev eth0 root handle 1: htb
tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 801: divisor 256
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x01 0xff match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x0f 0xff match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x10 0xff match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0xef 0xff match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0xf0 0xff match u32 0 0
tc -s filter show dev eth0
filter parent 1: protocol ip pref 1 u32
filter parent 1: protocol ip pref 1 u32 fh 801: ht divisor 256
filter parent 1: protocol ip pref 1 u32 fh 801:1:800 order 2048 key ht 801 bkt 1 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:f:800 order 2048 key ht 801 bkt f flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:10:800 order 2048 key ht 801 bkt 10 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:ef:800 order 2048 key ht 801 bkt ef flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:f0:800 order 2048 key ht 801 bkt f0 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 800: ht divisor 1

With patch, divisor 16:

tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: htb
tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 801: divisor 16
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x01 0xff divisor 16 match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x0f 0xff divisor 16 match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x10 0xff divisor 16 match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0xef 0xff divisor 16 match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0xf0 0xff divisor 16 match u32 0 0
tc -s filter show dev eth0
filter parent 1: protocol ip pref 1 u32
filter parent 1: protocol ip pref 1 u32 fh 801: ht divisor 16
filter parent 1: protocol ip pref 1 u32 fh 801::800 order 2048 key ht 801 bkt 0 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801::801 order 2049 key ht 801 bkt 0 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:1:800 order 2048 key ht 801 bkt 1 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:f:800 order 2048 key ht 801 bkt f flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:f:801 order 2049 key ht 801 bkt f flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 800: ht divisor 1
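As a sanity check, the divisor 16 buckets above can be predicted from the patched computation alone. A small sketch (not from the patch set, though the fold and modulo are copied from it; I am assuming tc packs a "u8 X 0xff" sample into the top byte of the 32-bit key, which doesn't change the result since the fold right-aligns it anyway):

#include <stdio.h>

static unsigned bucket(unsigned val, unsigned mask, unsigned divisor)
{
	unsigned hash = val & mask;

	/* 2.6-style fold from PATCH 2 */
	while (mask && !(mask & 1)) {
		mask >>= 1;
		hash >>= 1;
	}
	hash &= 0xFF;
	return hash % divisor;		/* bucket index, as in PATCH 3 */
}

int main(void)
{
	unsigned samples[] = { 0x01, 0x0f, 0x10, 0xef, 0xf0 };
	int i;

	for (i = 0; i < 5; i++)
		printf("sample u8 0x%02x 0xff -> bkt %x\n", samples[i],
		       bucket(samples[i] << 24, 0xff000000u, 16));
	return 0;	/* buckets 1, f, 0, f, 0 - matching the output above */
}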
On Sat, 2006-03-11 at 10:56 -0500, jamal wrote:
> Right - take a look at the use of hashkey with varying divisors to see
> where the 2.4 folding breaks[1]. Note you should be able to use hashkey
> instead of sample and it would work fine.

<snip>

> [1] Essentially, if you teach u32 in 2.4 to spread rules over different
> buckets it will not do so evenly. Off top of my head i remember that the
> best you could ever do is have bucket selection in increments of 4 (so
> bucket 0, 4, 8, ...) with certain divisor sizes - mostly works with 256;
> to work properly you need it to be (0, 1, 2, 3, ...).
> implies --> When it does work, it is detrimental to lookup speed when
> you hit thousands of rules because some buckets get overloaded and
> others are totally unused, and when it doesnt work, it results in
> classifier misses for rules you inserted.

Hmm. I can't see how this could be so. Can you give specific examples? The only time I can think of where the 2.4 XOR hash would be worse is where there is a correlation between the bits in different bytes. I can't think of a real life example of where that would be so. In every other case it will be as good as, but typically better than, the 2.6 algorithm.
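To make the one byte vs two byte point concrete, here is a small sketch contrasting the two folds on a pair of ports that differ only in their high byte (the folds are transcribed from the patches earlier in this thread; the port values are arbitrary):

#include <stdio.h>

static unsigned fold_2_4(unsigned hash)
{
	hash ^= hash >> 16;
	hash ^= hash >> 8;
	return hash & 0xff;
}

static unsigned fold_2_6(unsigned hash, unsigned mask)
{
	while (mask && !(mask & 1)) {
		mask >>= 1;
		hash >>= 1;
	}
	return hash & 0xff;
}

int main(void)
{
	/* Two ports with equal low bytes, packed the way tc packs
	 * "tcp src": value in the top 16 bits, mask 0xffff0000. */
	unsigned ports[] = { 0x0169, 0x0369 };
	int i;

	for (i = 0; i < 2; i++) {
		unsigned key = ports[i] << 16;
		printf("port 0x%04x: 2.4 bkt %02x, 2.6 bkt %02x\n",
		       ports[i], fold_2_4(key), fold_2_6(key, 0xffff0000u));
	}
	return 0;	/* 2.6 gives 0x69 for both (a collision);
			   2.4 gives 0x68 and 0x6a */
}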
<snip>Much discussion bashing this issue to death.</snip>

(sorry, jamal - this one is CC'ed to lartc.)

Here are revised versions of the 2 as yet unapplied patches.

PATCH 1
======

[Has been applied.]

PATCH 2
======

In tc, the u32 sample clause uses the 2.4 hashing algorithm. The hashing algorithm used by the kernel changed in 2.6; consequently "sample" hasn't worked since then. This patch makes the sample clause work with 2.6 only. This is different from the prior version of the patch, which made it work with both 2.4 and 2.6:

diff -Nur iproute-20051007.keep/tc/f_u32.c iproute-20051007/tc/f_u32.c
--- iproute-20051007.keep/tc/f_u32.c	2006-03-14 12:28:17.000000000 +1000
+++ iproute-20051007/tc/f_u32.c	2006-03-16 11:08:25.000000000 +1000
@@ -888,8 +888,18 @@
 			return -1;
 		}
 		hash = sel2.sel.keys[0].val&sel2.sel.keys[0].mask;
+#if 0
+		/* 2.2 .. 2.4 hashing algorithm */
 		hash ^= hash>>16;
 		hash ^= hash>>8;
+#else
+		/* 2.5 onwards */
+		__u32 mask = sel2.sel.keys[0].mask;
+		while (mask && !(mask & 1)) {
+			mask >>= 1;
+			hash >>= 1;
+		}
+#endif
 		htid = ((hash<<12)&0xFF000)|(htid&0xFFF00000);
 		sample_ok = 1;
 		continue;

PATCH 3
======

"tc" does not allow you to specify the divisor for the "sample" clause; it always assumes a divisor of 256. If the divisor isn't 256 (ie it is something less), the kernel will usually whinge because the bucket given to it by "tc" is typically too big. This patch adds a "divisor" option to tc's "sample" clause. This is identical to the previous version of the patch, other than it now applies correctly after the revised PATCH 2.

diff -Nur iproute-20051007.keep/tc/f_u32.c iproute-20051007/tc/f_u32.c
--- iproute-20051007.keep/tc/f_u32.c	2006-03-16 11:27:17.000000000 +1000
+++ iproute-20051007/tc/f_u32.c	2006-03-16 11:29:56.000000000 +1000
@@ -34,7 +34,7 @@
 	fprintf(stderr, "or u32 divisor DIVISOR\n");
 	fprintf(stderr, "\n");
 	fprintf(stderr, "Where: SELECTOR := SAMPLE SAMPLE ...\n");
-	fprintf(stderr, "       SAMPLE := { ip | ip6 | udp | tcp | icmp | u{32|16|8} | mark } SAMPLE_ARGS\n");
+	fprintf(stderr, "       SAMPLE := { ip | ip6 | udp | tcp | icmp | u{32|16|8} | mark } SAMPLE_ARGS [divisor DIVISOR]\n");
 	fprintf(stderr, "       FILTERID := X:Y:Z\n");
 }
@@ -834,7 +834,7 @@
 		unsigned divisor;
 		NEXT_ARG();
 		if (get_unsigned(&divisor, *argv, 0) || divisor == 0 ||
-		    divisor > 0x100) {
+		    divisor > 0x100 || (divisor - 1 & divisor)) {
 			fprintf(stderr, "Illegal \"divisor\"\n");
 			return -1;
 		}
@@ -874,6 +874,7 @@
 		htid = (handle&0xFFFFF000);
 	} else if (strcmp(*argv, "sample") == 0) {
 		__u32 hash;
+		unsigned divisor = 0x100;
 		struct {
 			struct tc_u32_sel sel;
 			struct tc_u32_key keys[4];
@@ -888,6 +889,15 @@
 			fprintf(stderr, "\"sample\" must contain exactly ONE key.\n");
 			return -1;
 		}
+		if (*argv != 0 && strcmp(*argv, "divisor") == 0) {
+			NEXT_ARG();
+			if (get_unsigned(&divisor, *argv, 0) || divisor == 0 ||
+			    divisor > 0x100 || (divisor - 1 & divisor)) {
+				fprintf(stderr, "Illegal sample \"divisor\"\n");
+				return -1;
+			}
+			NEXT_ARG();
+		}
 		hash = sel2.sel.keys[0].val&sel2.sel.keys[0].mask;
 #if 0
 		/* 2.2 .. 2.4 hashing algorithm */
@@ -901,7 +911,7 @@
 			hash >>= 1;
 		}
 #endif
-		htid = ((hash<<12)&0xFF000)|(htid&0xFFF00000);
+		htid = ((hash%divisor)<<12)|(htid&0xFFF00000);
 		sample_ok = 1;
 		continue;
 	} else if (strcmp(*argv, "indev") == 0) {
On Thu, 2006-03-16 at 08:51 -0500, jamal wrote:
> > BTW, in this example, the new hash I suggested would be as good
> > as the 2.6 case.
>
> Yes, if you used 256 buckets per hash table ;->

No - you haven't understood what the new algorithm does. It will get the same performance as the 2.6 version with the same memory. In fact for all cases where the number of bits used is <= 8 it is _identical_ to the 2.6 algorithm.

> Wait till you start getting into V6. Or try

It is odd that you keep mentioning IPv6. The IPv6 header has no fields that are less than 8 bits, and there are none that are not aligned on an 8 bit boundary. In fact even within the IPv6 addresses there are no sub-fields less than 8 bits. It would be a happy hunting ground for the 2.4 algorithm.

> building an effective fast lookup scheme for a five tuple lookup using
> u32; five tuples being {src/dst IP, IP protocol, src/dst port}
> So in simple setups where you say dont exceed a few hundred hash tables,
> and a few hundred to thousand filters, the old hash was fine.

> > Now lets take the case that we are hashing a number of bytes with
> > a 256 divisor (my case). If these bytes contain truly random
> > values, then again 2.4 and 2.6 will be the same.
>
> But they are not.

Well, this is the crux of the matter, isn't it? You are apparently used to looking up well known patterns - probably ones you determine. As such, you can arrange these in nice logical ranges. Guess what? The very thing that started this off was me looking up TCP & UDP port numbers. I didn't determine those port numbers. They are all over the shop - for me they are effectively random data. And they are sparse too, as in they occupy 16 bits. The whole point of the rant that followed was to explain to you why in a case like that the 2.6 hash runs substantially slower than the 2.4 one. What's more, it can't be fixed by simply trading off more memory. But you seem to be burying your head in the sand and saying "such data sets don't exist". Well they do. And I gave you a real life one.

Here is another one: I have contemplated at times giving priority to Australian traffic. So I went and found myself a list of Australian IP addresses. Being allocated by some central authority, I expected some nice ordered data set. How naive. There are some 400 subnets, and after trying to find some pattern I gave up after an hour. Another random dataset. The 2.4 algorithm will handle that well. When you changed the hash algorithm, you optimised it for your particular world - at the expense of people who have data sets like mine. Note that users of 2.4 with your dataset have a way out - waste a bit of memory and it will run just as fast. With a large dataset and 2.6 there is no way out. You are probably doubting this as you read it - I explain why below.

> > The 2.4 XOR's
> > the two values together. XOR has the property that it "adds"
> > the randomness of the bits together, unless they are correlated.
> > So if you take two partially random bits, and XOR them together,
> > then the resulting bit will be more random than the original two
> > bits. An illustration of this from crypto is a stream cypher
> > like rc4. rc4 effectively produces a random stream of bits.
> > To use rc4, you XOR your plain text with this random stream.
> > Even though your plain text is highly non-random, the cypher
> > text is at least as random as the rc4 stream - and thus looks
> > like gibberish. Anyway, the end result for 2.4 is that if you
> > XOR two perfectly random bytes, the result is a perfectly random
> > byte. So for random data 2.6 and 2.4 are the same.
>
> To use a term which makes sense up here, you are treading on thin ice
> now ;-> You dont wanna skate on this river ;->
>
> Randomness has nothing to do with this. It doesnt matter whether
> random traffic patterns arrive at all.
> - assume that in 8 bits of data, 6 bits provide the most entropy.
> - assume 256 hosts (just to cover the range of the 8 bits)
>
> a) If i built a hash table with 256 buckets, i can guarantee that
> i will find any host using either scheme in 4 lookups.
> Except 75% of the buckets will not be used by either.

You miss one point. While you can guarantee it will happen in 4 lookups, 2.4 averages slightly more than 1 lookup if it hashes the entire value in one hit. In 2.6, you are stuck with your 4 lookups, as you say.

> b) If i built a hash table with 128 buckets, I can guarantee i will
> find any host in 4 lookups with the new scheme but it will take
> a best case of 8 lookups with the old scheme.
> Except 50% of the buckets will not be used by the new scheme and
> 75% not be used by the old scheme.
>
> c) if i built a hash table with 64 buckets, I can guarantee that i will
> find any host in 4 lookups with the new scheme and 16 in the old scheme.
> 100% of the buckets will be used by the new scheme; only 25% will be
> used by the old scheme.
>
> d) If i built a hash table of 32 buckets, I can guarantee that i will
> find any host in 8 lookups with new scheme and _32_ with old.
> 100% used by the new scheme and 25% by old
>
> See the pattern? But this is not the worst part. The nasty part
> is if i built a newer level of hash tables so i can reduce the lookups,
> it totally reduces my effectiveness; i need to figure out which buckets
> are being hit etc.

The pattern is based on the (unstated) premise that you are dropping the least significant bits in the byte in your hashing function. Perhaps you are assuming you know where the random bits live in the address. I wasn't so lucky with my dataset. However, lets go with the original premise - the data is truly random, and you don't know where the bits are. When you drop the least significant bits 2.4 behaves badly. But since the data is random, you are better off dropping the upper bits when using 2.4. If you do that then the 2.4 and 2.6 hashes perform identically when used as you describe above.

In other words, the example didn't show what you intended it to show. It showed that if you do the optimal thing for 2.6 and use a tree of hash tables then 2.4 and 2.6 can be made to perform identically. If you do the optimal thing for 2.4, then 2.4 can be made to run almost 4 times faster than 2.6, on average.

> You have not convinced me that, in the case of multibyte, the old one is
> good for anything other than the one example you had with a fixed mask
> and fixed number of buckets.
> I hope you see the value of varying the parameters i mentioned now (it
> should be very clear on the hash bucket count and mask i hope).
> Sorry, but a lot more work is needed and you seem to be in a rush to get
> there ;->

As you can probably gather, all we seem to have done here is concluded that there are two sorts of data sets - those that are nicely behaved like yours, and those that aren't, like mine. You seem to be denying mine exist, which is a pity. 2.6 works well for yours. 2.4 works well for mine.
Judging from these initial comments:

  russell: BTW, in this example, the new hash I suggested would be
           as good as the 2.6 case.
  jamal:   Yes, if you used 256 buckets per hash table.

I don't seem to have explained the new algorithm very well. Let me try again. There is nothing inherently new about it. It is just a combination of Alexey's 2.4 algorithm and your 2.6 version:

  2.4 Algorithm:            2.6 algorithm:         New Algorithm:
  --------------            --------------         --------------
  hash = (hash>>16)^hash;   hash = hash>>shift;    hash = hash>>shift;
  hash = (hash>>8)^hash;    return hash & 0xff;    hash = (hash>>16)^hash;
  return hash & 0xff;                              hash = (hash>>8)^hash;
                                                   return hash & 0xff;

How the new one performs:

(a) For all fields <= 8 bits, it is identical to 2.6.
(b) For all fields aligned on an 8 bit boundary, it is identical to 2.4.
(c) Importantly, on a single 8 bit aligned value, it is identical to
    both 2.4 and 2.6: it is the identity function. This keeps the
    cut & pasters happy.

Now everyone wins: those with nice orderly datasets like yours, and those with sparse, random data sets like mine. Why not change over to it?
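To save anyone transcribing the table, here is the same comparison as runnable C (my rendering; "shift" is assumed to be the shift that right-aligns the key's mask, as in the 2.6 fold earlier in the thread):

#include <stdio.h>

static unsigned hash_2_4(unsigned hash)
{
	hash = (hash >> 16) ^ hash;
	hash = (hash >> 8) ^ hash;
	return hash & 0xff;
}

static unsigned hash_2_6(unsigned hash, unsigned shift)
{
	hash = hash >> shift;
	return hash & 0xff;
}

static unsigned hash_new(unsigned hash, unsigned shift)
{
	hash = hash >> shift;
	hash = (hash >> 16) ^ hash;
	hash = (hash >> 8) ^ hash;
	return hash & 0xff;
}

int main(void)
{
	/* An 8 bit right-aligned field: all three are the identity,
	 * printing 45 45 45 - property (c) above. */
	printf("%02x %02x %02x\n",
	       hash_2_4(0x45), hash_2_6(0x45, 0), hash_new(0x45, 0));
	/* A 16 bit field: the new algorithm mixes both bytes like 2.4
	 * (6a 6a), while 2.6 keeps only the low byte (69). */
	printf("%02x %02x %02x\n",
	       hash_2_4(0x0369), hash_2_6(0x0369, 0), hash_new(0x0369, 0));
	return 0;
}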
On Fri, 2006-03-17 at 09:34 -0500, jamal wrote:
> - the 2.4 algorithm performs very poorly for < 8 bits if you use a
> standard mask for ALL cases except when you use a lot of memory, most of
> which is _wasted_, in which case it performs equally. There are only 8
> masks in an 8 bit field, ranging in length from 1 to 8. Should not be
> hard to prove or disprove. Should be very easy to see when you plot.

Agreed.

> - You made the claim the 2.6 algorithm performs poorly if you had
> a 16 bit mask. You showed one sample of data. I dont even remember your
> sample anymore because you keep bombarding me with a lot of claims.
> I gave you the parameters which will help convince me. I showed you a
> sample using similar parameters why the old one was buggy in the case of
> 8 bits. There are only 16 possible masks for 16 bits (of length 1-16).
> Why cant you just give me similar results? Why do we have to go back
> and forth with a lot of hand waving when we can settle this easily?

I guess there are a couple of points here I don't understand:

- I don't see how 2.4 was "buggy", but perhaps we have different
  definitions of buggy. I regard giving the wrong result as buggy.
  Neither algorithm does that.

- I don't understand your point about "there are only 16 possible
  masks for 16 bits". What do you want me to show?

> I will not respond to any further emails which do not contain
> data - instead I am going to produce mine.

Put the 2.4 vs 2.6 argument aside. The best solution is to adopt the "new" algorithm I proposed. Here is why:

1. It can emulate either the 2.4 or 2.6 algorithm by a suitable choice
   of mask. I can formally prove this if you need me to.

2. There is no dataset where the "new" algorithm is slower than either
   2.4 or 2.6. This follows from (1).

3. There are datasets where it is faster than the 2.6 algorithm, and
   datasets where it is faster than the 2.4 algorithm. This follows
   from the fact that there are datasets where 2.6 is faster than 2.4,
   and datasets where 2.4 is faster than 2.6. I think you know this
   already, but if not I can give some simple examples to prove it -
   just ask.

4. Timing. If we are going to change the algorithm, this is the time
   to do it - while the algorithm in "tc" is wrong and must be changed
   anyway.

For your convenience I have cut & pasted from the previous email the comparison of the 2.4, 2.6 and "new" algorithms:

  2.4 Algorithm:            2.6 algorithm:         New Algorithm:
  --------------            --------------         --------------
  hash = (hash>>16)^hash;   hash = hash>>shift;    hash = hash>>shift;
  hash = (hash>>8)^hash;    return hash & 0xff;    hash = (hash>>16)^hash;
  return hash & 0xff;                              hash = (hash>>8)^hash;
                                                   return hash & 0xff;
On Fri, 2006-03-17 at 09:34 -0500, jamal wrote:
> If you are unable to do this then I will. I just dont have time until
> this Sunday.
> I will not respond to any further emails which do not contain data -
> instead I am going to produce mine.

After that wrist-slap I spent some time putting together some data. I am still not really sure what you are after, so if this isn't it please let me know:

  http://ace-host.stuart.id.au/russell/files/tc/hash-analysis/

One other thing: I made a rather embarrassing error earlier. When I computed the metrics I posted about 2.4 and 2.6, I emulated the 2.6 hash incorrectly. Had it been correct, rather than showing a landslide win for 2.4, it would have shown that 2.6 was slightly better.