PATCH 1
======

On my machine tc does not parse filter "sample" for the u32 filter. Eg:

tc filter add dev eth2 parent 1:0 protocol ip prio 1 u32 ht 801: \
    classid 1:3 \
    sample ip protocol 1 0xff match ip protocol 1 0xff
Illegal "sample"

The reason is a missing memset. This patch fixes it.

diff -Nur iproute-20051007.keep/tc/f_u32.c iproute-20051007/tc/f_u32.c
--- iproute-20051007.keep/tc/f_u32.c	2005-01-19 08:11:58.000000000 +1000
+++ iproute-20051007/tc/f_u32.c	2006-01-12 17:12:43.000000000 +1000
@@ -878,6 +878,7 @@
 			struct tc_u32_sel sel;
 			struct tc_u32_key keys[4];
 		} sel2;
+		memset(&sel2, 0, sizeof(sel2));
 		NEXT_ARG();
 		if (parse_selector(&argc, &argv, &sel2.sel, n)) {
 			fprintf(stderr, "Illegal \"sample\"\n");

PATCH 2
======

In tc, the u32 sample clause uses the 2.4 hashing algorithm. The hashing algorithm used by the kernel changed in 2.6; consequently "sample" hasn't worked since then. This patch makes the sample clause work for both 2.4 and 2.6:

diff -Nur iproute-20051007.keep/tc/f_u32.c iproute-20051007/tc/f_u32.c
--- iproute-20051007.keep/tc/f_u32.c	2006-01-12 17:34:37.000000000 +1000
+++ iproute-20051007/tc/f_u32.c	2006-02-07 17:10:29.000000000 +1000
@@ -17,6 +17,7 @@
 #include <syslog.h>
 #include <fcntl.h>
 #include <sys/socket.h>
+#include <sys/utsname.h>
 #include <netinet/in.h>
 #include <arpa/inet.h>
 #include <string.h>
@@ -874,6 +875,7 @@
 		htid = (handle&0xFFFFF000);
 	} else if (strcmp(*argv, "sample") == 0) {
 		__u32 hash;
+		struct utsname utsname;
 		struct {
 			struct tc_u32_sel sel;
 			struct tc_u32_key keys[4];
@@ -889,8 +891,19 @@
 			return -1;
 		}
 		hash = sel2.sel.keys[0].val&sel2.sel.keys[0].mask;
-		hash ^= hash>>16;
-		hash ^= hash>>8;
+		uname(&utsname);
+		if (strncmp(utsname.release, "2.4.", 4) == 0) {
+			hash ^= hash>>16;
+			hash ^= hash>>8;
+		}
+		else {
+			__u32 mask = sel2.sel.keys[0].mask;
+			while (mask && !(mask & 1)) {
+				mask >>= 1;
+				hash >>= 1;
+			}
+			hash &= 0xFF;
+		}
 		htid = ((hash<<12)&0xFF000)|(htid&0xFFF00000);
 		sample_ok = 1;
 		continue;

PATCH 3
======

"tc" does not allow you to specify the divisor for the "sample" clause; it always assumes a divisor of 256. If the divisor isn't 256 (ie it is something less), the kernel will usually whinge because the bucket given to it by "tc" is typically too big.
This patch adds a "divisor" option to tc's "sample" clause:

diff -Nur iproute-20051007.keep/tc/f_u32.c iproute-20051007/tc/f_u32.c
--- iproute-20051007.keep/tc/f_u32.c	2006-02-10 11:40:16.000000000 +1000
+++ iproute-20051007/tc/f_u32.c	2006-02-10 11:47:14.000000000 +1000
@@ -35,7 +35,7 @@
 	fprintf(stderr, "or u32 divisor DIVISOR\n");
 	fprintf(stderr, "\n");
 	fprintf(stderr, "Where: SELECTOR := SAMPLE SAMPLE ...\n");
-	fprintf(stderr, "       SAMPLE := { ip | ip6 | udp | tcp | icmp | u{32|16|8} | mark } SAMPLE_ARGS\n");
+	fprintf(stderr, "       SAMPLE := { ip | ip6 | udp | tcp | icmp | u{32|16|8} | mark } SAMPLE_ARGS [divisor DIVISOR]\n");
 	fprintf(stderr, "       FILTERID := X:Y:Z\n");
 }
@@ -835,7 +835,7 @@
 		unsigned divisor;
 		NEXT_ARG();
 		if (get_unsigned(&divisor, *argv, 0) || divisor == 0 ||
-		    divisor > 0x100) {
+		    divisor > 0x100 || (divisor - 1 & divisor)) {
 			fprintf(stderr, "Illegal \"divisor\"\n");
 			return -1;
 		}
@@ -875,6 +875,7 @@
 		htid = (handle&0xFFFFF000);
 	} else if (strcmp(*argv, "sample") == 0) {
 		__u32 hash;
+		unsigned divisor = 0x100;
 		struct utsname utsname;
 		struct {
 			struct tc_u32_sel sel;
@@ -890,6 +891,15 @@
 			fprintf(stderr, "\"sample\" must contain exactly ONE key.\n");
 			return -1;
 		}
+		if (*argv != 0 && strcmp(*argv, "divisor") == 0) {
+			NEXT_ARG();
+			if (get_unsigned(&divisor, *argv, 0) || divisor == 0 ||
+			    divisor > 0x100 || (divisor - 1 & divisor)) {
+				fprintf(stderr, "Illegal sample \"divisor\"\n");
+				return -1;
+			}
+			NEXT_ARG();
+		}
 		hash = sel2.sel.keys[0].val&sel2.sel.keys[0].mask;
 		uname(&utsname);
 		if (strncmp(utsname.release, "2.4.", 4) == 0) {
@@ -904,7 +913,7 @@
 			}
 			hash &= 0xFF;
 		}
-		htid = ((hash<<12)&0xFF000)|(htid&0xFFF00000);
+		htid = ((hash%divisor)<<12)|(htid&0xFFF00000);
 		sample_ok = 1;
 		continue;
 	} else if (strcmp(*argv, "indev") == 0) {
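The "(divisor - 1 & divisor)" test added above rejects any divisor that is not a power of two: a power of two has exactly one bit set, so subtracting 1 and ANDing with the original value clears it to zero, while any other value leaves bits behind. A minimal standalone sketch of that check (not part of the patch; the helper name is made up for illustration):

#include <stdio.h>

/* Mirrors the validation in the patch: non-zero, at most 256,
 * and a power of two. */
static int valid_sample_divisor(unsigned divisor)
{
	return divisor != 0 && divisor <= 0x100 &&
	       ((divisor - 1) & divisor) == 0;
}

int main(void)
{
	unsigned tests[] = { 0, 1, 16, 100, 256, 257 };
	unsigned i;

	for (i = 0; i < sizeof(tests)/sizeof(tests[0]); i++)
		printf("%3u -> %s\n", tests[i],
		       valid_sample_divisor(tests[i]) ? "ok" : "rejected");
	return 0;	/* 1, 16 and 256 pass; 0, 100 and 257 are rejected */
}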
On Sat, 2006-03-11 at 08:11 -0500, jamal wrote:
> > On my machine tc does not parse filter "sample" for the u32
> > filter. Eg:
> >
> > tc filter add dev eth2 parent 1:0 protocol ip prio 1 u32 ht 801: \
> >     classid 1:3 \
> >     sample ip protocol 1 0xff match ip protocol 1 0xff
> > Illegal "sample"
>
> The syntax is correct but ht 801: must exist - that is why it is being
> rejected.. So there is absolutely categorically _no way in hell_ your
> memset would have fixed it ;-> Apologies for being overdramatic ;->

You are wrong on both counts.

Firstly, the error message came from tc when it parsed the line. If tc gets an error talking to the kernel it says, among other things: "We have an error talking to the kernel". Ergo, it hadn't given the command to the kernel yet. This is significant, because the only place that knows whether ht 801: has been created or not is the kernel, so the error message can't depend on whether the table had been created. As it happens I did create the ht before issuing the command shown when I first struck the bug - but I didn't include that in my example because it was not relevant.

As for _no way in hell_ - there are apparently more ways in hell than you are aware of. If you look at the function pack_key() in tc/f_u32.c, you will see it assumes the "sel" parameter it is passed is initialised. Without the added memset() it isn't - it just contains random crap. Of course on some machines (perhaps yours?) that random crap might be 0, and then it would work. That is why I said at the start of my patch "On my machine tc does not ...". On other machines the bug may not appear.

> sample never worked 100% of the time with that hash. It works _most_ of
> the time with 256 buckets. In fact it will work some of the time as it is
> right now with 2.6.x. Can you post the output of tc -s filter ls on 2.6
> with and without your user space change?

Re: "sample never worked 100% of the time with that hash". Can you give an example? As far as I know it always worked.

Re: "it will work some of the time as it is right now with 2.6.x". Yes - it will work when you are sampling one byte. I am sampling port numbers, which are two bytes. It will not work in any case where there are two non-zero bytes.

Re: "Can you post the output of tc -s filter ls on 2.6 with and without your user space change?". Here it is.

With my change:

tc qdisc add dev eth0 root handle 1: htb
tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 801: divisor 256
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample tcp src 0x369 0xffff match tcp src 0x0369 0xffff
tc -s filter show dev eth0
filter parent 1: protocol ip pref 1 u32
filter parent 1: protocol ip pref 1 u32 fh 801: ht divisor 256
filter parent 1: protocol ip pref 1 u32 fh 801:3:800 order 2048 key ht 801 bkt 3 flowid 1:
  match 03690000/ffff0000 at nexthdr+0
filter parent 1: protocol ip pref 1 u32 fh 800: ht divisor 1

On the original "tc" shipped with debian "sarge":

tc qdisc add dev eth0 root handle 1: htb
tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 801: divisor 256
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample tcp src 0x369 0xffff match tcp src 0x0369 0xffff
Illegal "sample"

Ooops.
Looks like I can't get out of this without patching and compiling:

tc qdisc add dev eth0 root handle 1: htb
tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 801: divisor 256
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample tcp src 0x369 0xffff match tcp src 0x0369 0xffff
tc -s filter show dev eth0
filter parent 1: protocol ip pref 1 u32
filter parent 1: protocol ip pref 1 u32 fh 801: ht divisor 256
filter parent 1: protocol ip pref 1 u32 fh 801:6a:800 order 2048 key ht 801 bkt 6a flowid 1:
  match 03690000/ffff0000 at nexthdr+0
filter parent 1: protocol ip pref 1 u32 fh 800: ht divisor 1

Note: this also answers your request in your other email re: "can you give me an example that doesn't work".

> Heres how you should fix this:
> Patch 1) fix kernel 2.4.x to be like 2.6.x
> Patch 2) fix iproute2 to have the same syntax as for 2.6 and put a big
> bold note in the code that if anyone changes that part of the code to
> look at the kernel u32_hash_fold() routine.
> No need for the utsname check.

Why do it that way? If you want to put the 2.6 hashing algorithm in 2.4 then go ahead - but that is a separate decision which is not related to the issue of making tc backwards compatible. Making tc work with all versions of the kernel is what I am doing here. As an example of why this is a good idea, Debian ships 2.4 and 2.6 kernels, and one version of tc. That tc should work with both kernels.

Finally, regarding which hashing algorithm is better, your results differ from mine. First a bit of background: I am trying to make VOIP work over some 256/64 and 512/128 links that carry all sorts of other traffic. The patches you see here are a result of me trying to make that work over a range of sites (companies and home setups). The end result is that it did work, better than I expected it to. On a 512/128 ADSL link, I can:

a. Saturate the incoming direction with a large wget,
b. Saturate the outgoing direction with a large wget,
c. Hit the incoming direction with a "while :; do wget http://www.google.com/ -O /dev/null; done".
d. Hit the outgoing direction with an external box doing the same while loop.

While all that is going on, I can sustain 2 VOIP calls with perfect clarity on that link. I wasn't expecting to be able to pull that off. To pull it off I created the u32 filter from hell. This long winded explanation is to forestall the inevitable "why the hell would you want to do that" flame when you see the script that follows.

You can find the shell script that creates the filter here:

  http://www.stuart.id.au/russell/files/tc/setup-traffic-control.sh

The script itself is not that important. What is important is that it is a real-life use of u32, with approx 1200 hashed u32 filter lines. So, how to measure how good a job the hash is doing? The easiest would seem to be a least squares fit, ie:

  Number of elements to be hashed       = E
  Number of buckets                     = B (== 256)
  Optimal number of elements per bucket = E/B
  Hash function "goodness" metric       = M
    M = sqrt(sum over each bucket i of (NrOfElementsInBucket(i) - E/B)^2 / B)

For a good hash function M < 1. The bigger M is, the worse the hash function is. I wrote a python program to compute M for my u32 filter.
The python program can be found here:

  http://www.stuart.id.au/russell/files/tc/m.py

The dataset it operates on can be found here:

  http://www.stuart.id.au/russell/files/tc/tc-ports

Results:

  tcp 2.6: E=534 M=2.35
  tcp 2.4: E=534 M=0.82
  udp 2.6: E=711 M=2.91
  udp 2.4: E=711 M=0.92

As you can probably tell, I see the new hash function in 2.6 as a backward step - not an improvement.
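For anyone who wants to play with the metric without the real dataset, here is a minimal C sketch of M as defined above (the toy distribution is made up purely to exercise the function; it is not my port data):

#include <math.h>
#include <stdio.h>

#define NBUCKETS 256

/* M = sqrt(sum over buckets of (count[i] - E/B)^2 / B) */
static double goodness(const unsigned count[NBUCKETS], unsigned nelems)
{
	double optimal = (double)nelems / NBUCKETS;	/* E/B */
	double sum = 0.0;
	int i;

	for (i = 0; i < NBUCKETS; i++) {
		double d = (double)count[i] - optimal;
		sum += d * d;
	}
	return sqrt(sum / NBUCKETS);
}

int main(void)
{
	unsigned count[NBUCKETS] = { 0 };
	int i;

	/* Toy distribution: 512 elements crammed into 16 of the 256
	 * buckets - the kind of clustering a bad fold produces. */
	for (i = 0; i < 16; i++)
		count[i * 16] = 32;
	printf("M = %.2f\n", goodness(count, 512));	/* prints 7.75 */
	return 0;
}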
jamal wrote:
> On Mon, 2006-13-03 at 14:44 +1000, Russell Stuart wrote:
>
>> You are wrong on both counts.
>
> I am wrong on why it is being rejected - but what you are seeing is
> worse than I thought initially.
>
> Lets put it this way:
> The only way you will _ever_ get that message is if you had made a syntax
> error (which you did not). Please look at the code on where that message
> appears and:
>
> a) tell me how you would have got that message to begin with using
> perfectly legal syntax.
> b) tell me how a memset would have fixed that.

He already told you: pack_key expects the selector to be initialized, otherwise nkeys might contain a value >= 128, which would cause exactly this error if a matching key is not found within the uninitialized memory by accident.

> Just send the memset fix to Stephen with a different reason. Your
> current reason is _wrong_ and I really dont have much time to have this
> kind of discussion.
> If you had said "I added that memset there because it looks like the
> right thing to do" then we would not have had this discussion.
>
> You made claims you fixed a bug. It cant possibly be the bug you fixed.
> Was it some other bug perhaps and you mixed up the two?

The patch as well as the description are perfectly fine.

BTW, running valgrind on tc shows lots of uses of uninitialized values; it seems like a good idea if someone would go over these and fix them up.
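To make the failure mode concrete, here is a simplified standalone illustration (the struct below is a stand-in, not tc's real tc_u32_sel; reading the uninitialized copy is deliberately the same undefined, machine-dependent behaviour being reported):

#include <stdio.h>
#include <string.h>

struct sel { int nkeys; };		/* stand-in for tc_u32_sel */

static int add_key(struct sel *s)
{
	if (s->nkeys >= 4)		/* key array looks "full" */
		return -1;		/* -> the Illegal "sample" path */
	s->nkeys++;
	return 0;
}

int main(void)
{
	struct sel garbage;		/* uninitialized, like sel2 before the fix */
	struct sel clean;

	memset(&clean, 0, sizeof(clean));	/* the PATCH 1 fix */
	printf("clean:   %s\n", add_key(&clean) ? "rejected" : "ok");
	/* Whether this succeeds depends on whatever happened to be on
	 * the stack - it works on some machines and fails on others. */
	printf("garbage: %s\n", add_key(&garbage) ? "rejected" : "ok");
	return 0;
}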
On Mon, 13 Mar 2006 18:55:35 +0100
Patrick McHardy <kaber@trash.net> wrote:

> jamal wrote:
> > On Mon, 2006-13-03 at 14:44 +1000, Russell Stuart wrote:
> >
> >> You are wrong on both counts.
> >
> > I am wrong on why it is being rejected - but what you are seeing is
> > worse than I thought initially.
> >
> > Lets put it this way:
> > The only way you will _ever_ get that message is if you had made a syntax
> > error (which you did not). Please look at the code on where that message
> > appears and:
> >
> > a) tell me how you would have got that message to begin with using
> > perfectly legal syntax.
> > b) tell me how a memset would have fixed that.
>
> He already told you: pack_key expects the selector to be initialized,
> otherwise nkeys might contain a value >= 128, which would cause
> exactly this error if a matching key is not found within the
> uninitialized memory by accident.
>
> > Just send the memset fix to Stephen with a different reason. Your
> > current reason is _wrong_ and I really dont have much time to have this
> > kind of discussion.
> > If you had said "I added that memset there because it looks like the
> > right thing to do" then we would not have had this discussion.
> >
> > You made claims you fixed a bug. It cant possibly be the bug you fixed.
> > Was it some other bug perhaps and you mixed up the two?

The memset fix is in current CVS. I just wasn't going to take the patch that looked at utsname to decide what hash to use.

> The patch as well as the description are perfectly fine.
>
> BTW, running valgrind on tc shows lots of uses of uninitialized values,
> it seems like a good idea if someone would go over these and fix them
> up.

If we had a test script of commands (code coverage), that would help.
Stephen Hemminger wrote:
> The memset fix is in current CVS. I just wasn't going to take the
> patch that looked at utsname to decide what hash to use.

Yes, that sucks. We have another incompatibility: old iproute versions (like the one shipped by Debian) show garbage statistics for HFSC. I think it still works properly, but I wasn't able to track down the cause. I have a feeling it's somehow related to the gen_stats changes.

>> BTW, running valgrind on tc shows lots of uses of uninitialized values,
>> it seems like a good idea if someone would go over these and fix them
>> up.
>
> If we had a test script of commands (code coverage), that would help.

Getting good coverage looks like a lot of work, but I'll put it on my TODO list for a boring day :)
Stephen Hemminger wrote:
>> BTW, running valgrind on tc shows lots of uses of uninitialized values,
>> it seems like a good idea if someone would go over these and fix them
>> up.
>
> If we had a test script of commands (code coverage), that would help.

Actually the Coverity scanner is quite good at spotting these problems, so I've requested iproute to be added to the list of regularly scanned projects.
On Mon, 2006-03-13 at 10:04 -0800, Stephen Hemminger wrote:
> The memset fix is in current CVS. I just wasn't going to take the
> patch that looked at utsname to decide what hash to use.

Stephen, could you describe your objections to it please?
On Tue, 14 Mar 2006 07:43:57 +1000
Russell Stuart <russell-lartc@stuart.id.au> wrote:

> On Mon, 2006-03-13 at 10:04 -0800, Stephen Hemminger wrote:
> > The memset fix is in current CVS. I just wasn't going to take the
> > patch that looked at utsname to decide what hash to use.
>
> Stephen, could you describe your objections to it please?

Because it means that the API for netlink isn't being used properly.
On Mon, 2006-03-13 at 13:50 -0800, Stephen Hemminger wrote:
> On Tue, 14 Mar 2006 07:43:57 +1000
> Russell Stuart <russell-lartc@stuart.id.au> wrote:
>
> > On Mon, 2006-03-13 at 10:04 -0800, Stephen Hemminger wrote:
> > > The memset fix is in current CVS. I just wasn't going to take the
> > > patch that looked at utsname to decide what hash to use.
> >
> > Stephen, could you describe your objections to it please?
>
> Because it means that the API for netlink isn't being used properly.

Do you mean the binary API between the kernel and the applications has been broken; this is bad as it transgresses some unwritten law; and the patch is bad because it hides this problem rather than addressing it? Does the fact that the binary API changed during a major kernel release (between 2.4 and 2.6) give us some wriggle room here?

Anyway, jokes aside, the situation we have now is that the current "tc" doesn't work with the current kernel. This situation can't be allowed to continue, I presume. Ergo, here is a list of things we could do to fix it. All you (plural) have to do is choose one, or some combination:

1. Change the hashing algorithm back to the 2.4 way. (IMHO, the 2.4
   algorithm was better.) Disadvantage: anybody who had a u32 filter
   that hashed on more than one non-zero byte will have their u32
   filters suddenly break. However, since the hashing algorithm was
   never documented beyond HOWTOs that showed how to hash on one byte,
   and every example of hashing I have seen has been a copy & paste of
   said HOWTOs, my guess is there are stuff-all people who will be
   affected. Another disadvantage: people who are using older 2.6
   kernels (eg me on my production machines) won't have the problem
   fixed.

2. Change the hashing algorithm in "tc" to match the current kernel.
   Disadvantage: "tc" will no longer work with 2.4 kernels.

3. Change the hashing algorithm in "tc" to match the current kernel,
   and change the 2.4 kernel's hashing algorithm to match the 2.6
   kernel. This is Jamal's proposed solution. Disadvantage: 2.4 is a
   "stable" kernel, produced when we made a real effort to distinguish
   between stable and development kernels. This change would break API
   compatibility in a stable series. Yuk!

4. Make "tc" look at the kernel version, and choose the appropriate
   hashing function. This was my solution, and everybody seems to hate
   it bar me :(. Disadvantages: none, other than it hides a violation
   of an "unwritten law".

5. A combination of 1 & 4. Change the 2.6 hashing algorithm back to
   what it was in 2.4, and hide the change by making "tc" check the
   kernel version and choose the matching hash. Disadvantages: none,
   other than now we have wilfully broken the unwritten law twice.

My personal preference is to have a "tc" in CVS that works with _all_ kernel versions. Yes - the netlink interface has been broken. But it was done, and now can't be undone. No matter what we do, there will forever more be kernel versions with incompatible netlink interfaces. We can't fix it. We just have to limit the damage.
On Sat, 2006-03-11 at 08:11 -0500, jamal wrote:
> On Fri, 2006-10-02 at 12:33 +1000, Russell Stuart wrote:
> > This patch adds a "divisor" option to tc's "sample"
> > clause:
>
> While this looks right - can we have more test data with tc filter ls
> both before and after your fix? Specify divisor of 256 and 16 for
> example. Show that for the 256 it is the same as before and for 16 it
> does the right thing.

With patch, divisor 256:

tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: htb
tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 801: divisor 256
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x01 0xff divisor 256 match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x0f 0xff divisor 256 match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x10 0xff divisor 256 match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0xef 0xff divisor 256 match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0xf0 0xff divisor 256 match u32 0 0
tc -s filter show dev eth0
filter parent 1: protocol ip pref 1 u32
filter parent 1: protocol ip pref 1 u32 fh 801: ht divisor 256
filter parent 1: protocol ip pref 1 u32 fh 801:1:800 order 2048 key ht 801 bkt 1 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:f:800 order 2048 key ht 801 bkt f flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:10:800 order 2048 key ht 801 bkt 10 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:ef:800 order 2048 key ht 801 bkt ef flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:f0:800 order 2048 key ht 801 bkt f0 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 800: ht divisor 1

Without patch, divisor 256:

tc qdisc add dev eth0 root handle 1: htb
tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 801: divisor 256
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x01 0xff match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x0f 0xff match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x10 0xff match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0xef 0xff match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0xf0 0xff match u32 0 0
tc -s filter show dev eth0
filter parent 1: protocol ip pref 1 u32
filter parent 1: protocol ip pref 1 u32 fh 801: ht divisor 256
filter parent 1: protocol ip pref 1 u32 fh 801:1:800 order 2048 key ht 801 bkt 1 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:f:800 order 2048 key ht 801 bkt f flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:10:800 order 2048 key ht 801 bkt 10 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:ef:800 order 2048 key ht 801 bkt ef flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:f0:800 order 2048 key ht 801 bkt f0 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 800: ht divisor 1

With patch, divisor 16:

tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: htb
tc filter add dev eth0 parent 1:0 prio 1 protocol ip u32 ht 801: divisor 16
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x01 0xff divisor 16 match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x0f 0xff divisor 16 match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0x10 0xff divisor 16 match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0xef 0xff divisor 16 match u32 0 0
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ht 801: classid 1:0 sample u8 0xf0 0xff divisor 16 match u32 0 0
tc -s filter show dev eth0
filter parent 1: protocol ip pref 1 u32
filter parent 1: protocol ip pref 1 u32 fh 801: ht divisor 16
filter parent 1: protocol ip pref 1 u32 fh 801::800 order 2048 key ht 801 bkt 0 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801::801 order 2049 key ht 801 bkt 0 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:1:800 order 2048 key ht 801 bkt 1 flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:f:800 order 2048 key ht 801 bkt f flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 801:f:801 order 2049 key ht 801 bkt f flowid 1:
  match 00000000/00000000 at 0
filter parent 1: protocol ip pref 1 u32 fh 800: ht divisor 1
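As a sanity check, the divisor 16 buckets above can be predicted from the patched computation alone. A small sketch (not from the patch set, though the fold and modulo are copied from it; I am assuming tc packs a "u8 X 0xff" sample into the top byte of the 32-bit key, which doesn't change the result since the fold right-aligns it anyway):

#include <stdio.h>

static unsigned bucket(unsigned val, unsigned mask, unsigned divisor)
{
	unsigned hash = val & mask;

	/* 2.6-style fold from PATCH 2 */
	while (mask && !(mask & 1)) {
		mask >>= 1;
		hash >>= 1;
	}
	hash &= 0xFF;
	return hash % divisor;		/* bucket index, as in PATCH 3 */
}

int main(void)
{
	unsigned samples[] = { 0x01, 0x0f, 0x10, 0xef, 0xf0 };
	int i;

	for (i = 0; i < 5; i++)
		printf("sample u8 0x%02x 0xff -> bkt %x\n", samples[i],
		       bucket(samples[i] << 24, 0xff000000u, 16));
	return 0;	/* buckets 1, f, 0, f, 0 - matching the output above */
}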
On Sat, 2006-03-11 at 10:56 -0500, jamal wrote:
> Right - take a look at the use of hashkey with varying divisors to see
> where the 2.4 folding breaks[1]. Note you should be able to use hashkey
> instead of sample and it would work fine.

<snip>

> [1] Essentially, if you teach u32 in 2.4 to spread rules over different
> buckets it will not do so evenly. Off top of my head i remember that the
> best you could ever do is have bucket selection in increments of 4 (so
> bucket 0, 4, 8, ...) with certain divisor sizes - mostly works with 256;
> to work properly you need it to be (0, 1, 2, 3, ...).
> implies --> When it does work, it is detrimental to lookup speed when
> you hit thousands of rules because some buckets get overloaded and
> others are totally unused, and when it doesnt work, it results in
> classifier misses for rules you inserted.

Hmm. I can't see how this could be so. Can you give specific examples? The only time I can think of where the 2.4 XOR hash would be worse is where there is a correlation between the bits in different bytes. I can't think of a real life example of where that would be so. In every other case it will be as good as, but typically better than, the 2.6 algorithm.
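To make the one byte vs two byte point concrete, here is a small sketch contrasting the two folds on a pair of ports that differ only in their high byte (the folds are transcribed from the patches earlier in this thread; the port values are arbitrary):

#include <stdio.h>

static unsigned fold_2_4(unsigned hash)
{
	hash ^= hash >> 16;
	hash ^= hash >> 8;
	return hash & 0xff;
}

static unsigned fold_2_6(unsigned hash, unsigned mask)
{
	while (mask && !(mask & 1)) {
		mask >>= 1;
		hash >>= 1;
	}
	return hash & 0xff;
}

int main(void)
{
	/* Two ports with equal low bytes, packed the way tc packs
	 * "tcp src": value in the top 16 bits, mask 0xffff0000. */
	unsigned ports[] = { 0x0169, 0x0369 };
	int i;

	for (i = 0; i < 2; i++) {
		unsigned key = ports[i] << 16;
		printf("port 0x%04x: 2.4 bkt %02x, 2.6 bkt %02x\n",
		       ports[i], fold_2_4(key), fold_2_6(key, 0xffff0000u));
	}
	return 0;	/* 2.6 gives 0x69 for both (a collision);
			   2.4 gives 0x68 and 0x6a */
}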
<snip>Much discussion bashing this issue to death.</snip>

(sorry, jamal - this one is CC'ed to lartc.)

Here are revised versions of the 2 as yet unapplied patches.

PATCH 1
======

[Has been applied.]

PATCH 2
======

In tc, the u32 sample clause uses the 2.4 hashing algorithm. The hashing algorithm used by the kernel changed in 2.6; consequently "sample" hasn't worked since then. This patch makes the sample clause work with 2.6 only. This is different from the prior version of the patch, which made it work with both 2.4 and 2.6:

diff -Nur iproute-20051007.keep/tc/f_u32.c iproute-20051007/tc/f_u32.c
--- iproute-20051007.keep/tc/f_u32.c	2006-03-14 12:28:17.000000000 +1000
+++ iproute-20051007/tc/f_u32.c	2006-03-16 11:08:25.000000000 +1000
@@ -888,8 +888,18 @@
 			return -1;
 		}
 		hash = sel2.sel.keys[0].val&sel2.sel.keys[0].mask;
+#if 0
+		/* 2.2 .. 2.4 hashing algorithm */
 		hash ^= hash>>16;
 		hash ^= hash>>8;
+#else
+		/* 2.5 onwards */
+		__u32 mask = sel2.sel.keys[0].mask;
+		while (mask && !(mask & 1)) {
+			mask >>= 1;
+			hash >>= 1;
+		}
+#endif
 		htid = ((hash<<12)&0xFF000)|(htid&0xFFF00000);
 		sample_ok = 1;
 		continue;

PATCH 3
======

"tc" does not allow you to specify the divisor for the "sample" clause; it always assumes a divisor of 256. If the divisor isn't 256 (ie it is something less), the kernel will usually whinge because the bucket given to it by "tc" is typically too big. This patch adds a "divisor" option to tc's "sample" clause. This is identical to the previous version of the patch, other than it now applies correctly after the revised PATCH 2.

diff -Nur iproute-20051007.keep/tc/f_u32.c iproute-20051007/tc/f_u32.c
--- iproute-20051007.keep/tc/f_u32.c	2006-03-16 11:27:17.000000000 +1000
+++ iproute-20051007/tc/f_u32.c	2006-03-16 11:29:56.000000000 +1000
@@ -34,7 +34,7 @@
 	fprintf(stderr, "or u32 divisor DIVISOR\n");
 	fprintf(stderr, "\n");
 	fprintf(stderr, "Where: SELECTOR := SAMPLE SAMPLE ...\n");
-	fprintf(stderr, "       SAMPLE := { ip | ip6 | udp | tcp | icmp | u{32|16|8} | mark } SAMPLE_ARGS\n");
+	fprintf(stderr, "       SAMPLE := { ip | ip6 | udp | tcp | icmp | u{32|16|8} | mark } SAMPLE_ARGS [divisor DIVISOR]\n");
 	fprintf(stderr, "       FILTERID := X:Y:Z\n");
 }
@@ -834,7 +834,7 @@
 		unsigned divisor;
 		NEXT_ARG();
 		if (get_unsigned(&divisor, *argv, 0) || divisor == 0 ||
-		    divisor > 0x100) {
+		    divisor > 0x100 || (divisor - 1 & divisor)) {
 			fprintf(stderr, "Illegal \"divisor\"\n");
 			return -1;
 		}
@@ -874,6 +874,7 @@
 		htid = (handle&0xFFFFF000);
 	} else if (strcmp(*argv, "sample") == 0) {
 		__u32 hash;
+		unsigned divisor = 0x100;
 		struct {
 			struct tc_u32_sel sel;
 			struct tc_u32_key keys[4];
@@ -888,6 +889,15 @@
 			fprintf(stderr, "\"sample\" must contain exactly ONE key.\n");
 			return -1;
 		}
+		if (*argv != 0 && strcmp(*argv, "divisor") == 0) {
+			NEXT_ARG();
+			if (get_unsigned(&divisor, *argv, 0) || divisor == 0 ||
+			    divisor > 0x100 || (divisor - 1 & divisor)) {
+				fprintf(stderr, "Illegal sample \"divisor\"\n");
+				return -1;
+			}
+			NEXT_ARG();
+		}
 		hash = sel2.sel.keys[0].val&sel2.sel.keys[0].mask;
 #if 0
 		/* 2.2 .. 2.4 hashing algorithm */
@@ -901,7 +911,7 @@
 			hash >>= 1;
 		}
 #endif
-		htid = ((hash<<12)&0xFF000)|(htid&0xFFF00000);
+		htid = ((hash%divisor)<<12)|(htid&0xFFF00000);
 		sample_ok = 1;
 		continue;
 	} else if (strcmp(*argv, "indev") == 0) {
On Thu, 2006-03-16 at 08:51 -0500, jamal wrote:
> > BTW, in this example, the new hash I suggested would be as good
> > as the 2.6 case.
>
> Yes, if you used 256 buckets per hash table ;->

No - you haven't understood what the new algorithm does. It will get the same performance as the 2.6 version with the same memory. In fact for all cases where the number of bits used is <= 8 it is _identical_ to the 2.6 algorithm.

> Wait till you start getting into V6. Or try

It is odd that you keep mentioning IPv6. The IPv6 header has no fields that are less than 8 bits, and there are none that are not aligned on an 8 bit boundary. In fact even within the IPv6 addresses there are no sub-fields less than 8 bits. It would be a happy hunting ground for the 2.4 algorithm.

> building an effective fast lookup scheme for a five tuple lookup using
> u32; five tuples being {src/dst IP, IP protocol, src/dst port}
> So in simple setups where you say dont exceed a few hundred hash tables,
> and a few hundred to thousand filters, the old hash was fine.

> > Now lets take the case that we are hashing a number of bytes with
> > a 256 divisor (my case). If these bytes contain truly random
> > values, then again 2.4 and 2.6 will be the same.
>
> But they are not.

Well, this is the crux of the matter, isn't it? You are apparently used to looking up well known patterns - probably ones you determine. As such, you can arrange these in nice logical ranges. Guess what? The very thing that started this off was me looking up TCP & UDP port numbers. I didn't determine those port numbers. They are all over the shop - for me they are effectively random data. And they are sparse too, as in they occupy 16 bits. The whole point of the rant that followed was to explain to you why in a case like that the 2.6 hash runs substantially slower than the 2.4 one. What's more, it can't be fixed by simply trading off more memory. But you seem to be burying your head in the sand and saying "such data sets don't exist". Well they do. And I gave you a real life one.

Here is another one: I have contemplated at times giving priority to Australian traffic. So I went and found myself a list of Australian IP addresses. Being allocated by some central authority, I expected some nice ordered data set. How naive. There are some 400 subnets, and after trying to find some pattern I gave up after an hour. Another random dataset. The 2.4 algorithm will handle that well. When you changed the hash algorithm, you optimised it for your particular world - at the expense of people who have data sets like mine. Note that users of 2.4 with your dataset have a way out - waste a bit of memory and it will run just as fast. With a large dataset and 2.6 there is no way out. You are probably doubting this as you read it - I explain why below.

> > The 2.4 XOR's
> > the two values together. XOR has the property that it "adds"
> > the randomness of the bits together, unless they are correlated.
> > So if you take two partially random bits, and XOR them together,
> > then the resulting bit will be more random than the original two
> > bits. An illustration of this from crypto is a stream cypher
> > like rc4. rc4 effectively produces a random stream of bits.
> > To use rc4, you XOR your plain text with this random stream.
> > Even though your plain text is highly non-random, the cypher
> > text is at least as random as the rc4 stream - and thus looks
> > like gibberish. Anyway, the end result for 2.4 is that if you
> > XOR two perfectly random bytes, the result is a perfectly random
> > byte. So for random data 2.6 and 2.4 are the same.
>
> To use a term which makes sense up here, you are treading on thin ice
> now ;-> You dont wanna skate on this river ;->
>
> Randomness has nothing to do with this. It doesnt matter whether
> random traffic patterns arrive at all.
> - assume that in 8 bits of data, 6 bits provide the most entropy.
> - assume 256 hosts (just to cover the range of the 8 bits)
>
> a) If i built a hash table with 256 buckets, i can guarantee that
> i will find any host using either scheme in 4 lookups.
> Except 75% of the buckets will not be used by either.

You miss one point. While you can guarantee it will happen in 4 lookups, 2.4 averages slightly more than 1 lookup if it hashes the entire value in one hit. In 2.6, you are stuck with your 4 lookups, as you say.

> b) If i built a hash table with 128 buckets, I can guarantee i will
> find any host in 4 lookups with the new scheme but it will take
> a best case of 8 lookups with the old scheme.
> Except 50% of the buckets will not be used by the new scheme and
> 75% not be used by the old scheme.
>
> c) if i built a hash table with 64 buckets, I can guarantee that i will
> find any host in 4 lookups with the new scheme and 16 in the old scheme.
> 100% of the buckets will be used by the new scheme; only 25% will be
> used by the old scheme.
>
> d) If i built a hash table of 32 buckets, I can guarantee that i will
> find any host in 8 lookups with new scheme and _32_ with old.
> 100% used by the new scheme and 25% by old
>
> See the pattern? But this is not the worst part. The nasty part
> is if i built a newer level of hash tables so i can reduce the lookups,
> it totally reduces my effectiveness; i need to figure out which buckets
> are being hit etc.

The pattern is based on the (unstated) premise that you are dropping the least significant bits in the byte in your hashing function. Perhaps you are assuming you know where the random bits live in the address. I wasn't so lucky with my dataset. However, lets go with the original premise - the data is truly random, and you don't know where the bits are. When you drop the least significant bits 2.4 behaves badly. But since the data is random, you are better off dropping the upper bits when using 2.4. If you do that then the 2.4 and 2.6 hashes perform identically when used as you describe above.

In other words, the example didn't show what you intended it to show. It showed that if you do the optimal thing for 2.6 and use a tree of hash tables then 2.4 and 2.6 can be made to perform identically. If you do the optimal thing for 2.4, then 2.4 can be made to run almost 4 times faster than 2.6, on average.

> You have not convinced me that, in the case of multibyte, the old one is
> good for anything other than the one example you had with a fixed mask
> and fixed number of buckets.
> I hope you see the value of varying the parameters i mentioned now (it
> should be very clear on the hash bucket count and mask i hope).
> Sorry, but a lot more work is needed and you seem to be in a rush to get
> there ;->

As you can probably gather, all we seem to have done here is concluded that there are two sorts of data sets - those that are nicely behaved like yours, and those that aren't, like mine. You seem to be denying mine exist, which is a pity. 2.6 works well for yours. 2.4 works well for mine.
Judging from these initial comments:

  russell: BTW, in this example, the new hash I suggested would be
           as good as the 2.6 case.
  jamal:   Yes, if you used 256 buckets per hash table.

I don't seem to have explained the new algorithm very well. Let me try again. There is nothing inherently new about it. It is just a combination of Alexey's 2.4 algorithm and your 2.6 version:

  2.4 Algorithm:            2.6 algorithm:         New Algorithm:
  --------------            --------------         --------------
  hash = (hash>>16)^hash;   hash = hash>>shift;    hash = hash>>shift;
  hash = (hash>>8)^hash;    return hash & 0xff;    hash = (hash>>16)^hash;
  return hash & 0xff;                              hash = (hash>>8)^hash;
                                                   return hash & 0xff;

How the new one performs:

(a) For all fields <= 8 bits, it is identical to 2.6.
(b) For all fields aligned on an 8 bit boundary, it is identical to 2.4.
(c) Importantly, on a single 8 bit aligned value, it is identical to
    both 2.4 and 2.6: it is the identity function. This keeps the
    cut & pasters happy.

Now everyone wins: those with nice orderly datasets like yours, and those with sparse, random data sets like mine. Why not change over to it?
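To save anyone transcribing the table, here is the same comparison as runnable C (my rendering; "shift" is assumed to be the shift that right-aligns the key's mask, as in the 2.6 fold earlier in the thread):

#include <stdio.h>

static unsigned hash_2_4(unsigned hash)
{
	hash = (hash >> 16) ^ hash;
	hash = (hash >> 8) ^ hash;
	return hash & 0xff;
}

static unsigned hash_2_6(unsigned hash, unsigned shift)
{
	hash = hash >> shift;
	return hash & 0xff;
}

static unsigned hash_new(unsigned hash, unsigned shift)
{
	hash = hash >> shift;
	hash = (hash >> 16) ^ hash;
	hash = (hash >> 8) ^ hash;
	return hash & 0xff;
}

int main(void)
{
	/* An 8 bit right-aligned field: all three are the identity,
	 * printing 45 45 45 - property (c) above. */
	printf("%02x %02x %02x\n",
	       hash_2_4(0x45), hash_2_6(0x45, 0), hash_new(0x45, 0));
	/* A 16 bit field: the new algorithm mixes both bytes like 2.4
	 * (6a 6a), while 2.6 keeps only the low byte (69). */
	printf("%02x %02x %02x\n",
	       hash_2_4(0x0369), hash_2_6(0x0369, 0), hash_new(0x0369, 0));
	return 0;
}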
On Fri, 2006-03-17 at 09:34 -0500, jamal wrote:
> - the 2.4 algorithm performs very poorly for < 8 bits if you use a
> standard mask for ALL cases except when you use a lot of memory, most of
> which is _wasted_, in which case it performs equally. There are only 8
> masks in an 8 bit field, ranging in length from 1 to 8. Should not be
> hard to prove or disprove. Should be very easy to see when you plot.

Agreed.

> - You made the claim the 2.6 algorithm performs poorly if you had
> a 16 bit mask. You showed one sample of data. I dont even remember your
> sample anymore because you keep bombarding me with a lot of claims.
> I gave you the parameters which will help convince me. I showed you a
> sample using similar parameters why the old one was buggy in the case of
> 8 bits. There are only 16 possible masks for 16 bits (of length 1-16).
> Why cant you just give me similar results? Why do we have to go back
> and forth with a lot of hand waving when we can settle this easily?

I guess there are a couple of points here I don't understand:

- I don't see how 2.4 was "buggy", but perhaps we have different
  definitions of buggy. I regard giving the wrong result as buggy.
  Neither algorithm does that.

- I don't understand your point about "there are only 16 possible
  masks for 16 bits". What do you want me to show?

> I will not respond to any further emails which do not contain
> data - instead I am going to produce mine.

Put the 2.4 vs 2.6 argument aside. The best solution is to adopt the "new" algorithm I proposed. Here is why:

1. It can emulate either the 2.4 or 2.6 algorithm by a suitable choice
   of mask. I can formally prove this if you need me to.

2. There is no dataset where the "new" algorithm is slower than either
   2.4 or 2.6. This follows from (1).

3. There are datasets where it is faster than the 2.6 algorithm, and
   datasets where it is faster than the 2.4 algorithm. This follows
   from the fact that there are datasets where 2.6 is faster than 2.4,
   and datasets where 2.4 is faster than 2.6. I think you know this
   already, but if not I can give some simple examples to prove it -
   just ask.

4. Timing. If we are going to change the algorithm, this is the time
   to do it - while the algorithm in "tc" is wrong and must be changed
   anyway.

For your convenience I have cut & pasted from the previous email the comparison of the 2.4, 2.6 and "new" algorithms:

  2.4 Algorithm:            2.6 algorithm:         New Algorithm:
  --------------            --------------         --------------
  hash = (hash>>16)^hash;   hash = hash>>shift;    hash = hash>>shift;
  hash = (hash>>8)^hash;    return hash & 0xff;    hash = (hash>>16)^hash;
  return hash & 0xff;                              hash = (hash>>8)^hash;
                                                   return hash & 0xff;
On Fri, 2006-03-17 at 09:34 -0500, jamal wrote:
> If you are unable to do this then I will. I just dont have time until
> this Sunday.
> I will not respond to any further emails which do not contain data -
> instead I am going to produce mine.

After that wrist-slap I spent some time putting together some data. I am still not really sure what you are after, so if this isn't it please let me know:

  http://ace-host.stuart.id.au/russell/files/tc/hash-analysis/

One other thing: I made a rather embarrassing error earlier. When I computed the metrics I posted about 2.4 and 2.6, I emulated the 2.6 hash incorrectly. Had it been correct, rather than showing a landslide win for 2.4, it would have shown that 2.6 was slightly better.