This patch changes HTB's class storage from hash+lists to a two-level linear array, so that it can do constant-time (O(1)) class lookup by classid. It improves scalability for large numbers of classes.

Without the patch, ~14k HTB classes can starve a Xeon-3.2 at only 15kpps, spending most of its cycles traversing lists in htb_find(). The patch eliminates this problem, and has a measurable impact even with a few hundred classes.

Previously, scalability could be improved by increasing HTB_HSIZE, modifying the hash function, and recompiling, but this patch works for everyone without a recompile and scales better too.

The patch is for 2.6.20-rc6; I have older ones for 2.6.18 and 2.6.19 if anyone is interested.

Signed-off-by: Simon Lodal <simonl@parknet.dk>
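In sketch form, the lookup boils down to the following (simplified; naming of the second-level index helpers is approximate and details differ in the actual patch). The 16-bit minor id is split into an index into a first-level array of page-sized tables plus an index into that table, so a lookup is two dereferences and then the htb_class itself:

#define HTB_MAX_CLS		(TC_H_MIN(-1) + 1)	/* 64k possible minor ids */
#define HTB_CLS_PER_ARRAY	(PAGE_SIZE / sizeof(void *))
#define HTB_CLS_ARRAY(minor)	((minor) / HTB_CLS_PER_ARRAY)	/* which table */
#define HTB_CLS_INDEX(minor)	((minor) % HTB_CLS_PER_ARRAY)	/* slot in table */

typedef struct {
	struct htb_class *ptr[HTB_CLS_PER_ARRAY];
} htb_cls_array;

/* in struct htb_sched: second-level tables, allocated on demand */
htb_cls_array *classes[HTB_MAX_CLS / HTB_CLS_PER_ARRAY];

static inline struct htb_class *htb_lookup_cls(struct htb_sched *q, u32 handle)
{
	u32 minor = TC_H_MIN(handle);
	htb_cls_array *a = q->classes[HTB_CLS_ARRAY(minor)];	/* 1st access */

	if (a == NULL)
		return NULL;			/* no classes in this range yet */
	return a->ptr[HTB_CLS_INDEX(minor)];			/* 2nd access */
}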
Simon Lodal wrote:

> This patch changes HTB's class storage from hash+lists to a two-level
> linear array, so that it can do constant-time (O(1)) class lookup by
> classid. It improves scalability for large numbers of classes.
>
> Without the patch, ~14k HTB classes can starve a Xeon-3.2 at only 15kpps,
> spending most of its cycles traversing lists in htb_find(). The patch
> eliminates this problem, and has a measurable impact even with a few
> hundred classes.
>
> Previously, scalability could be improved by increasing HTB_HSIZE,
> modifying the hash function, and recompiling, but this patch works for
> everyone without a recompile and scales better too.

I agree that the current fixed-size hashes (additionally quite small by default) are a big problem with many classes, for all of HTB/HFSC/CBQ. But I think your approach is a bit wasteful: with unfortunately chosen classids, 128 classes are enough to reach the maximum memory usage of ~512kb (with 4k pages and 8-byte pointers).

I have a patch for HFSC which introduces dynamic resizing of the class hash. I have planned to generalize it (similar to tcf_hashinfo) and convert HTB and CBQ as well, which as a nice side effect will allow getting rid of some duplicated code, like hash walking.

If you give me a few days I'll try to finish and post it.
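(For the record, the ~512kb figure assumes one page-sized second-level table is allocated per populated first-level slot:

    pointers per table  = 4096 / 8     = 512   (8-byte pointers)
    first-level slots   = 65536 / 512  = 128
    worst case          = 128 * 4096   = 512 KiB

i.e. classids whose minors are spaced 512 apart each land in a different table, so 128 such classes already allocate every table. With 4-byte pointers the same reasoning gives 1024 entries per table, 64 tables, 256 KiB.)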
On Thursday 01 February 2007 07:08, Patrick McHardy wrote:

> Simon Lodal wrote:
>> This patch changes HTB's class storage from hash+lists to a two-level
>> linear array, so that it can do constant-time (O(1)) class lookup by
>> classid. It improves scalability for large numbers of classes.
>>
>> Without the patch, ~14k HTB classes can starve a Xeon-3.2 at only 15kpps,
>> spending most of its cycles traversing lists in htb_find(). The patch
>> eliminates this problem, and has a measurable impact even with a few
>> hundred classes.
>>
>> Previously, scalability could be improved by increasing HTB_HSIZE,
>> modifying the hash function, and recompiling, but this patch works for
>> everyone without a recompile and scales better too.
>
> I agree that the current fixed-size hashes (additionally quite small by
> default) are a big problem with many classes, for all of HTB/HFSC/CBQ.
> But I think your approach is a bit wasteful: with unfortunately chosen
> classids, 128 classes are enough to reach the maximum memory usage of
> ~512kb (with 4k pages and 8-byte pointers).

I think it is a non-issue, since it does not happen in practice. Generally there are two ways to assign classids:

1) Manually, used when you have few classes. People usually use 100, 101, 102, 200, 201 etc (probably unaware that they are hex). With 4k pages and 32-bit pointers, everything below classid 400 is within the first page, which covers most "few classes" examples you can find lying around.

2) Script-generated, which in practice is required if you have many classes. The classids will then usually be consecutive, at least in large chunks, which means minimal memory waste, and an optimal case for plain linear lookup; hashing them can only be wasteful.

> I have a patch for HFSC which introduces dynamic resizing of the
> class hash. I have planned to generalize it (similar to tcf_hashinfo)
> and convert HTB and CBQ as well, which as a nice side effect will
> allow getting rid of some duplicated code, like hash walking.

I have not been looking into HFSC and CBQ; I was not aware that they have similar issues.

> If you give me a few days I'll try to finish and post it.

Memory is generally not an issue, but CPU is, and you can not beat the CPU efficiency of plain array lookup (always faster, and constant time).

If anything, I would find it more relevant to use array lookup with dynamic adjustment of the array size (HTB_CLS_ARRAY_SIZE in my patch): start out small to waste less memory, and grow up to PAGE_SIZE as needed. But then, it is probably too much effort for the gain (a few kB in machines that should have plenty of RAM anyway), and requires more code => more complexity, bugs, maintenance.

Regards
Simon
Simon Lodal <simonl@parknet.dk> writes:

> Memory is generally not an issue, but CPU is, and you can not beat the CPU
> efficiency of plain array lookup (always faster, and constant time).

Actually that's not true when the array doesn't fit in cache.

The cost of going out to memory over the caches is so large (a factor of 100 and more) that an algorithm with a smaller cache footprint can easily beat an algorithm that executes far fewer instructions, if it takes fewer cache misses. Not all instructions have the same cost: anything in core is very fast, but going out to memory is very slow.

That said, I think I agree with your analysis that a two-level array is probably the right data structure for this, and likely not less cache efficient than the hash table. And the worst memory consumption case considered by Patrick should be relatively unlikely.

-Andi
On Thu, 2007-01-02 at 07:08 +0100, Patrick McHardy wrote:

> I have a patch for HFSC which introduces dynamic resizing of the
> class hash.

One thing that has bitten me recently was a test to see how far I could go inserting xfrm SAD/SPD entries - the resizing of the hashes kept allocating more and more space until I ran out of memory, then swap took over and all hell broke loose. It would be nice in your approach to keep a configurable upper bound on how much memory a hash table can chew.

> I have planned to generalize it (similar to tcf_hashinfo)
> and convert HTB and CBQ as well, which as a nice side effect will
> allow getting rid of some duplicated code, like hash walking.

You know what would be really nice: a generic piece of code that would apply to all sorts of net code that uses hashes (there is a huge amount of such code), and then slowly converting all users over to it. All the attributes of such hashes are known: max size, hash(), etc. The tcf_hashinfo is a good starting template for such an effort.

cheers,
jamal
Simon Lodal wrote:

> The patch is for 2.6.20-rc6; I have older ones for 2.6.18 and 2.6.19 if
> anyone is interested.

It is also working on 2.6.20-rc7. I'm testing it and I'm impressed. Good work :)
On 01-02-2007 12:30, Andi Kleen wrote:

> Simon Lodal <simonl@parknet.dk> writes:
>> Memory is generally not an issue, but CPU is, and you can not beat the CPU
>> efficiency of plain array lookup (always faster, and constant time).

Probably for some old (or embedded) lean boxes used as small network routers, with memory-hungry iptables, memory could be an issue.

> Actually that's not true when the array doesn't fit in cache.
>
> The cost of going out to memory over the caches is so large (a factor of
> 100 and more) that an algorithm with a smaller cache footprint can easily
> beat an algorithm that executes far fewer instructions, if it takes fewer
> cache misses. Not all instructions have the same cost: anything in core is
> very fast, but going out to memory is very slow.
>
> That said, I think I agree with your analysis that a two-level array is
> probably the right data structure for this, and likely not less cache
> efficient than the hash table.

Strange - it seems you gave only arguments against this analysis...

> And the worst memory consumption case considered by Patrick should
> be relatively unlikely.

Anyway, such an approach - assuming that most users will do things this (reasonable) way - doesn't look like good programming practice.

I wonder, why not try, at least for a while, making this a compile-time (menuconfig) option with a comment: recommended for a large number of classes. After hash optimization and some testing, final decisions could be made.

Regards,
Jarek P.
On Monday 05 February 2007 11:16, Jarek Poplawski wrote:

> Strange - it seems you gave only arguments against this analysis...

For a naturally clustered key space (as is common in this case), the two-level structure is likely more cache efficient than a generic hash function. That is because the hash will tend to spread out the natural clusters, which then require more cache lines to access because there is less sharing. In theory a hash function carefully tuned for this case might have similar properties, but normally people don't put that much care into designing hashes and just use some generic one.

>> And the worst memory consumption case considered by Patrick should
>> be relatively unlikely.
>
> Anyway, such an approach - assuming that most users will do things this
> (reasonable) way - doesn't look like good programming practice.

In the unlikely worst case they will get half a MB of tables. Hardly a showstopper.

> I wonder, why not try, at least for a while, making this a compile-time
> (menuconfig) option with a comment: recommended for a large number of
> classes. After hash optimization and some testing, final decisions could
> be made.

There are already far too many obscure CONFIGs. Don't add more.

-Andi
Andi Kleen wrote:

> On Monday 05 February 2007 11:16, Jarek Poplawski wrote:
>> I wonder, why not try, at least for a while, making this a compile-time
>> (menuconfig) option with a comment: recommended for a large number of
>> classes. After hash optimization and some testing, final decisions could
>> be made.
>
> There are already far too many obscure CONFIGs. Don't add more.

And for people concerned about memory usage: there is always CONFIG_SMALL for such decisions. Maybe one could limit the worst-case and average memory usage based on that config. The algorithm should stay the same.

Regards

Ingo Oeser
On Monday 05 February 2007 11:16, Jarek Poplawski wrote:

> On 01-02-2007 12:30, Andi Kleen wrote:
>> Simon Lodal <simonl@parknet.dk> writes:
>>> Memory is generally not an issue, but CPU is, and you can not beat the
>>> CPU efficiency of plain array lookup (always faster, and constant time).
>
> Probably for some old (or embedded) lean boxes used as small network
> routers, with memory-hungry iptables, memory could be an issue.

Sure, but if they are that constrained they probably do not run HTB in the first place.

We are talking about 4k initially, up to 256k worst case (or 512k if your router is 64-bit, unlikely if "small" is a priority).

>> And the worst memory consumption case considered by Patrick should
>> be relatively unlikely.
>
> Anyway, such an approach - assuming that most users will do things this
> (reasonable) way - doesn't look like good programming practice.

The current hash algorithm also assumes certain usage patterns, namely that you choose classids that generate different hash keys (= distribute uniformly across the buckets), or scalability will suffer very quickly. Even at 64 classes you would probably see htb_find() near the top of a profiling analysis.

But I would say that it is just as unlikely as choosing 64 classids that cause my patch to allocate all 256k.

In these unlikely cases, my patch only wastes passive memory, while the current HTB wastes CPU to a point where it can severely limit routing performance.

> I wonder, why not try, at least for a while, making this a compile-time
> (menuconfig) option with a comment: recommended for a large number of
> classes. After hash optimization and some testing, final decisions could
> be made.

I decided not to do it because it would mean too many ifdefs (ugly + unmaintainable code).

Regards
Simon
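P.S. For reference, the lookup being replaced is a small fixed hash plus a list walk, roughly like this (sketch from memory, not the exact source):

#define HTB_HSIZE 16				/* fixed number of buckets */

static inline int htb_hash(u32 h)
{
	h ^= h >> 8;
	h ^= h >> 4;
	return h & (HTB_HSIZE - 1);
}

static struct htb_class *htb_find(u32 handle, struct htb_sched *q)
{
	struct htb_class *cl;

	/* with N classes this list averages N/16 entries, so the walk is
	 * O(N): at 64 classes that is already ~4 compares per lookup, and
	 * at ~14k classes it dominates the profile */
	list_for_each_entry(cl, q->hash + htb_hash(handle), hlist)
		if (cl->classid == handle)
			return cl;
	return NULL;
}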
On Thursday 01 February 2007 12:30, Andi Kleen wrote:

> Simon Lodal <simonl@parknet.dk> writes:
>> Memory is generally not an issue, but CPU is, and you can not beat the
>> CPU efficiency of plain array lookup (always faster, and constant time).
>
> Actually that's not true when the array doesn't fit in cache.
>
> The cost of going out to memory over the caches is so large (a factor of
> 100 and more) that an algorithm with a smaller cache footprint can easily
> beat an algorithm that executes far fewer instructions, if it takes fewer
> cache misses. Not all instructions have the same cost: anything in core is
> very fast, but going out to memory is very slow.
>
> That said, I think I agree with your analysis that a two-level array is
> probably the right data structure for this, and likely not less cache
> efficient than the hash table.

Good point.

The two-level lookup generates 3 memory accesses (including getting at the htb_class struct). List traversal will generate many more memory accesses, unless the lists have 3 or fewer entries (currently that only holds true for up to 48 classes, uniformly distributed).

It is difficult to judge whether the tables will be in cache or not. The tables are clearly extra baggage for the cache lines, compared to only having the htb_class structs (which are going to be fetched anyway). On the other hand, if you have 10k classes, they are usually (in the real world) allocated for individual users, of which at most half are active at any time. With hashing, all 10k classes are fetched into cache lines all the time, only in order to traverse lists. That is >150k of wasted cache (5000 x 32 bytes); plenty for 10k pointers in lookup tables.

Regards
Simon
On Mon, Feb 05, 2007 at 06:14:13PM +0100, Simon Lodal wrote:

> On Monday 05 February 2007 11:16, Jarek Poplawski wrote:
>> On 01-02-2007 12:30, Andi Kleen wrote:
>>> Simon Lodal <simonl@parknet.dk> writes:
>>>> Memory is generally not an issue, but CPU is, and you can not beat the
>>>> CPU efficiency of plain array lookup (always faster, and constant time).
>>
>> Probably for some old (or embedded) lean boxes used as small network
>> routers, with memory-hungry iptables, memory could be an issue.
>
> Sure, but if they are that constrained they probably do not run HTB in the
> first place.
>
> We are talking about 4k initially, up to 256k worst case (or 512k if your
> router is 64-bit, unlikely if "small" is a priority).
>
>>> And the worst memory consumption case considered by Patrick should
>>> be relatively unlikely.
>>
>> Anyway, such an approach - assuming that most users will do things this
>> (reasonable) way - doesn't look like good programming practice.
>
> The current hash algorithm also assumes certain usage patterns, namely
> that you choose classids that generate different hash keys (= distribute
> uniformly across the buckets), or scalability will suffer very quickly.
> Even at 64 classes you would probably see htb_find() near the top of a
> profiling analysis.
>
> But I would say that it is just as unlikely as choosing 64 classids that
> cause my patch to allocate all 256k.
>
> In these unlikely cases, my patch only wastes passive memory, while the
> current HTB wastes CPU to a point where it can severely limit routing
> performance.
>
>> I wonder, why not try, at least for a while, making this a compile-time
>> (menuconfig) option with a comment: recommended for a large number of
>> classes. After hash optimization and some testing, final decisions could
>> be made.
>
> I decided not to do it because it would mean too many ifdefs
> (ugly + unmaintainable code).

As a matter of fact, Andi's recommendation is enough for me. In his first message he wrote "probably the right data structure for this", so I thought: why not test and make sure. It should be easier without removing the current solution. But his second message convinced me.

Generally I think 512k (or even 256k) should matter, and I don't agree that HTB is not for constrained boxes. It would be a dangerous attitude if every module in the kernel were so "generous". And it could be contagious: others don't care, so why should I? Some time ago, low memory requirements and the possibility of running on older boxes were strong arguments for Linux. Did we give that up to the BSDs?

So I only wanted to make sure there would be a real gain, because, for consistency, probably the same model should be used for the others (CBQ, HFSC).

Cheers,
Jarek P.
On Mon, Feb 05, 2007 at 06:14:13PM +0100, Simon Lodal wrote:
...
> Regards
...

It seems the decision makers need more time, so I'd add 2 cents more:

1c: the indentation could be improved (spaces around operators), like in these places:

> +#define HTB_MAX_CLS (TC_H_MIN(-1)+1)
...
> +	htb_cls_array* a;
...
> +	int cnt,done;

etc.

2c: it is a question of taste, but here:

>  	err = -ENOBUFS;
> +	if (q->classes[HTB_CLS_ARRAY(minorid)] == NULL) {
> +		if ((q->classes[HTB_CLS_ARRAY(minorid)] =
> +		     kzalloc(sizeof(htb_cls_array), GFP_KERNEL)) == NULL)
> +			goto failure;
> +	}
>  	if ((cl = kzalloc(sizeof(*cl), GFP_KERNEL)) == NULL)
>  		goto failure;

it would probably be more readable, and a bit merciful to the stressed system, to free this htb_cls_array after the last error (I know it's not a leak).

Regards,
Jarek P.

PS: 1c extra - it's easier to read a diff if you use the -p option.
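PS2: something along these lines is what I mean (untested sketch of the idea only, using a hypothetical "int new_array = 0;" local declared earlier, not a patch):

	err = -ENOBUFS;
	if (q->classes[HTB_CLS_ARRAY(minorid)] == NULL) {
		q->classes[HTB_CLS_ARRAY(minorid)] =
			kzalloc(sizeof(htb_cls_array), GFP_KERNEL);
		if (q->classes[HTB_CLS_ARRAY(minorid)] == NULL)
			goto failure;
		new_array = 1;		/* remember we just allocated the table */
	}
	cl = kzalloc(sizeof(*cl), GFP_KERNEL);
	if (cl == NULL) {
		/* don't keep an empty second-level table around */
		if (new_array) {
			kfree(q->classes[HTB_CLS_ARRAY(minorid)]);
			q->classes[HTB_CLS_ARRAY(minorid)] = NULL;
		}
		goto failure;
	}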