Hello,

I would like to ask for help, out of despair. I am running a complex HTB setup to manage two leased lines [1 + 2 Mbit] for 500 users. The setup has been in use for a few months, but throughout that time I have had serious stability problems. These days not a week passes without the system hanging two or three times. I have tried dozens of setups, patches, recompilations of the kernel, iproute and iptables, and various tricks, with almost no improvement. I tried to find the reason but without success. Below I mention some observed facts. Maybe somebody could advise or solve the problem...

When I started with HTB [2.0] [about 200 users on a 1 Mbit link; previously it ran on a CBQ setup] there were probably [AFAIK] no stability problems [many days between manual reboots]. From time to time I had sudden hangs, which later became more frequent. I changed the kernel to 2.4.18, but the reason I found [or one of the reasons...] was hardware related [damaged CPU and/or motherboard]. I replaced all the hardware and moved to 2.4.18 patched with HTB v2, but the problem did not disappear: the system still froze within several days. Most often it hung after 1 or 2 days of stable work, but in some cases after 4 or even 7 days.

Then I tried 2.4.18 patched with HTB v3, and the situation became hopeless. I found that changing HTB class parameters [tc class change...] led to the system freezing. Sometimes one change, sometimes several, caused the crash. HTB 3 was useless for me, and HTB 2 was and still is unstable. I also tried HTB 2 on a completely different machine [but with the same Linux distro, PLD], and the system crashed almost immediately.

Now I am testing 2.4.20-pre7 [with HTB 3 included]. The system does not hang immediately after changing HTB class parameters, but it only runs for several hours to one day. Unfortunately I have not had any error messages in the logs or on the console.

ERRATA: Today, for the first time, I saw some output on the console [kernel 2.4.20-pre7]. It is an Oops with "Process swapper (pid: 0, stackpage=c0211000)" and a lot of digits. I put a screenshot at: http://eter.tym.pl/bug2.gif

Digging in the LARTC archive I found what may be a similar problem in the post by Dimitris Zilaskos, "[LARTC] tc reliably hangs my system" [http://mailman.ds9a.nl/pipermail/lartc/2002q3/004316.html], but there is no solution for me there, and in despair I do not know the reason or what to do about it.

BTW, if this bug report [the http://eter.tym.pl/bug2.gif file] is related more to the kernel than to tc/htb, please tell me where to send it.

Regards
tw

--
----------------
ck.eter.tym.pl
"Never let schooling disturb your education"
_______________________________________________
LARTC mailing list / LARTC@mailman.ds9a.nl
http://mailman.ds9a.nl/mailman/listinfo/lartc HOWTO: http://lartc.org/

Tomasz Wrona wrote:
> I tried to find the reason but without success. Below I mention some
> observed facts. Maybe somebody could advise or solve the problem...

In case you want to try systematic debugging of HTB, you may find tcsim useful. tcsim can be run under Electric Fence and under Valgrind (http://developer.kde.org/~sewardj/). It won't help you find race conditions and such, but spotting odd side-effects of parameter changes may be well within its capabilities. Of course, a decent set of regression tests should also be useful for future HTB development ...

Concerning the Oops you got: you should run it through ksymoops (see Documentation/oops-tracing.txt in your kernel source tree). If you don't want to type in the whole Oops text, you can also get the location of individual symbols with

  gdb your/kernel/dir/vmlinux
  (gdb) info line *0xd093caa4

etc. The most useful data is in the EIP and the call trace.

- Werner

--
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
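
[Editorial sketch] Since the EIP and the call trace are the parts worth looking at, the following minimal Python sketch pulls those addresses out of a retyped Oops and prints one gdb "info line" query per address. The excerpt uses addresses from the Oops discussed in this thread; the function name parse_oops and the assumption that trace continuation lines start with "[<" are illustrative choices, not part of ksymoops or gdb.

```python
import re

# Excerpt of the Oops from this thread (retyped from a screenshot).
OOPS_TEXT = """\
EIP: 0010:[<d093c56f>] Not tainted
Process swapper (pid: 0, stackpage=c0211000)
Call Trace: [<d093caa4>] [<d093ce55>] [<d093d71f>] [<d093dc1b>]
 [<d093d913>] [<d093dc8c>] [<c0199843>] [<c019384d>]
"""

# Addresses appear in the kernel log as [<xxxxxxxx>] on 32-bit x86.
ADDR = re.compile(r"\[<([0-9a-fA-F]{8})>\]")


def parse_oops(text):
    """Return (eip, call_trace) as integers; eip is None if not found."""
    eip = None
    trace = []
    in_trace = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("EIP:"):
            match = ADDR.search(stripped)
            if match:
                eip = int(match.group(1), 16)
            in_trace = False
        elif stripped.startswith("Call Trace:"):
            in_trace = True
            trace.extend(int(a, 16) for a in ADDR.findall(stripped))
        elif in_trace and stripped.startswith("[<"):
            # Assumed layout: continuation lines hold only addresses.
            trace.extend(int(a, 16) for a in ADDR.findall(stripped))
        else:
            in_trace = False
    return eip, trace


eip, trace = parse_oops(OOPS_TEXT)
GDB_QUERIES = [f"info line *0x{addr:08x}" for addr in [eip] + trace]
for query in GDB_QUERIES:
    print(query)
```

The printed lines can be pasted at the (gdb) prompt one by one. Addresses that belong to a module rather than to vmlinux may not resolve this way, which is one reason ksymoops is the more complete route.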

Werner, thanks for the useful info!

On Wed, 18 Sep 2002, Werner Almesberger wrote:
> Concerning the Oops you got: you should run it through ksymoops
> (see Documentation/oops-tracing.txt in your kernel source tree).

OK, I retyped the screenshot and fed it to ksymoops, which said:
[Is this enough info to debug? What else can I do?]

### BEGIN ###
ksymoops 2.4.6 on i686 2.4.20-pre7.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.20-pre7/ (default)
     -m /boot/System.map-2.4.20-pre7 (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

Oops: 0000
CPU:    0
EIP:    0010:[<d093c56f>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010202
eax: 00005198   ebx: 00000030   ecx: cd227400   edx: cf6ecc84
esi: cf6ecc84   edi: 00000030   ebp: 00000000   esp: c0211e3c
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 0, stackpage=c0211000)
Stack: 00000000 cf6ecc00 00000000 cd227400 d093caa4 cd363c5c cf6ecc84 cd227400
       00000003 cd227400 00000001 cd363c5c cd363c5c d093ce55 cd363c5c cd227400
       03938700 00000000 cd227400 d093d71f cd363c5c cd227400 c0211eb4 cd363924
Call Trace: [<d093caa4>] [<d093ce55>] [<d093d71f>] [<d093dc1b>] [<d093d913>]
   [<d093dc8c>] [<c0199843>] [<c019384d>] [<c01168aa>] [<c0109962>]
   [<c0106ba0>] [<c0106ba0>] [<c010bb18>] [<c0106ba0>] [<c0106ba0>]
   [<c0106bc3>] [<c0106c29>] [<c0105000>] [<c0105027>]
Code: 81 38 f1 fe fa fe 74 12 68 84 01 00 00 68 00 f3 93 d0 c8 82

>>EIP; d093c56f <[sch_htb]htb_add_to_id_tree+93/130>   <====
>>ecx; cd227400 <_end+cfb8e4c/10596a4c>
>>edx; cf6ecc84 <_end+f47e6d0/10596a4c>
>>esi; cf6ecc84 <_end+f47e6d0/10596a4c>
>>esp; c0211e3c <init_task_union+1e3c/2000>

Trace; d093caa4 <[sch_htb]htb_activate_prios+a4/13c>
Trace; d093ce55 <[sch_htb]htb_change_class_mode+89/a0>
Trace; d093d71f <[sch_htb]htb_do_events+1bb/210>
Trace; d093dc1b <[sch_htb]htb_dequeue+10b/21c>
Trace; d093d913 <[sch_htb]htb_dequeue_tree+a7/218>
Trace; d093dc8c <[sch_htb]htb_dequeue+17c/21c>
Trace; c0199843 <qdisc_restart+13/d8>
Trace; c019384d <net_tx_action+99/a8>
Trace; c01168aa <do_softirq+5a/a4>
Trace; c0109962 <do_IRQ+96/a8>
Trace; c0106ba0 <default_idle+0/28>
Trace; c0106ba0 <default_idle+0/28>
Trace; c010bb18 <call_do_IRQ+5/d>
Trace; c0106ba0 <default_idle+0/28>
Trace; c0106ba0 <default_idle+0/28>
Trace; c0106bc3 <default_idle+23/28>
Trace; c0106c29 <cpu_idle+41/54>
Trace; c0105000 <_stext+0/0>
Trace; c0105027 <rest_init+27/28>

Code;  d093c56f <[sch_htb]htb_add_to_id_tree+93/130>
00000000 <_EIP>:
Code;  d093c56f <[sch_htb]htb_add_to_id_tree+93/130>   <====
   0:   81 38 f1 fe fa fe         cmpl   $0xfefafef1,(%eax)   <====
Code;  d093c575 <[sch_htb]htb_add_to_id_tree+99/130>
   6:   74 12                     je     1a <_EIP+0x1a> d093c589 <[sch_htb]htb_add_to_id_tree+ad/130>
Code;  d093c577 <[sch_htb]htb_add_to_id_tree+9b/130>
   8:   68 84 01 00 00            push   $0x184
Code;  d093c57c <[sch_htb]htb_add_to_id_tree+a0/130>
   d:   68 00 f3 93 d0            push   $0xd093f300
Code;  d093c581 <[sch_htb]htb_add_to_id_tree+a5/130>
  12:   c8 82 00 00               enter  $0x82,$0x0

<0>Kernel panic: Aiee, killing interrupt handler!

1 warning issued.  Results may not be reliable.
### END ###

Regards
tw

--
----------------
ck.eter.tym.pl
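
[Editorial sketch] The "symbol+offset/size" notation in the ksymoops output above comes from looking up each raw address in a sorted symbol table. The minimal Python sketch below illustrates that lookup with a binary search. The three table entries are not read from a real System.map or /proc/ksyms: their start addresses are derived by subtracting the offsets reported above from the raw addresses, and the sizes are the values after the slash. The function name resolve is hypothetical.

```python
import bisect

# (start address, size, name), sorted by start address.
# Derived from the ksymoops output in this thread, e.g.
# 0xd093c56f - 0x93 = 0xd093c4dc for htb_add_to_id_tree.
SYMBOLS = [
    (0xd093c4dc, 0x130, "htb_add_to_id_tree"),
    (0xd093ca00, 0x13c, "htb_activate_prios"),
    (0xd093cdcc, 0x0a0, "htb_change_class_mode"),
]
STARTS = [start for start, _size, _name in SYMBOLS]


def resolve(addr):
    """Map an address to 'name+offset/size', or None if it is in no symbol."""
    # Find the last symbol whose start address is <= addr.
    index = bisect.bisect_right(STARTS, addr) - 1
    if index < 0:
        return None
    start, size, name = SYMBOLS[index]
    offset = addr - start
    if offset >= size:
        return None  # falls in a gap after the symbol
    return f"{name}+{offset:x}/{size:x}"


for address in (0xd093c56f, 0xd093caa4, 0xd093ce55):
    print(f"{address:08x} {resolve(address)}")
```

The size check matters: without it, any address beyond the last known symbol would be attributed to that symbol, which is how stale or mismatched symbol tables produce misleading traces.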

Tomasz Wrona wrote:
> OK, I retyped the screenshot and fed it to ksymoops, which said:
> [Is this enough info to debug? What else can I do?]

I guess Martin could figure it out from this. I'm too lazy ;-) But it would be interesting to see whether this problem also shows up in tcsim (then, it should be easy to diagnose it completely).

You seem to have a sequence of configuration commands that reliably causes this crash, right? If yes, it would be good if you could send them.

Also, do you know what happens at the time of the crash? Is this simply the first packet that hits HTB after some change, or is this a packet with special characteristics? (Specific flow, etc.)

- Werner

--
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
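
[Editorial sketch] One way to capture a reproducible sequence of configuration commands, as requested above, is to generate the "tc class change" lines from a script so the exact order is recorded before anything is applied. The Python sketch below only builds and prints the command strings; it does not run tc. The device eth0, the class ids 1:1 and 1:10, and all rates are hypothetical placeholders, not values from the original setup.

```python
def tc_change_commands(dev, parent, classid, rates_kbit, ceil_kbit):
    """Build one 'tc class change' command line per rate, in order."""
    return [
        f"tc class change dev {dev} parent {parent} classid {classid} "
        f"htb rate {rate}kbit ceil {ceil_kbit}kbit"
        for rate in rates_kbit
    ]


# Hypothetical values, for illustration only.
COMMANDS = tc_change_commands(
    dev="eth0",
    parent="1:1",
    classid="1:10",
    rates_kbit=[32, 64, 96],
    ceil_kbit=128,
)

# Number each line so the last command issued before a hang can be identified.
for step, command in enumerate(COMMANDS, start=1):
    print(f"{step:03d} {command}")
```

Saving this numbered output next to the console log makes it possible to report which parameter change immediately preceded a freeze.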