Geoffroy Desvernay
2008-Dec-02 02:13 UTC
RELENG_7_1: bce driver change generating too much interrupts ?
Since the last upgrade, I see much more CPU time "eaten" by interrupts (at least 10% CPU in top) (see http://dgeo.perso.ec-marseille.fr/cpu-week.png).

The server behaves correctly (or seems to), and the high interrupt count seems to come from the bce cards (source: systat -vmstat).

I just upgraded from "RELENG_7 Mon Sep 8 12:33:06 CEST 2008" to "RELENG_7_1 Sat Nov 29 16:20:35 CET 2008".

We have an identical machine (Dell PE 1950) which has not been upgraded (production use; the two machines are carp(4)-redundant).

I don't know if this is related to the "SVN rev 184826 on 2008-11-10 22:40:16Z by delphij" patch to sys/dev/bce/if_bce.c.

Can I help debug this? These are production machines, but I can test patches or anything else on the faulty system.

Some clues:

Under the very same load (carp interfaces down on the other machine), vmstat shows, for the newer system:

 procs      memory      page                    disk   faults          cpu
 r b w     avm    fre   flt  re  pi  po    fr  sr mf0    in    sy    cs us sy id
 0 1 1   4806M   460M   649   0   0   0   582   2   0 21770  1270 13653  1 15 85

and for the older one:

 procs      memory      page                    disk   faults          cpu
 r b w     avm    fre   flt  re  pi  po    fr  sr mf0    in    sy    cs us sy id
 0 1 0   3694M   414M   236   0   0   0   199  17   0   286   317   386  1  1 97

bce-related part of dmesg for the newer system:

bce0: <Broadcom NetXtreme II BCM5708 1000Base-T (B2)> mem 0xf4000000-0xf5ffffff irq 16 at device 0.0 on pci9
miibus0: <MII bus> on bce0
bce0: Ethernet address: 00:15:c5:f1:56:f4
bce0: [ITHREAD]
bce0: ASIC (0x57081020); Rev (B2); Bus (PCI-X, 64-bit, 133MHz); F/W (0x02090105); Flags( SPLT MFW MSI )
bce1: <Broadcom NetXtreme II BCM5708 1000Base-T (B2)> mem 0xf8000000-0xf9ffffff irq 16 at device 0.0 on pci5
miibus1: <MII bus> on bce1
bce1: Ethernet address: 00:15:c5:f1:56:f2
bce1: [ITHREAD]
bce1: ASIC (0x57081020); Rev (B2); Bus (PCI-X, 64-bit, 133MHz); F/W (0x02090105); Flags( SPLT MFW MSI )

And on the older system:

bce0: <Broadcom NetXtreme II BCM5708 1000Base-T (B2)> mem 0xf4000000-0xf5ffffff irq 16 at device 0.0 on pci9
miibus0: <MII bus> on bce0
bce0: Ethernet address: 00:15:c5:f1:6a:47
bce0: [ITHREAD]
bce0: ASIC (0x57081020); Rev (B2); Bus (PCI-X, 64-bit, 133MHz); F/W (0x02090105); Flags( MFW MSI )
bce1: <Broadcom NetXtreme II BCM5708 1000Base-T (B2)> mem 0xf8000000-0xf9ffffff irq 16 at device 0.0 on pci5
miibus1: <MII bus> on bce1
bce1: Ethernet address: 00:15:c5:f1:6a:45
bce1: [ITHREAD]
bce1: ASIC (0x57081020); Rev (B2); Bus (PCI-X, 64-bit, 133MHz); F/W (0x02090105); Flags( MFW MSI )

-- 
Geoffroy Desvernay
Ecole Centrale de Marseille
Mike Jakubik
2008-Dec-02 08:01 UTC
RELENG_7_1: bce driver change generating too much interrupts ?
On Tue, December 2, 2008 4:57 am, Geoffroy Desvernay wrote:
> Since last upgrade, I see much more CPU time "eated" by interrupts (at
> least 10% cpu in top)
> (see http://dgeo.perso.ec-marseille.fr/cpu-week.png)

I am seeing the same behavior on a farm of Dell servers.

root@web.local:~# vmstat -i
interrupt                          total       rate
irq1: atkbd0                          18          0
irq14: ata0                          176          0
irq16: mfi0                        67924          1
irq20: uhci1 uhci3                     1          0
irq21: uhci0 uhci+                     5          0
cpu0: timer                    132244117       1997
irq257: bce1                  3366039632      50853
cpu1: timer                    132244053       1997
cpu2: timer                    132244053       1997
cpu3: timer                    132244053       1997
Total                         3895084032      58846

Not only this, but I have also noticed that netstat now reports a number of input errors. Before the driver update, I did not get these errors.

root@web.local:~# netstat -i
Name    Mtu Network       Address              Ipkts Ierrs    Opkts Oerrs  Coll
bce1   1500 <Link#2>      00:1e:c9:b5:cc:b6  1848959  2197  1357031     0     0
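For anyone who wants to check a larger farm for the same symptom, the following is a minimal sketch, not a tool anyone in this thread used, of picking the busiest source out of `vmstat -i` rows. It assumes the row layout shown above, where the last two whitespace-separated fields are the total and the rate and everything in front of them is the source name. The names `irq_row`, `parse_vmstat_i_row` and `busiest_source` are hypothetical.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* One parsed row of `vmstat -i` output (hypothetical helper type). */
struct irq_row {
    char name[64];
    unsigned long long total;
    unsigned long long rate;
};

/* Cut the last whitespace-separated field off `s` and return a pointer
 * to it. Returns NULL if there is no whitespace in front of that field. */
static char *cut_last_field(char *s)
{
    size_t n = strlen(s);

    while (n > 0 && isspace((unsigned char)s[n - 1]))
        s[--n] = '\0';                  /* drop trailing whitespace */
    while (n > 0 && !isspace((unsigned char)s[n - 1]))
        n--;                            /* walk back over the last field */
    if (n == 0)
        return NULL;
    s[n - 1] = '\0';                    /* terminate the part before it */
    return s + n;
}

/* Accept only plain decimal counters such as "50853". */
static int parse_count(const char *s, unsigned long long *out)
{
    char *end;

    if (!isdigit((unsigned char)s[0]))
        return -1;
    *out = strtoull(s, &end, 10);
    return *end == '\0' ? 0 : -1;
}

/* Parse one row; returns 0 on success, -1 for headers or malformed rows. */
int parse_vmstat_i_row(const char *row, struct irq_row *out)
{
    char buf[128];
    char *rate_s, *total_s, *name;
    size_t n;

    if (strlen(row) >= sizeof(buf))
        return -1;
    strcpy(buf, row);

    if ((rate_s = cut_last_field(buf)) == NULL ||
        (total_s = cut_last_field(buf)) == NULL)
        return -1;
    if (parse_count(rate_s, &out->rate) != 0 ||
        parse_count(total_s, &out->total) != 0)
        return -1;

    /* What is left is the source name; trim whitespace on both sides. */
    name = buf;
    while (isspace((unsigned char)*name))
        name++;
    n = strlen(name);
    while (n > 0 && isspace((unsigned char)name[n - 1]))
        name[--n] = '\0';
    if (n == 0 || n >= sizeof(out->name))
        return -1;
    memcpy(out->name, name, n + 1);
    return 0;
}

/* Index of the row with the highest rate, skipping the header and the
 * "Total" summary line; -1 if no row parses. */
int busiest_source(const char *const rows[], size_t nrows,
                   struct irq_row *best)
{
    int best_idx = -1;

    for (size_t i = 0; i < nrows; i++) {
        struct irq_row r;

        if (parse_vmstat_i_row(rows[i], &r) != 0 ||
            strcmp(r.name, "Total") == 0)
            continue;
        if (best_idx < 0 || r.rate > best->rate) {
            *best = r;
            best_idx = (int)i;
        }
    }
    return best_idx;
}
```

Fed the rows above, `busiest_source` would settle on `irq257: bce1` at a rate of 50853, against 1997 for each CPU timer.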
Dmitry Sivachenko
2008-Dec-03 01:03 UTC
RELENG_7_1: bce driver change generating too much interrupts ?
On Tue, Dec 02, 2008 at 04:44:46PM -0800, Xin LI wrote:
> Hi guys,
>
> I think I got a real fix.
>

I tried that patch with a very recent 7-STABLE.
It does fix the problem for me.

Thanks a lot!

> Cheers,
> -- 
> Xin LI <delphij@delphij.net> http://www.delphij.net/
> FreeBSD - The Power to Serve!

> Index: if_bce.c
> ===================================================================
> --- if_bce.c (revision 185565)
> +++ if_bce.c (working copy)
> @@ -7030,13 +7030,14 @@
>
>      /* Was it a link change interrupt? */
>      if ((status_attn_bits & STATUS_ATTN_BITS_LINK_STATE) !=
> -        (sc->status_block->status_attn_bits_ack & STATUS_ATTN_BITS_LINK_STATE))
> +        (sc->status_block->status_attn_bits_ack & STATUS_ATTN_BITS_LINK_STATE)) {
>          bce_phy_intr(sc);
>
> -    /* Clear any transient status updates during link state change. */
> -    REG_WR(sc, BCE_HC_COMMAND,
> -        sc->hc_command | BCE_HC_COMMAND_COAL_NOW_WO_INT);
> -    REG_RD(sc, BCE_HC_COMMAND);
> +        /* Clear any transient status updates during link state change. */
> +        REG_WR(sc, BCE_HC_COMMAND,
> +            sc->hc_command | BCE_HC_COMMAND_COAL_NOW_WO_INT);
> +        REG_RD(sc, BCE_HC_COMMAND);
> +    }
>
>      /* If any other attention is asserted then the chip is toast. */
>      if (((status_attn_bits & ~STATUS_ATTN_BITS_LINK_STATE) !=
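Reading the quoted patch, the only change is the pair of braces: before it, only bce_phy_intr(sc) was governed by the link-change test, so the BCE_HC_COMMAND write and read-back ran on every interrupt. The sketch below illustrates just that control-flow difference with counters in place of hardware. It is not driver code, `fake_softc` and the function names are hypothetical, and the thread itself does not explain how the extra write leads to the higher interrupt rate.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the driver state: it only counts what the
 * handler would have done; no hardware is touched. */
struct fake_softc {
    unsigned phy_intr_calls;    /* stands in for bce_phy_intr(sc) */
    unsigned coal_now_writes;   /* stands in for the REG_WR/REG_RD pair */
};

/* Control flow before the patch: without braces, only the first
 * statement belongs to the if. */
void intr_before_patch(struct fake_softc *sc, bool link_changed)
{
    if (link_changed)
        sc->phy_intr_calls++;

    /* Not part of the if: runs on every interrupt. */
    sc->coal_now_writes++;
}

/* Control flow after the patch: both steps happen only on link change. */
void intr_after_patch(struct fake_softc *sc, bool link_changed)
{
    if (link_changed) {
        sc->phy_intr_calls++;
        sc->coal_now_writes++;
    }
}

/* Deliver `interrupts` interrupts, the first `link_changes` of which are
 * link-change interrupts, and report how many coalesce writes resulted. */
unsigned coal_writes_after(void (*handler)(struct fake_softc *, bool),
                           unsigned interrupts, unsigned link_changes)
{
    struct fake_softc sc = { 0, 0 };

    for (unsigned i = 0; i < interrupts; i++)
        handler(&sc, i < link_changes);
    return sc.coal_now_writes;
}
```

With 1000 simulated interrupts of which one is a link change, the pre-patch flow issues the write 1000 times and the patched flow once.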
Mike Jakubik
2008-Dec-03 07:48 UTC
RELENG_7_1: bce driver change generating too much interrupts ?
On Wed, December 3, 2008 3:27 am, Dmitry Sivachenko wrote:
> On Tue, Dec 02, 2008 at 04:44:46PM -0800, Xin LI wrote:
>> Hi guys,
>>
>> I think I got a real fix.
>>
>
> I tried that patch with very recent 7-STABLE.
> I does fix the problem for me.

Good to hear. I will have to wait a few days before I update the code, as these systems are in production. Thanks guys.
geoffroy desvernay
2008-Dec-03 13:23 UTC
RELENG_7_1: bce driver change generating too much interrupts ?
Xin LI wrote:
> Hi guys,
>
> I think I got a real fix.
>

It seems to "work for me" too.

Server under normal load (smtp/imap/Maildir for ~1000 users, NFS filer), everything seems OK... (1h uptime for now)

Thank you!

-- 
geoffroy desvernay
Xin LI
2008-Dec-05 13:40 UTC
RELENG_7_1: bce driver change generating too much interrupts ?
FYI, I have committed the patch as r185653 (stable/7) and r185654 (releng/7.1), so new builds will have this issue fixed. Thanks go to David, who reviewed the changes, and to all who tested the earlier patches.

Cheers,
-- 
Xin LI <delphij@delphij.net> http://www.delphij.net/
FreeBSD - The Power to Serve!
Oleg Gorokhov
2008-Dec-08 01:53 UTC
RELENG_7_1: bce driver change generating too much interrupts ?
The committed patch fixes the interrupt issue reported earlier, but there is one more problem, discussed here:

http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2008-11/msg00144.html

We have also observed similar bad behavior for network (especially SSL-based) operations: imaps, ssh and smtp starttls connections all failed to establish after a day of successful operation:

Dec 7 23:32:41 imaps[62530]: accepted connection
Dec 7 23:32:41 imaps[62530]: SSL_accept() incomplete -> wait
Dec 7 23:32:41 imaps[62530]: wrong version number in SSL_accept() -> fail
Dec 7 23:32:41 master[3930]: process 62530 exited, status 75
Dec 7 23:32:41 master[3930]: service imaps pid 62530 in BUSY state: terminated abnormally
Dec 7 23:39:26 imaps[91999]: SSL_accept() incomplete -> wait
Dec 7 23:39:26 imaps[91999]: decryption failed or bad record mac in SSL_accept() -> fail
Dec 7 23:32:44 lmtp[77715]: [lmtpd] STARTTLS failed: gamgee.yandex.ru [77.88.19.54]

We have reverted to stable from before the last bce driver update was committed to the releng branch, and now hope that the system will run as expected.

Xin LI wrote:
> FYI, I have committed the patch as r185653 (stable/7) and r185654
> (releng/7.1) so new build would get this issue fixed. Thanks goes to
> David who gave review for the changes and all who tested the earlier
> patches.
>
> Cheers,
> -- 
> Xin LI <delphij@delphij.net> http://www.delphij.net/
> FreeBSD - The Power to Serve!

-- 
Oleg Gorokhov
System Administrator, Yandex
Tel.: +7 (495) 739-7000 (+7166)
Mike Jakubik
2008-Dec-08 11:46 UTC
RELENG_7_1: bce driver change generating too much interrupts ?
On Mon, December 8, 2008 4:29 am, Oleg Gorokhov wrote:
> This patch committed fixes the issue reported earlier with interruptions
> but there is one more problem discussed here:
>
> http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2008-11/msg00144.html
>
> We also have observed similar bad behavior for network (especially
> ssl-based) operations: imaps, ssh and smtp starttls connections - all of
> them were failed to establish after a day of successful operation:
>

I wonder if my problem is related to this. I have a Java chat service application that starts dropping connections after about 4 days of uptime. There is nothing in the application's logs, and I know it works fine on Linux. I will try updating to the latest bce patch tonight to see if it helps.
Danny Braniss
2008-Dec-18 02:49 UTC
RELENG_7_1: bce driver change generating too much interrupts ?
> Hi, Nawfal,
>
> Nawfal bin Mohmad Rouyan wrote:
> > I have been using a Dell machine with 2 bce interfaces as a bridge
> > between my LAN and Firewall to shape the traffic. Since after the
> > update, the machine can only run for a few minutes and after that no
> > more connection can go through.
> >
> > Ping from LAN to Internet is OK but when I telnet say to www.yahoo.com
> > at port 80 and issue "GET / HTTP/1.0" I can see the data of different
> > application including the HTML text.
> >
> > For example, I can see uTorrent packets with binaries and also the HTML
> > page being cut short. It's as if, I'm seeing packets jumbled together
> > from different application.
> >
> > I'm using PF to shape the traffic. If I reboot the server, it will panic
> > and I have about 3 different vmcores in /var/crash and not sure what to
> > do with it :( . I've tested the patch to remove
> > stat_IfInFramesL2FilterDiscards but the problem still occurs.
>
> The last patch is not a functional change, but a behavior change that
> removes the L2FilterDiscards from being counted to match previous behavior.
>
> Would you please do this:
>
> script bt.txt kgdb /boot/kernel/kernel.symbols /var/crash/vmcore.0
>
> Then, do 'bt', press enter until all display has finished, then exit
> kgdb, and send me the result (bt.txt)?
>
> > As for now, I'm not using the server to shape the traffic because I
> > suspect the driver isn't reliable. I'm going to revert back to the
> > previous driver and hopes its going to work.
> >
> > Sorry if there is not much detail since I'm not sure what to provide.
> > Just tell me what to provide and I'd be happy to do so.

I don't know if the following is related, but while stress testing nfs/zfs, I get many weird things on the server (dell-2950/bce). For example:

impossible packet length (33555456) from nfs server fr-01:/vol/system/share
impossible packet length (1792323116) from nfs server fr-01:/vol/system/share
...

and things get worse soon after.

Now, there are no input errors, so it seems some memory starvation is not being handled properly ...

cheers,
danny