I have one machine that is seeing watchdog timeouts on em0, running 7-STABLE
amd64 as of 2009.04.19, and also some other more perverse errors.

Twice now in the last 48 hours, this machine has become unreachable via the
network, and connecting to the console shows an endless string of

    [...]
    em0: watchdog timeout -- resetting
    em0: watchdog timeout -- resetting
    em0: watchdog timeout -- resetting

messages. The machine is almost locked up. That is, I can get a login
prompt, but can go no further than typing in a username; after the
username, there is no password prompt, and nothing further. The only option
is to hard reset the machine or to drop to the debugger and reboot.

Now the "perverse" part. After restarting, the system partition is no
more.

Background detail: the machine is a fileserver, with a 3Ware 9650SE-16ML
SATA controller, connected to 16 1TB SATA drives, configured as a 14-drive
RAID10 array (+ 2 hot spares), with a 50GB system partition and a 6.5TB
data partition. The system partition is configured as da0, with one slice
and more or less standard partitions for / /var /tmp, etc. (the data
partition of the array is sliced with gpt).

The issue here is that, upon restart, all partition information on da0
seems to have disappeared, and restarting results in a "no operating
system found" message, and a failure to boot (obviously).

But all of the data is still present. If I boot into rescue mode,
recreate da0s1, mark it bootable, and restore the bsdlabel, then
everything works again. I can restart the machine, and it comes back
up normally (it requires an fsck of everything on da0, but after that
everything is back to normal).

I don't know if this is two unrelated problems, or one problem with
two symptoms, or something else. I think that I can safely say that
it is not a problem with the 3Ware controller itself, as I replaced
the controller with a spare (identical model), and the problem
recurred. Additionally, I have an almost-identical configuration on
four other machines, none of which are experiencing any problems.
One thing that is different is that the other machines use
Intel PRO/1000 PF (pci-e) NICs.

Is there some known problem with the Intel 2572 fibre NIC? Or some
potential interaction of it with the 3ware RAID controller?

For the moment, I've set hw.pci.enable_msi=0 (as discussed in the
threads on 7.2/bge), and am building a new kernel/world from sources
csup'd one hour ago, but I'd really like to hear any ideas about this
-- particularly the wiping of the label.

Some information about the system:

    # /dev/da0s1:
    8 partitions:
    #         size    offset    fstype   [fsize bsize bps/cpg]
      a:   2097152         0    4.2BSD        0     0     0
      b:   8388608   2097152      swap
      c: 104856192         0    unused        0     0         # "raw" part, don't edit
      d:   8388608  10485760    4.2BSD        0     0     0
      e:   2097152  18874368    4.2BSD        0     0     0
      f:  41943040  20971520    4.2BSD        0     0     0
      g:  41941632  62914560    4.2BSD        0     0     0

    em0@pci0:4:1:0: class=0x020000 card=0x10038086 chip=0x10018086 rev=0x02 hdr=0x00
        vendor     = 'Intel Corporation'
        device     = '2572 10/100/1000 Ethernet Controller (Fiber)'
        class      = network
        subclass   = ethernet
        bar   [10] = type Memory, range 32, base 0xda000000, size 131072, enabled
        bar   [14] = type Memory, range 32, base 0xda020000, size 65536, enabled

    twa0@pci0:9:0:0: class=0x010400 card=0x100413c1 chip=0x100413c1 rev=0x01 hdr=0x00
        device     = '9650SE Series PCI-Express SATA2 Raid Controller'
        class      = mass storage
        subclass   = RAID
        bar   [10] = type Prefetchable Memory, range 64, base 0xd8000000, size 33554432, enabled
        bar   [18] = type Memory, range 64, base 0xda300000, size 4096, enabled
        bar   [20] = type I/O Port, range 32, base 0x3000, size 256, enabled
        cap 01[40] = powerspec 2  supports D0 D1 D2 D3  current D0
        cap 05[50] = MSI supports 32 messages, 64 bit
        cap 10[70] = PCI-Express 1 legacy endpoint

-- 
greg byshenk - gbyshenk@byshenk.net - Leiden, NL
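
As a side check on the bsdlabel listing in the message above: since the label has to be restored by hand after each failure, it may help to confirm that the saved copy is internally consistent before writing it back. The following is only a sketch in Python, not part of the original post. The helper names (parse_label, check_layout) are made up for illustration, and the 512-byte sector size is an assumption. It checks that the partitions are contiguous and exactly fill the "c" (raw) partition.

```python
# Sketch: sanity-check a saved bsdlabel listing before restoring it.
# parse_label/check_layout are hypothetical helper names, and the
# 512-byte sector size is an assumption, not stated in the post.

LABEL = """\
# /dev/da0s1:
8 partitions:
#         size    offset    fstype   [fsize bsize bps/cpg]
  a:   2097152         0    4.2BSD        0     0     0
  b:   8388608   2097152      swap
  c: 104856192         0    unused        0     0         # "raw" part, don't edit
  d:   8388608  10485760    4.2BSD        0     0     0
  e:   2097152  18874368    4.2BSD        0     0     0
  f:  41943040  20971520    4.2BSD        0     0     0
  g:  41941632  62914560    4.2BSD        0     0     0
"""

SECTOR_BYTES = 512  # assumed sector size


def parse_label(text):
    """Return {letter: (size, offset, fstype)} with sizes in sectors."""
    parts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        letter, rest = line.split(":", 1)
        fields = rest.split()
        if len(letter) != 1 or len(fields) < 3:
            continue  # skips the "8 partitions:" line
        parts[letter] = (int(fields[0]), int(fields[1]), fields[2])
    return parts


def check_layout(parts):
    """Return a list of problems; an empty list means the layout is contiguous."""
    problems = []
    raw_size = parts["c"][0]
    cursor = 0
    real = [(v[1], v[0], k) for k, v in parts.items() if k != "c"]
    for offset, size, letter in sorted(real):
        if offset != cursor:
            problems.append(f"{letter}: starts at {offset}, expected {cursor}")
        cursor = offset + size
    if cursor != raw_size:
        problems.append(f"partitions end at {cursor}, raw size is {raw_size}")
    return problems


parts = parse_label(LABEL)
problems = check_layout(parts)
size_gib = parts["c"][0] * SECTOR_BYTES / 2**30
print("problems:", problems)
print(f"slice size: {size_gib:.2f} GiB")
```

With the listing from the post, the partitions a, b, d, e, f, g follow one another without gaps and end exactly at the size of "c", which under the 512-byte assumption is just under 50 GiB and so agrees with the "50GB system partition" described above.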
Greg,

I have another report of this problem, and I have a patch for you to try
out; I will be sending it out a bit later today.

Jack

On Sun, Apr 26, 2009 at 5:50 AM, Greg Byshenk <freebsd@byshenk.net> wrote:

> I have one machine that is seeing watchdog timeouts on em0, running
> 7-STABLE amd64 as of 2009.04.19, and also some other more perverse errors.
>
> Twice now in the last 48 hours, this machine has become unreachable via the
> network, and connecting to the console shows an endless string of
>
> [...]
> em0: watchdog timeout -- resetting
> em0: watchdog timeout -- resetting
> em0: watchdog timeout -- resetting
>
> [...]
As a followup to my own previous message, I continue to have annoying
problems with "em?: watchdog timeout" on one of my machines (now running
7.2-STABLE as of 2009-05-08).

I have discontinued using the on-board (em, copper) NICs, and replaced
the original fibre NIC with a newer model, but the problem persists.

I've also set

    hw.pci.enable_msix=0
    hw.pci.enable_msi=0
    hw.em.rxd=1024
    hw.em.txd=1024
    net.inet.tcp.tso=0

...as suggested in some discussions of this problem, and set the em1
interface to 'polling', all to no avail.

Frequently, though irregularly (once or twice a day), the console begins
to display

    em1: watchdog timeout -- resetting
    em1: watchdog timeout -- resetting
    em1: watchdog timeout -- resetting

the network is down, and the machine locks up.

[Note: I am getting 'em1' now instead of 'em0' as previously, but this
is due to changing all of the NICs, which led to a different numbering;
the timeout is still occurring on the (main) interface, the fibre
gigabit connection.]

What is particularly perverse (IMO) is that, since changing the NIC to
the newer model (and updating the kernel), I can no longer break to the
debugger when the lockup occurs (there is no response to the break) --
but I _can_ shut the machine down cleanly via hardware (a touch of the
power switch sends 'shutdown', and the machine shuts down cleanly --
after killing off processes waiting on network i/o).

The machine is running nfs and samba (3.2.10, from ports), and pretty
much nothing else.

Anyone have any ideas about this...? I'm going mad with this.

-greg byshenk

    # pciconf -lvb
    [...]
    em1@pci0:7:1:0: class=0x020000 card=0x10028086 chip=0x10118086 rev=0x01 hdr=0x00
        vendor     = 'Intel Corporation'
        device     = '82545EM Gigabit Ethernet Controller (Fiber)'
        class      = network
        subclass   = ethernet
        bar   [10] = type Memory, range 64, base 0xda300000, size 131072, enabled
        bar   [20] = type I/O Port, range 32, base 0x5000, size 64, enabled
    [...]
    # vmstat -i
    interrupt                          total       rate
    irq4: sio0                          1666          0
    irq6: fdc0                            10          0
    irq14: ata0                           58          0
    irq16: skc0 em0                  1437801         98
    irq18: twa0                       846981         57
    irq24: em1                       4378650        299
    cpu0: timer                     29258004       1999
    cpu1: timer                     29249758       1999
    cpu3: timer                     29249816       1999
    cpu7: timer                     29249779       1999
    cpu2: timer                     29249729       1999
    cpu4: timer                     29249852       1999
    cpu6: timer                     29249851       1999
    cpu5: timer                     29249814       1999
    Total                          240671769      16450

On Sun, Apr 26, 2009 at 02:50:08PM +0200, Greg Byshenk wrote:

> I have one machine that is seeing watchdog timeouts on em0, running 7-STABLE
> amd64 as of 2009.04.19, and also some other more perverse errors.
>
> Twice now in the last 48 hours, this machine has become unreachable via the
> network, and connecting to the console shows an endless string of
>
> [...]
> em0: watchdog timeout -- resetting
> em0: watchdog timeout -- resetting
> em0: watchdog timeout -- resetting
>
> [...]

-- 
greg byshenk - gbyshenk@byshenk.net - Leiden, NL
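
For readers trying to interpret the `vmstat -i` output in the follow-up message, the sketch below (Python, not part of the original thread) parses that listing, checks that the per-source totals add up to the reported Total, estimates the uptime from the cpu0 timer line, and lists interrupt lines shared by more than one device. The function name parse_vmstat_i is made up for illustration, and the parsing assumes the simple "name total rate" layout shown in the post.

```python
# Sketch: summarize the `vmstat -i` listing quoted in the follow-up.
# parse_vmstat_i is a hypothetical helper; the parsing assumes each row
# ends with two integer columns (total, rate), as in the post.

VMSTAT_OUTPUT = """\
interrupt                          total       rate
irq4: sio0                          1666          0
irq6: fdc0                            10          0
irq14: ata0                           58          0
irq16: skc0 em0                  1437801         98
irq18: twa0                       846981         57
irq24: em1                       4378650        299
cpu0: timer                     29258004       1999
cpu1: timer                     29249758       1999
cpu3: timer                     29249816       1999
cpu7: timer                     29249779       1999
cpu2: timer                     29249729       1999
cpu4: timer                     29249852       1999
cpu6: timer                     29249851       1999
cpu5: timer                     29249814       1999
Total                          240671769      16450
"""


def parse_vmstat_i(text):
    """Return (rows, reported_total); rows are (source, devices, total, rate)."""
    rows = []
    reported_total = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("interrupt"):
            continue  # skip blanks and the header row
        name, total, rate = line.rsplit(None, 2)
        if name == "Total":
            reported_total = int(total)
            continue
        source, _, devices = name.partition(":")
        rows.append((source, devices.split(), int(total), int(rate)))
    return rows, reported_total


rows, reported_total = parse_vmstat_i(VMSTAT_OUTPUT)

# Cross-check: the individual totals should add up to the Total line.
computed_total = sum(total for _, _, total, _ in rows)

# Hardware interrupt lines with more than one device attached.
shared = {src: devs for src, devs, _, _ in rows
          if src.startswith("irq") and len(devs) > 1}

# rate is total/uptime, so total/rate gives a rough uptime in seconds.
cpu0 = next(r for r in rows if r[0] == "cpu0")
uptime_hours = cpu0[2] / cpu0[3] / 3600

print("totals consistent:", computed_total == reported_total)
print("shared interrupt lines:", shared)
print(f"approximate uptime: {uptime_hours:.1f} hours")
```

On this listing the totals are consistent, and the only shared line is irq16 (skc0 and em0); em1, the interface reporting the timeouts, has irq24 to itself. That observation alone does not point to a cause. It only narrows down what the listing does and does not show.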