thr3ads.net - freebsd stable - em0 watchdog timeout (and 3ware problems) 7-stable [Apr 2009]

Greg Byshenk

2009-Apr-26 13:05 UTC

em0 watchdog timeout (and 3ware problems) 7-stable

I have one machine that is seeing watchdog timeouts on em0, running 7-STABLE
amd64 as of 2009.04.19, and also some other more perverse errors.

Twice now in the last 48 hours, this machine has become unreachable via the
network, and connecting to the console shows an endless string of 

   [...]
   em0: watchdog timeout -- resetting
   em0: watchdog timeout -- resetting
   em0: watchdog timeout -- resetting

messages. The machine is almost locked up.  That is, I can get a login
prompt, but can go no further than typing in a username; after the
username, no password prompt, and nothing further.  The only option is
to hard reset the machine or to drop to debugger and reboot.

Now the "perverse" part.  After restarting, the system partition is no
more.

Background detail:  the machine is a fileserver, with a 3Ware 9650SE-16ML
SATA controller, connected to 16 1TB SATA drives, this configured as
a 14-drive RAID10 array (+ 2 hot spares), with a 50GB system partition
and 6.5TB data partition.  The system partition is configured as da0,
with one slice and more or less standard partitions for / /var /tmp, etc.
(the data partition of the array is sliced with gpt).

The issue here is that, upon restart, all partition information on da0
seems to have disappeared, and restarting results in a "no operating
system found" message, and a failure to boot (obviously).

But all of the data is still present.  If I boot into rescue mode,
recreate da0s1, mark it bootable, and restore the bsdlabel, then
everything works again.  I can restart the machine, and it comes back
up normally (it requires an fsck of everything on da0, but after that
everything is back to normal).

I don't know if this is two unrelated problems, or one problem with
two symptoms, or something else.  I think that I can safely say that
it is not a problem with the 3Ware controller itself, as I replaced
the controller with a spare (identical model), and the problem
recurred.  Additionally, I have an almost-identical configuration on
four other machines, none of which are experiencing any problems.
One thing that is different is that the other machines use
Intel PRO/1000 PF (pci-e) NICs.

Is there some known problem with the Intel 2572 fibre NIC?  Or some
potential interaction of it with the 3ware RAID controller?

For the moment, I've set hw.pci.enable_msi=0 (as discussed in the
threads on 7.2/bge), and am building a new kernel/world from sources
csup'd one hour ago, but I'd really like to hear any ideas about this
-- particularly the wiping of the label.

Some information about the system:


# /dev/da0s1:
8 partitions:
#        size   offset    fstype   [fsize bsize bps/cpg]
  a:  2097152        0    4.2BSD        0     0     0 
  b:  8388608  2097152      swap                    
  c: 104856192        0    unused        0     0         # "raw" part, don't edit
  d:  8388608 10485760    4.2BSD        0     0     0 
  e:  2097152 18874368    4.2BSD        0     0     0 
  f: 41943040 20971520    4.2BSD        0     0     0 
  g: 41941632 62914560    4.2BSD        0     0     0 
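
As a sanity check on the label above, here is a minimal Python sketch (not part of the original post) that converts the partition sizes to GiB and verifies that partitions a through g are contiguous and exactly fill the "c" partition. It assumes 512-byte sectors; the function names are illustrative only.

```python
SECTOR_BYTES = 512  # assumption: sizes and offsets are in 512-byte sectors

# Partition lines copied from the bsdlabel output above (first four columns).
LABEL = """\
a:  2097152        0    4.2BSD
b:  8388608  2097152      swap
c: 104856192        0    unused
d:  8388608 10485760    4.2BSD
e:  2097152 18874368    4.2BSD
f: 41943040 20971520    4.2BSD
g: 41941632 62914560    4.2BSD
"""

def parse_label(text):
    """Return {letter: (size, offset, fstype)} with size/offset in sectors."""
    parts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        parts[fields[0].rstrip(":")] = (int(fields[1]), int(fields[2]), fields[3])
    return parts

def check_layout(parts):
    """True if the non-'c' partitions are contiguous and fill 'c' exactly."""
    cursor = 0
    for name in sorted(p for p in parts if p != "c"):
        size, offset, _ = parts[name]
        if offset != cursor:
            return False
        cursor = offset + size
    return cursor == parts["c"][0]

def gib(sectors):
    """Convert a sector count to GiB."""
    return sectors * SECTOR_BYTES / 2**30

parts = parse_label(LABEL)
for name in sorted(parts):
    size, offset, fstype = parts[name]
    print(f"{name}: {gib(size):6.2f} GiB at sector {offset:>9} ({fstype})")
print("contiguous and complete:", check_layout(parts))
```

Under that assumption the "c" partition comes out at just under 50 GiB, which matches the 50GB system partition described earlier, and the layout check passes.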


em0@pci0:4:1:0: class=0x020000 card=0x10038086 chip=0x10018086 rev=0x02 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '2572 10/100/1000 Ethernet Controller (Fiber)'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xda000000, size 131072, enabled
    bar   [14] = type Memory, range 32, base 0xda020000, size 65536, enabled
 
twa0@pci0:9:0:0: class=0x010400 card=0x100413c1 chip=0x100413c1 rev=0x01 hdr=0x00
    device     = '9650SE Series PCI-Express SATA2 Raid Controller'
    class      = mass storage
    subclass   = RAID
    bar   [10] = type Prefetchable Memory, range 64, base 0xd8000000, size 33554432, enabled
    bar   [18] = type Memory, range 64, base 0xda300000, size 4096, enabled
    bar   [20] = type I/O Port, range 32, base 0x3000, size 256, enabled
    cap 01[40] = powerspec 2  supports D0 D1 D2 D3  current D0
    cap 05[50] = MSI supports 32 messages, 64 bit
    cap 10[70] = PCI-Express 1 legacy endpoint

-- 
greg byshenk  -  gbyshenk@byshenk.net  -  Leiden, NL

Jack Vogel

2009-Apr-27 17:21 UTC


em0 watchdog timeout (and 3ware problems) 7-stable

Greg,

I have another report of this problem, and I have a patch for you to try
out; I will be sending it out a bit later today.

Jack



Greg Byshenk

2009-May-13 16:42 UTC


em0 watchdog timeout 7-stable

As a followup to my own previous message, I continue to have annoying
problems with "em?: watchdog timeout" on one of my machines (now running
7.2-STABLE as of 2009-05-08).

I have discontinued using the on-board (em, copper) NICs, and replaced
the original fibre NIC with a newer model, but the problem persists.
I've also set

   hw.pci.enable_msix=0
   hw.pci.enable_msi=0
   hw.em.rxd=1024
   hw.em.txd=1024
   net.inet.tcp.tso=0

...as suggested in some discussions of this problem, and set the em1
interface to 'polling', all to no avail.  Frequently, though irregularly
(once or twice a day), the console begins to display

   em1: watchdog timeout -- resetting
   em1: watchdog timeout -- resetting
   em1: watchdog timeout -- resetting

the network is down, and the machine locks up.

[Note: I am getting 'em1' now instead of 'em0' as previously, but this
is due to changing all of the NICs, which led to a different numbering;
the timeout is still occurring on the (main) interface, the fibre
gigabit connection.]
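
Because the interface name changed between reports, it may help to tally the resets per interface from a captured console or message log. The following is a minimal Python sketch, not something from the original thread; it assumes the log lines contain the exact message text shown above, and the unrelated sample line is hypothetical.

```python
import re
from collections import Counter

# Matches the console message quoted above, capturing the interface name.
WATCHDOG = re.compile(r"\b(em\d+): watchdog timeout -- resetting")

def count_watchdog_resets(lines):
    """Count watchdog-reset messages per em interface."""
    counts = Counter()
    for line in lines:
        match = WATCHDOG.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample = [
    "em1: watchdog timeout -- resetting",
    "em1: watchdog timeout -- resetting",
    "em1: watchdog timeout -- resetting",
    "an unrelated console line",  # hypothetical, for illustration only
]
print(dict(count_watchdog_resets(sample)))
```

In practice the `sample` list would be replaced by the lines of a saved log file, for example by iterating over an open file object.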

What is particularly perverse (IMO) is that, since changing the NIC to
the newer model (and updating the kernel), I can no longer break to the
debugger when the lockup occurs (there is no response to the break) --
but I _can_ shut the machine down cleanly via hardware (a touch of the
power switch sends 'shutdown', and the machine shuts down cleanly --
after killing off processes waiting on network i/o).

The machine is running nfs and samba (3.2.10, from ports), and pretty
much nothing else.


Anyone have any ideas about this...?  I'm going mad with this.

-greg byshenk



# pciconf -lvb
[...]
em1@pci0:7:1:0: class=0x020000 card=0x10028086 chip=0x10118086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82545EM Gigabit Ethernet Controller (Fiber)'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 64, base 0xda300000, size 131072, enabled
    bar   [20] = type I/O Port, range 32, base 0x5000, size 64, enabled
[...]

# vmstat -i
interrupt                          total       rate
irq4: sio0                          1666          0
irq6: fdc0                            10          0
irq14: ata0                           58          0
irq16: skc0 em0                  1437801         98
irq18: twa0                       846981         57
irq24: em1                       4378650        299
cpu0: timer                     29258004       1999
cpu1: timer                     29249758       1999
cpu3: timer                     29249816       1999
cpu7: timer                     29249779       1999
cpu2: timer                     29249729       1999
cpu4: timer                     29249852       1999
cpu6: timer                     29249851       1999
cpu5: timer                     29249814       1999
Total                          240671769      16450
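
To make the interrupt table easier to read, here is a minimal Python sketch (added for illustration, not part of the original message) that parses a subset of the `vmstat -i` output above, lists IRQ lines shared by more than one device, and estimates uptime. The uptime estimate assumes the rate column is the total divided by uptime, rounded down to an integer, so it is only approximate; the function names are illustrative.

```python
# Subset of the vmstat -i output above (one timer line is enough here).
VMSTAT_I = """\
interrupt                          total       rate
irq4: sio0                          1666          0
irq6: fdc0                            10          0
irq14: ata0                           58          0
irq16: skc0 em0                  1437801         98
irq18: twa0                       846981         57
irq24: em1                       4378650        299
cpu0: timer                     29258004       1999
"""

def parse_vmstat_i(text):
    """Return a list of (source, devices, total, rate) tuples."""
    rows = []
    for line in text.splitlines():
        source, colon, rest = line.partition(":")
        fields = rest.split()
        if not colon or len(fields) < 3:
            continue  # header line, "Total" line, or anything unexpected
        *devices, total, rate = fields
        rows.append((source, devices, int(total), int(rate)))
    return rows

def shared_irqs(rows):
    """IRQ lines that list more than one device."""
    return {src: devs for src, devs, _, _ in rows
            if src.startswith("irq") and len(devs) > 1}

def approx_uptime_hours(rows, source="cpu0"):
    """Estimate uptime as total/rate for one source (assumption, see above)."""
    for src, _, total, rate in rows:
        if src == source and rate > 0:
            return total / rate / 3600
    return None

rows = parse_vmstat_i(VMSTAT_I)
print("shared IRQ lines:", shared_irqs(rows))
print("approx. uptime (h):", round(approx_uptime_hours(rows), 1))
```

On this data the only shared line is irq16 (skc0 and em0); em1, the interface reporting the timeouts, has irq24 to itself.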



-- 
greg byshenk  -  gbyshenk@byshenk.net  -  Leiden, NL
