Torfinn Ingolfsen
2010-Jan-31 13:42 UTC
panic - sleeping thread on FreeBSD 8.0-stable / amd64
Hi,

One of my machines had a panic or something.
The machine was pingable, but I couldn't ssh into it, and there was no response on the console.
On the console were these lines:
"Sleeping thread (tid 10014, pid 0) owns a non-sleepable lock"
"panic: sleeping thread"
"cpuid = 1"

The only thing I did with the machine yesterday (before this happened) was to upgrade it from zfs v13 to v14. But that went well, no problems.

root@kg-f2# last | head -6
tingo pts/1 kg-v2.kg4.no Sun Jan 31 14:26 still logged in
tingo pts/0 kg-v2.kg4.no Sun Jan 31 14:26 still logged in
root ttyv0 Sun Jan 31 14:25 still logged in
reboot ~ Sun Jan 31 14:25
reboot ~ Sat Jan 30 17:48
reboot ~ Sat Jan 30 17:31

The two reboots yesterday are from the zfs upgrade.
Since I couldn't do anything useful with it, I just turned off the power and rebooted it.

The machine runs FreeBSD 8.0-stable and zfs:
root@kg-f2# uname -a
FreeBSD kg-f2.kg4.no 8.0-STABLE FreeBSD 8.0-STABLE #1: Fri Jan 15 16:43:49 CET 2010 root@kg-f2.kg4.no:/usr/obj/usr/src/sys/GENERIC amd64

It will be a file server, but is mostly idle now (I haven't started using it yet).
Details on the hardware: http://sites.google.com/site/tingox/ga-ma74gm-s2h
dmesgs and more on the FreeBSD page for this machine: http://sites.google.com/site/tingox/ga-ma74gm-s2h_freebsd
HTH
--
Regards,
Torfinn Ingolfsen
Torfinn Ingolfsen
2010-Jan-31 16:56 UTC
panic - sleeping thread on FreeBSD 8.0-stable / amd64
On Sun, 31 Jan 2010 14:42:17 +0100 Torfinn Ingolfsen <torfinn.ingolfsen@broadpark.no> wrote:
> Hi,
>
> One of my machines had a panic or something.
> The machine was pingable, but I couldn't ssh into it, and there was no response on the console.

And it did it again, only a few hours later. I'll try to update to latest -stable, and see if that helps.
Same message as last time; unfortunately I didn't record the details (tid and pid). Oh well.
--
Regards,
Torfinn Ingolfsen
Torfinn Ingolfsen
2010-Feb-07 15:36 UTC
panic - sleeping thread on FreeBSD 8.0-stable / amd64
On Sun, 31 Jan 2010 17:56:39 +0100 Torfinn Ingolfsen <torfinn.ingolfsen@broadpark.no> wrote:
> On Sun, 31 Jan 2010 14:42:17 +0100
> Torfinn Ingolfsen <torfinn.ingolfsen@broadpark.no> wrote:
>
> > Hi,
> >
> > One of my machines had a panic or something.
> > The machine was pingable, but I couldn't ssh into it, and there was no response on the console.
>
> And it did it again, only a few hours later. I'll try to update to latest -stable, and see if that helps.
> Same message as last time; unfortunately I didn't record the details (tid and pid). Oh well.

Well, it was stable for many days, but today it rebooted on its own again.
After the fact, I see this in /var/log/messages:
Feb 7 11:50:16 kg-f2 ntpd[906]: time reset +2.376096 s
Feb 7 12:02:21 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 7 12:02:21 kg-f2 kernel: ata6: hardware reset timeout
Feb 7 12:05:43 kg-f2 syslogd: kernel boot file is /boot/kernel/kernel

So there is probably some problem with a cable or disk.
On the plus side, it did reboot and came up again without any issues.
--
Regards,
Torfinn Ingolfsen
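A note on reading the tfd values in these messages. The sketch below assumes that tfd is the AHCI task file data register, with the ATA status register in the low byte and the error register in the next byte; the thread does not confirm that, so treat the decoding as an assumption. The bit names are the classic ATA status bits, and decode_tfd is a made-up helper name.

```python
# Classic ATA status register bits, highest bit first.
# Assumption: the low byte of "tfd" is this status register.
STATUS_BITS = [
    (0x80, "BSY"),
    (0x40, "DRDY"),
    (0x20, "DF"),
    (0x10, "DSC"),
    (0x08, "DRQ"),
    (0x04, "CORR"),
    (0x02, "IDX"),
    (0x01, "ERR"),
]


def decode_tfd(tfd_hex):
    """Split a tfd hex string into (error byte, list of status flag names)."""
    value = int(tfd_hex, 16)
    status = value & 0xFF          # low byte: status register
    error = (value >> 8) & 0xFF    # next byte: error register
    flags = [name for mask, name in STATUS_BITS if status & mask]
    return error, flags


# The two values that appear in the kernel messages of this thread.
for raw in ("0000007f", "00000080"):
    error, flags = decode_tfd(raw)
    print(raw, "error=0x%02x" % error, "status=" + "|".join(flags))
```

Under that assumption, 00000080 is BSY alone (the device never stopped being busy), while 0000007f has every bit except BSY set at once, which is not what an idle, healthy drive normally reports.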
Torfinn Ingolfsen
2010-Feb-17 08:17 UTC
panic - sleeping thread on FreeBSD 8.0-stable / amd64
Another crash last night. In /var/log/messages:
Feb 16 23:13:22 kg-f2 ntpd[2826]: time reset +1.780863 s
Feb 16 23:16:42 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:16:42 kg-f2 kernel: ata6: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata6: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata5: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata6: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata5: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 00000080
Feb 16 23:20:39 kg-f2 kernel: ata6: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ad4: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=65614674
Feb 16 23:20:39 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata5: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata6: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata5: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata6: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata5: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 00000080
Feb 16 23:20:39 kg-f2 kernel: ata6: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata5: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 00000080
Feb 16 23:20:39 kg-f2 kernel: ata6: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ad6: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=65614674
Feb 16 23:20:39 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata5: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata6: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata5: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata6: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata5: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata6: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 16 23:20:39 kg-f2 kernel: ata5: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 00000080
Feb 16 23:20:39 kg-f2 kernel: ata6: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ad6: WARNING - WRITE_DMA requeued due to channel reset LBA=65614674
Feb 16 23:20:39 kg-f2 kernel: ata3: FAILURE - already active DMA on this device
Feb 16 23:20:39 kg-f2 kernel: ata3: setting up DMA failed
Feb 16 23:20:39 kg-f2 kernel: ad6: TIMEOUT - READ_DMA retrying (1 retry left) LBA=8389026
Feb 16 23:20:39 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 00000080
Feb 16 23:20:39 kg-f2 kernel: ata5: hardware reset timeout
Feb 16 23:20:39 kg-f2 kernel: ad4: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=65614674
Feb 16 23:20:39 kg-f2 kernel: ad6: TIMEOUT - READ_DMA retrying (0 retries left) LBA=8389026
Feb 16 23:20:39 kg-f2 root: ZFS: vdev I/O failure, zpool=zroot path=/dev/gpt/disk1 offset=29299662848 size=4096 error=5
Feb 17 09:09:40 kg-f2 syslogd: kernel boot file is /boot/kernel/kernel

But ad4 and ad6 are the two-disk mirror (zfs) that I have built my root filesystem on. Hmm.
--
Torfinn
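With this many near-identical lines it is easier to count than to read. A minimal sketch for tallying the "port is not ready" messages per ATA channel; the function name count_port_timeouts and the short sample are illustrative only, and the sample lines are copied from the log in this message.

```python
import re
from collections import Counter

# Matches e.g. "kernel: ata6: port is not ready" and captures the channel name.
PORT_NOT_READY = re.compile(r"kernel: (ata\d+): port is not ready")


def count_port_timeouts(lines):
    """Count 'port is not ready' kernel messages per ATA channel."""
    counts = Counter()
    for line in lines:
        match = PORT_NOT_READY.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines from the Feb 16 log as sample input.
sample = [
    "Feb 16 23:16:42 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 0000007f",
    "Feb 16 23:16:42 kg-f2 kernel: ata6: hardware reset timeout",
    "Feb 16 23:20:39 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f",
    "Feb 16 23:20:39 kg-f2 kernel: ad4: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=65614674",
    "Feb 16 23:20:39 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 00000080",
]

counts = count_port_timeouts(sample)
for channel, n in sorted(counts.items()):
    print(channel, n)
```

The function accepts any iterable of lines, so an open file handle on /var/log/messages can be passed in directly instead of the sample list.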
Torfinn Ingolfsen
2010-Feb-20 19:21 UTC
panic - sleeping thread on FreeBSD 8.0-stable / amd64
Another day, another crash.
From /var/log/messages:
Feb 20 08:52:26 kg-f2 ntpd[58609]: time reset +1.169751 s
Feb 20 08:54:57 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f
Feb 20 08:54:57 kg-f2 kernel: ata5: hardware reset timeout
Feb 20 19:18:51 kg-f2 syslogd: kernel boot file is /boot/kernel/kernel

The drives are as follows:
root@kg-f2# atacontrol list;camcontrol devlist
ATA channel 0:
Master: no device present
Slave: no device present
ATA channel 2:
Master: ad4 <SAMSUNG HD252HJ/1AC01118> SATA revision 2.x
Slave: no device present
ATA channel 3:
Master: ad6 <SAMSUNG HD252HJ/1AC01118> SATA revision 2.x
Slave: no device present
ATA channel 4:
Master: ad8 <SAMSUNG HD103SJ/1AJ100E4> SATA revision 2.x
Slave: no device present
ATA channel 5:
Master: ad10 <SAMSUNG HD103SJ/1AJ100E4> SATA revision 2.x
Slave: no device present
ATA channel 6:
Master: ad12 <SAMSUNG HD103SJ/1AJ100E4> SATA revision 2.x
Slave: no device present
ATA channel 7:
Master: ad14 <SAMSUNG HD103SJ/1AJ100E4> SATA revision 2.x
Slave: no device present
<SAMSUNG HD103SJ 1AJ100E4> at scbus0 target 0 lun 0 (pass0,ada0)

Smartctl is happy, too:
root@kg-f2# smartctl -H /dev/ad4
smartctl 5.39 2009-12-09 r2995 [FreeBSD 8.0-STABLE amd64] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@kg-f2# smartctl -H /dev/ad6
smartctl 5.39 2009-12-09 r2995 [FreeBSD 8.0-STABLE amd64] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@kg-f2# smartctl -H /dev/ad8
smartctl 5.39 2009-12-09 r2995 [FreeBSD 8.0-STABLE amd64] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@kg-f2# smartctl -H /dev/ad10
smartctl 5.39 2009-12-09 r2995 [FreeBSD 8.0-STABLE amd64] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@kg-f2# smartctl -H /dev/ad12
smartctl 5.39 2009-12-09 r2995 [FreeBSD 8.0-STABLE amd64] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

root@kg-f2# smartctl -H /dev/ada0
smartctl 5.39 2009-12-09 r2995 [FreeBSD 8.0-STABLE amd64] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Maybe the hardware is just plain broken.
--
Torfinn
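The kernel messages name channels (ata5, ata6) while zfs names disks, so it helps to connect the two. Below is a rough sketch that turns atacontrol list output into a channel-to-disk map. The function name parse_atacontrol is made up for illustration, and the sample text is an excerpt of the listing in this message with one entry per line.

```python
import re


def parse_atacontrol(text):
    """Return {"ataN": "adM"} for every channel whose master is a disk."""
    mapping = {}
    channel = None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r"ATA channel (\d+):", line)
        if header:
            channel = "ata" + header.group(1)
            continue
        # "Master: no device present" does not match and is skipped.
        master = re.match(r"Master:\s+(ad\d+)\s", line)
        if master and channel:
            mapping[channel] = master.group(1)
    return mapping


# Excerpt of the atacontrol listing from this message.
listing = """\
ATA channel 0:
Master: no device present
Slave: no device present
ATA channel 2:
Master: ad4 <SAMSUNG HD252HJ/1AC01118> SATA revision 2.x
Slave: no device present
ATA channel 5:
Master: ad10 <SAMSUNG HD103SJ/1AJ100E4> SATA revision 2.x
Slave: no device present
ATA channel 6:
Master: ad12 <SAMSUNG HD103SJ/1AJ100E4> SATA revision 2.x
Slave: no device present
"""

channels = parse_atacontrol(listing)
for name in ("ata5", "ata6"):
    print(name, "->", channels[name])
```

By this listing, the channels that keep timing out (ata5 and ata6) carry ad10 and ad12, not the root mirror disks ad4 and ad6 on channels 2 and 3.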
Torfinn Ingolfsen
2010-Mar-06 13:20 UTC
panic - sleeping thread on FreeBSD 8.0-stable / amd64
Ok, a new development in this story.
Note that as of yet, I haven't changed SATA cables or done anything else
with the hardware. However, I did upgrade to latest FreeBSD
8.0-stable / amd64 yesterday.
The machine is still up (it hasn't crashed yet), and today I found this in
/var/log/messages:
Mar 6 06:25:34 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f
Mar 6 06:25:34 kg-f2 kernel: ata5: hardware reset timeout
Mar 6 06:25:45 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 0000007f
Mar 6 06:25:45 kg-f2 kernel: ata6: hardware reset timeout
Mar 6 06:25:45 kg-f2 root: ZFS: vdev failure, zpool=storage type=vdev.no_replicas
Mar 6 06:25:56 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 00000080
Mar 6 06:25:56 kg-f2 kernel: ata5: hardware reset timeout
Mar 6 06:26:06 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 0000007f
Mar 6 06:26:06 kg-f2 kernel: ata6: hardware reset timeout
Mar 6 06:26:08 kg-f2 root: ZFS: zpool I/O failure, zpool=storage error=28
Mar 6 06:26:08 kg-f2 last message repeated 2 times
Mar 6 06:26:08 kg-f2 root: ZFS: vdev I/O failure, zpool=storage path= offset= size= error
Mar 6 06:26:16 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f
Mar 6 06:26:16 kg-f2 kernel: ata5: hardware reset timeout
Mar 6 06:26:27 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 0000007f
Mar 6 06:26:27 kg-f2 kernel: ata6: hardware reset timeout
Mar 6 06:26:37 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 00000080
Mar 6 06:26:37 kg-f2 kernel: ata5: hardware reset timeout
Mar 6 06:26:47 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 0000007f
Mar 6 06:26:47 kg-f2 kernel: ata6: hardware reset timeout
Mar 6 06:26:58 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f
Mar 6 06:26:58 kg-f2 kernel: ata5: hardware reset timeout
Mar 6 06:27:08 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 00000080
Mar 6 06:27:08 kg-f2 kernel: ata6: hardware reset timeout
Before the upgrade, messages such as these would (AFAICT) result in a panic and
reboot.
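The timestamps in the log above have a regular rhythm: ata5 and ata6 alternate roughly every 10 seconds, which lines up with the 10000ms timeout in the message. A small sketch that measures the gap between successive timeouts on one channel; event_times and gaps_seconds are hypothetical helper names, and the year is passed in because syslog lines do not carry one.

```python
from datetime import datetime


def event_times(lines, channel, year=2010):
    """Timestamps of 'port is not ready' messages for one ATA channel."""
    needle = "kernel: %s: port is not ready" % channel
    times = []
    for line in lines:
        if needle in line:
            month, day, clock = line.split()[:3]
            stamp = "%d %s %s %s" % (year, month, day, clock)
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def gaps_seconds(times):
    """Seconds between each pair of consecutive timestamps."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


# Lines copied from the Mar 6 log.
sample = [
    "Mar 6 06:25:34 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f",
    "Mar 6 06:25:45 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 0000007f",
    "Mar 6 06:25:56 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 00000080",
    "Mar 6 06:26:06 kg-f2 kernel: ata6: port is not ready (timeout 10000ms) tfd = 0000007f",
    "Mar 6 06:26:16 kg-f2 kernel: ata5: port is not ready (timeout 10000ms) tfd = 0000007f",
]

ata5_gaps = gaps_seconds(event_times(sample, "ata5"))
print("ata5 gaps (s):", ata5_gaps)
```

Each channel comes around again about every 20 seconds, that is, after one timeout of its own plus one on the other channel. That pattern would fit the resets being attempted one channel at a time, though the log alone does not prove it.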
Uptime:
root@kg-f2# uptime
2:11PM up 19:38, 3 users, load averages: 0.00, 0.00, 0.00
The boot / root mirror pool is okay:
root@kg-f2# zpool status zroot
pool: zroot
state: ONLINE
scrub: scrub completed after 0h8m with 0 errors on Fri Mar 5 18:45:24 2010
config:
NAME           STATE     READ WRITE CKSUM
zroot          ONLINE       0     0     0
  mirror       ONLINE       0     0     0
    gpt/disk0  ONLINE       0     0     0
    gpt/disk1  ONLINE       0     0     0
errors: No known data errors
However, the storage pool is not:
root@kg-f2# zpool status storage
pool: storage
state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: http://www.sun.com/msg/ZFS-8000-HC
scrub: scrub completed after 0h0m with 0 errors on Fri Mar 5 18:36:17 2010
config:
NAME        STATE     READ WRITE CKSUM
storage     UNAVAIL      0     3     0  insufficient replicas
  raidz1    UNAVAIL      0     0     0  insufficient replicas
    ad8     ONLINE       0     0     0
    ad10    REMOVED      0     0     0
    ad12    REMOVED      0     0     0
    ad14    ONLINE       0     0     0
    ada0    ONLINE       0     0     0
errors: 2 data errors, use '-v' for a list
Currently, this pool isn't in use, so I am not concerned about data loss
(luckily).
Note that before this upgrade, with all the panics and reboots, both zfs pools have
always been clean and trouble-free after a reboot.
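Why the storage pool is UNAVAIL rather than merely DEGRADED: a raidz1 vdev survives the loss of one member, and here two are REMOVED. A minimal sketch of that check against the config lines from the zpool status output; unavailable_members is a made-up helper, and the member names are listed by hand rather than derived from the output.

```python
RAIDZ1_PARITY = 1  # a raidz1 vdev tolerates the loss of one member


def unavailable_members(config_lines, members):
    """Return the names in `members` whose STATE column is not ONLINE."""
    bad = []
    for line in config_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] in members and fields[1] != "ONLINE":
            bad.append(fields[0])
    return bad


# Config section of 'zpool status storage' as posted in this message.
config = """\
storage UNAVAIL 0 3 0 insufficient replicas
raidz1 UNAVAIL 0 0 0 insufficient replicas
ad8 ONLINE 0 0 0
ad10 REMOVED 0 0 0
ad12 REMOVED 0 0 0
ad14 ONLINE 0 0 0
ada0 ONLINE 0 0 0
""".splitlines()

disks = {"ad8", "ad10", "ad12", "ad14", "ada0"}
missing = unavailable_members(config, disks)
usable = len(missing) <= RAIDZ1_PARITY
print("missing:", missing, "pool usable:", usable)
```

With ad10 and ad12 both gone, the count exceeds the parity of the vdev, which matches the "insufficient replicas" note in the status output.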
atacontrol confirms that ad10 and ad12 are "gone" (i.e. disconnected):
root@kg-f2# atacontrol list
ATA channel 0:
Master: no device present
Slave: no device present
ATA channel 2:
Master: ad4 <SAMSUNG HD252HJ/1AC01118> SATA revision 2.x
Slave: no device present
ATA channel 3:
Master: ad6 <SAMSUNG HD252HJ/1AC01118> SATA revision 2.x
Slave: no device present
ATA channel 4:
Master: ad8 <SAMSUNG HD103SJ/1AJ100E4> SATA revision 2.x
Slave: no device present
ATA channel 5:
Master: no device present
Slave: no device present
ATA channel 6:
Master: no device present
Slave: no device present
ATA channel 7:
Master: ad14 <SAMSUNG HD103SJ/1AJ100E4> SATA revision 2.x
Slave: no device present
What happens if I just reboot the server now? I think that ad10 and ad12 will be
detected and connected, but what will zfs do with the 'storage' pool?
As always, more info (including verbose dmesgs etc.) on the FreeBSD page[1] for
this machine.
References:
1) FreeBSD on this machine:
http://sites.google.com/site/tingox/ga-ma74gm-s2h_freebsd
--
Torfinn