Ray Van Dolson
2010-Oct-19 21:12 UTC
[zfs-discuss] NFS/SATA lockups (svc_cots_kdup no slots free & sata port time out)
I have a Solaris 10 U8 box (142901-14) running as an NFS server with a 23 disk zpool behind it (three RAIDZ2 vdevs).

We have a single Intel X-25E SSD operating as an slog ZIL device attached to a SATA port on this machine's motherboard. The rest of the drives are in a hot-swap enclosure.

Infrequently (maybe once every 4-6 weeks), the zpool on the box stops responding and although we can still SSH in and manage the server, there appears to be no way to get the zpool to function again until we hard reset. shutdown -i6 -g0 -y simply hangs forever trying to call 'sync'.

The logs show the following:

Oct 19 11:42:42 dev-zfs1 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free
Oct 19 11:42:50 dev-zfs1 last message repeated 189 times
Oct 19 11:42:51 dev-zfs1 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free
Oct 19 11:42:55 dev-zfs1 last message repeated 99 times
Oct 19 11:42:56 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe840f453b68 timed out
Oct 19 11:42:56 dev-zfs1 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free
Oct 19 11:44:00 dev-zfs1 last message repeated 1128 times
Oct 19 11:44:01 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe83dffad0e8 timed out
Oct 19 11:44:02 dev-zfs1 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free
Oct 19 11:45:05 dev-zfs1 last message repeated 1108 times
Oct 19 11:45:06 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xffffffffbe00a008 timed out
Oct 19 11:45:06 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xffffffffac7bc7e8 timed out
Oct 19 11:45:06 dev-zfs1 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free
Oct 19 11:46:10 dev-zfs1 last message repeated 1091 times
Oct 19 11:46:11 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xffffffffb9438008 timed out
Oct 19 11:47:16 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xffffffffb03452a8 timed out
Oct 19 11:48:21 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe83dfa5cd20 timed out
Oct 19 11:49:26 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xffffffffb6eaf2a0 timed out
Oct 19 11:50:31 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe83dfa5c380 timed out
Oct 19 11:51:36 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe83ca418b68 timed out
Oct 19 11:52:41 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe83fff758c0 timed out
Oct 19 11:53:46 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xffffffffb1144548 timed out
Oct 19 11:54:51 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe83dffad9a8 timed out
Oct 19 11:55:56 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe83e8cd18c0 timed out
Oct 19 11:57:01 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe83c43659a8 timed out
Oct 19 11:58:06 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xffffffffb9136468 timed out
Oct 19 11:59:11 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe83e9f147e0 timed out
Oct 19 12:00:16 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xffffffffb1be7d20 timed out
Oct 19 12:01:21 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe83dfa5fee0 timed out
Oct 19 12:02:26 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xffffffffbe6f7e08 timed out
Oct 19 12:03:31 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xffffffffb903c380 timed out
Oct 19 12:04:36 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe83eee6f8c8 timed out
Oct 19 12:05:41 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xffffffffb04b7000 timed out
Oct 19 12:06:46 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe83fff7dd28 timed out
Oct 19 12:07:51 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xffffffffb94389a8 timed out
Oct 19 12:08:56 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xffffffffae0ff388 timed out
Oct 19 12:10:01 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe84158032a8 timed out
Oct 19 12:11:06 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe83f07f7e00 timed out
Oct 19 12:11:25 dev-zfs1 power: [ID 199196 kern.notice] NOTICE: Power Button pressed 2 times, cancelling all requests

I'm not sure if these are related. The first one appears to be an NFS error that we can potentially address by changing a setting in /etc/system per this[1] thread.

However, the second issue is less clear. I came across a bug report[2] which appears to be similar, but there is no fix. The patches mentioned are older and likely have already been applied on our machine (though we could update to Solaris 10 U9).

Anyone have any thoughts on what could be causing the above? Am I right to think one issue could lead to the other? Could heavy writes onto the slog device be triggering an issue within the ahci driver?

Looking for some ideas... fmdump shows a clear fault log...

Thanks,
Ray

[1] http://mail.opensolaris.org/pipermail/nfs-discuss/2006-June/000238.html
[2] http://bugs.opensolaris.org/view_bug.do?bug_id=6728187
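For anyone comparing their own messages file against the one above, here is a rough sketch of how the spacing of the ahci watchdog timeouts could be measured. It is only an illustration: the function name ahci_timeout_gaps and the year argument are made up for this example (syslog timestamps carry no year), and the pattern only covers the "watchdog port N satapkt ... timed out" wording seen in this log.

```python
import re
from datetime import datetime

# Matches the ahci watchdog lines quoted above, e.g.
# "Oct 19 11:42:56 dev-zfs1 ahci: [ID ...] WARNING: ahci0: watchdog port 0 satapkt 0x... timed out"
AHCI_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ ahci: .*"
    r"watchdog port (\d+) satapkt (0x[0-9a-f]+) timed out"
)

def ahci_timeout_gaps(lines, year=2010):
    """Return the gaps, in seconds, between consecutive ahci watchdog timeouts.

    `year` is an assumption supplied by the caller, because syslog
    timestamps do not include one.
    """
    stamps = []
    for line in lines:
        m = AHCI_RE.match(line)
        if m:
            stamps.append(
                datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            )
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]

# Three lines copied from the log in this message.
sample = [
    "Oct 19 11:42:56 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe840f453b68 timed out",
    "Oct 19 11:44:01 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xfffffe83dffad0e8 timed out",
    "Oct 19 11:45:06 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: watchdog port 0 satapkt 0xffffffffbe00a008 timed out",
]
print(ahci_timeout_gaps(sample))  # prints [65, 65]
```

On the excerpt above the timeouts land 65 seconds apart, which suggests a fixed retry or watchdog cycle rather than bursts tied to load, though that is a reading of the timestamps and not something the driver documentation has been checked for.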