Shannon Roddy
2006-Aug-28 20:01 UTC
[zfs-discuss] Sol 10 x86_64 intermittent SATA device locks up server
Hello All,

I have two SATA cards with 5 drives each in one zfs pool. One of the devices has been failing intermittently, and the entire box seems to lock up on occasion when this happens. I currently have the SATA cable to that device disconnected in the hope that the box will at least stay up for now. This is a new build that I am "burning in" in the hope that it will serve as some NFS space for our Solaris boxen.

Below is the output from "zpool status -vx":

bash-3.00# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz     ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
          raidz     DEGRADED     0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  UNAVAIL     42    63     0  cannot open

errors: No known data errors

And below is some info from /var/adm/messages:

Aug 29 12:42:08 localhost marvell88sx: [ID 812917 kern.warning] WARNING: marvell88sx1: error on port 5:
Aug 29 12:42:08 localhost marvell88sx: [ID 702911 kern.notice] SError interrupt
Aug 29 12:42:08 localhost marvell88sx: [ID 702911 kern.notice] link data receive error - crc
Aug 29 12:42:08 localhost marvell88sx: [ID 702911 kern.notice] link data receive error - state
Aug 29 12:42:08 localhost marvell88sx: [ID 812917 kern.warning] WARNING: marvell88sx1: error on port 5:
Aug 29 12:42:08 localhost marvell88sx: [ID 702911 kern.notice] device error
Aug 29 12:42:08 localhost marvell88sx: [ID 702911 kern.notice] SError interrupt
Aug 29 12:42:08 localhost marvell88sx: [ID 702911 kern.notice] EDMA self disabled
Aug 29 12:43:08 localhost marvell88sx: [ID 812917 kern.warning] WARNING: marvell88sx1: error on port 5:
Aug 29 12:43:08 localhost marvell88sx: [ID 702911 kern.notice] device disconnected
Aug 29 12:43:08 localhost marvell88sx: [ID 702911 kern.notice] device connected
Aug 29 12:43:08 localhost marvell88sx: [ID 702911 kern.notice] SError interrupt
Aug 29 12:43:10 localhost marvell88sx: [ID 812917 kern.warning] WARNING: marvell88sx1: error on port 5:
Aug 29 12:43:10 localhost marvell88sx: [ID 702911 kern.notice] SError interrupt
Aug 29 12:43:10 localhost marvell88sx: [ID 702911 kern.notice] link data receive error - crc
Aug 29 12:43:10 localhost marvell88sx: [ID 702911 kern.notice] link data receive error - state
Aug 29 12:43:11 localhost marvell88sx: [ID 812917 kern.warning] WARNING: marvell88sx1: error on port 5:
Aug 29 12:43:11 localhost marvell88sx: [ID 702911 kern.notice] device error
Aug 29 12:43:11 localhost marvell88sx: [ID 702911 kern.notice] SError interrupt
Aug 29 12:43:11 localhost marvell88sx: [ID 702911 kern.notice] EDMA self disabled
Aug 29 12:44:10 localhost marvell88sx: [ID 812917 kern.warning] WARNING: marvell88sx1: error on port 5:
Aug 29 12:44:10 localhost marvell88sx: [ID 702911 kern.notice] device disconnected
Aug 29 12:44:10 localhost marvell88sx: [ID 702911 kern.notice] device connected
Aug 29 12:44:10 localhost marvell88sx: [ID 702911 kern.notice] SError interrupt

My question is, shouldn't it be possible for Solaris to stay up even with an intermittent drive error? I have a replacement drive and cable on order to see if that fixes the problem.

Thanks!
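
P.S. For summarizing a longer excerpt of /var/adm/messages, here is a minimal sketch that tallies the driver messages per controller and port. It assumes the marvell88sx lines keep the shape seen in the log above (a kern.warning header naming the controller and port, followed by kern.notice detail lines). The function name tally_port_errors and both regular expressions are my own illustration, not part of any Solaris tool.

```python
import re
from collections import Counter

# Header line, e.g. "... WARNING: marvell88sx1: error on port 5:"
HEADER = re.compile(r"WARNING: (marvell88sx\d+): error on port (\d+):")
# Detail line, e.g. "... marvell88sx: [ID 702911 kern.notice] device error"
DETAIL = re.compile(r"marvell88sx: \[ID \d+ kern\.notice\]\s+(.+?)\s*$")


def tally_port_errors(lines):
    """Count detail messages keyed by (controller, port, message).

    Assumption: each detail line belongs to the most recent header line.
    """
    counts = Counter()
    current = None  # (controller, port) from the last header seen
    for line in lines:
        header = HEADER.search(line)
        if header:
            current = (header.group(1), int(header.group(2)))
            continue
        detail = DETAIL.search(line)
        if detail and current is not None:
            counts[(current[0], current[1], detail.group(1))] += 1
    return counts


# A few lines copied from the log excerpt in this post.
SAMPLE = """\
Aug 29 12:42:08 localhost marvell88sx: [ID 812917 kern.warning] WARNING: marvell88sx1: error on port 5:
Aug 29 12:42:08 localhost marvell88sx: [ID 702911 kern.notice] SError interrupt
Aug 29 12:42:08 localhost marvell88sx: [ID 702911 kern.notice] link data receive error - crc
Aug 29 12:42:08 localhost marvell88sx: [ID 702911 kern.notice] link data receive error - state
""".splitlines()

if __name__ == "__main__":
    for (controller, port, message), n in sorted(tally_port_errors(SAMPLE).items()):
        print(f"{controller} port {port}: {message} x{n}")
```

To run it against the real log, pass an open file object instead of SAMPLE, since the function only needs an iterable of lines.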