Nothing like a "heart in mouth moment" to shave tears from your life. I rebooted a snv_132 box in perfect heath, and it came back up with two FAULTED disks in the same vdisk group. Everything an hour on Google I found basically said "your data is gone". All 45Tb of it. A postmortem of fmadm showed a single disk failed with smart predictive failure. No indication why the second failed. I don''t give up easily, and it is now back up and scrubbing - no errors so far. I checked both the drives were readable, so it didn''t seem to be a hardware fault. I moved one into a different server and ran a zpool import to see what it made of it. The disk was ONLINE, and it''s vdisk buddies were unavailable. Ok, so I moved the disks into different bays and booted from the snv_134 cdrom. Ran zpool import and the zpool came back with everything online. That was encouraging, so I exported it and booted from the origional 132 boot drive. Well, it came back, and at 1:00AM I was able to get back to the origional issue I was chasing. So, don''t give up hope when all hope appears to be lost. Mark. Still an Open_Solaris fan keen to help the community achieve a 2010 release on it''s own. -- This message posted from opensolaris.org
Hi Mark,

I would recheck with fmdump to see if you have any persistent errors on the second disk. The fmdump command displays diagnosed faults, and fmdump -eV displays the underlying error reports (the persistent errors that get turned into faults once they meet certain criteria).

If fmdump -eV doesn't show any activity for that second disk, then review /var/adm/messages or iostat -En for driver-level resets and so on.

Thanks,

Cindy

On 08/16/10 18:53, Mark Bennett wrote:
> Nothing like a "heart in mouth" moment to shave years from your life.
> I rebooted a snv_132 box in perfect health, and it came back up with two FAULTED disks in the same vdev group.
> [snip]
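A minimal way to run the checks Cindy describes; the grep pattern is illustrative and the UUID placeholder has to be filled in from fmdump's own output:

  fmdump                                  # diagnosed faults, one line per fault UUID
  fmdump -v -u <uuid>                     # detail for one fault (substitute a UUID from the line above)
  fmdump -eV | less                       # raw error reports; search for the second disk's devid
  iostat -En                              # per-device soft/hard/transport error counters
  egrep -i "reset|timeout|retryable" /var/adm/messages   # driver-level resets and retries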
Hi Cindy,

Not very enlightening. No previous errors for the disks. I did replace one about a month earlier when it showed a rise in I/O errors, and before it reached a level where fault management would have failed it.

The disk mentioned is not one of those that went FAULTED. Also, no more SMART error events since. The ZFS pool faulted on boot after a reboot command.

The scrub was eventually stopped at 75% due to the performance impact. No errors were found up to that point.

One thing I see from the (attached) messages log is that the ZFS error occurs before all the disks have been logged as enumerated. This was probably the first reboot since at least 8, and maybe 16, extra disks were hot-plugged and added to the pool.

The hardware is a Supermicro 3U plus 2 x 4U SAS storage chassis. The SAS controller has 16 disks on one SAS port, and 32 on the other.

Aug 16 18:44:39.2154 02f57499-ae0a-c46c-b8f8-825205a8505d ZFS-8000-D3
  100%  fault.fs.zfs.device
        Problem in: zfs://pool=drgvault/vdev=d79c5fc5b5c3b789
           Affects: zfs://pool=drgvault/vdev=d79c5fc5b5c3b789
               FRU: -
          Location: -

Aug 16 18:44:39.5569 25e0bdc2-0171-c4b5-b530-a268f8572bd1 ZFS-8000-D3
  100%  fault.fs.zfs.device
        Problem in: zfs://pool=drgvault/vdev=e912d259d7829903
           Affects: zfs://pool=drgvault/vdev=e912d259d7829903
               FRU: -
          Location: -

Aug 16 18:44:39.8964 8e9cff35-8e9d-c0f1-cd5b-bd1d0276cda1 ZFS-8000-CS
  100%  fault.fs.zfs.pool
        Problem in: zfs://pool=drgvault
           Affects: zfs://pool=drgvault
               FRU: -
          Location: -

Aug 16 18:45:47.2604 3848ba46-ee18-4aad-b632-9baf25b532ea DISK-8000-0X
  100%  fault.io.disk.predictive-failure
        Problem in: hc://:product-id=LSILOGIC-SASX36-A.1:server-id=:chassis-id=50030480005a337f:serial=6XW15V2S:part=ST32000542AS-ST32000542AS:revision=CC34/ses-enclosure=1/bay=6/disk=0
           Affects: dev:///:devid=id1,sd@n5000c50021f4916f//pci@0,0/pci8086,4023@3/pci15d9,a680@0/sd@24,0
               FRU: hc://:product-id=LSILOGIC-SASX36-A.1:server-id=:chassis-id=50030480005a337f:serial=6XW15V2S:part=ST32000542AS-ST32000542AS:revision=CC34/ses-enclosure=1/bay=6/disk=0
          Location: 006

Mark.

[Attachment: messages.zip, 9324 bytes - http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100818/58907440/attachment.obj]
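For reference, checking on and cancelling an in-progress scrub looks roughly like this (pool name taken from the fmdump output above):

  zpool status -v drgvault     # shows scrub progress ("75.0% done") and any errors found so far
  zpool scrub -s drgvault      # stop the scrub; on these releases it cannot be resumed, only restarted from zero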
It's hard to tell what caused the SMART predictive-failure message; it could be something as simple as a temperature fluctuation. If ZFS noticed that a disk wasn't available yet, then I would expect a message to that effect. In any case, I think I would have a replacement disk available.

The important thing is that you continue to monitor your hardware for failures. We recommend using ZFS redundancy and always having backups of your data.

Thanks,

Cindy

On 08/18/10 02:38, Mark Bennett wrote:
> Hi Cindy,
>
> Not very enlightening. No previous errors for the disks.
> [snip]
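A sketch of the kind of ongoing monitoring and spare provisioning suggested here; the spare device name is purely illustrative, not from this system:

  zpool status -x                      # quick health check: "all pools are healthy" or details of any problem pool
  fmadm faulty                         # any outstanding diagnosed faults
  zpool add drgvault spare c5t30d0     # keep a hot spare attached (illustrative device name)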
Well, I do have a plan.

Thanks to the portability of ZFS boot disks, I'll build two new OS disks on another machine with the next Nexenta release, export the data pool, and swap in the new ones. That way, I can at least manage a zfs scrub without killing performance, and get the Intel SSDs I have been testing to work properly.

On the other hand, I could just use the spare 7210 Appliance boot disk I have lying about.

Mark.
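A rough sketch of that hand-off, with one caveat that isn't part of the plan described above: upgrading the pool version on the newer release would prevent importing the pool back on snv_132.

  zpool export drgvault        # on the current snv_132 boot environment

  # after booting the new OS disks built on the other machine
  zpool import drgvault
  zpool upgrade -v             # list the on-disk pool versions the new release supports
  # note: "zpool upgrade drgvault" would make the pool unimportable on older releases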