On Mon, Mar 8, 2010 at 2:00 PM, Chris Dunbar <cdunbar at earthside.net>
wrote:
> Hello,
>
> I just found this list and am very excited that you all are here! I have a
> homemade ZFS server that serves as our poor man's Thumper (we named it
> thumpthis) and provides primarily NFS shares for our VMware environment. As
> is often the case, the server has developed a hardware problem mere days
> before I am ready to go live with a new replacement server (thumpthat). At
> first the problem appeared to be a bad drive, but now I am not so sure. I
> would like to sanity check my thought process with this list and see if
> anybody has some different ideas. Here is a quick timeline of the trouble:
>
> 1. I noticed the following when running a routine zpool status:
>
> <snip>
>   mirror    DEGRADED     0     0     0
>     c3t2d0  ONLINE       0     0     0
>     c3t3d0  REMOVED      0  368K     0
> </snip>
>
> 2. I determined which drive appeared to be offline by watching drive lights
> and then rebooted the server.
>
> 3. Initially the drive appeared to be fine and ZFS picked it back up and
> resilvered the mirror. About 30 minutes later I noticed that the same drive
> was again marked REMOVED.
>
> 4. I shut the server down and replaced the drive with a new, larger disk.
>
> 5. I ran zpool replace tank c3t3d0 and it happily went to work on the
> replacement drive. A few hours later the resilver was complete and all
> seemed well.
>
> 6. The next day, about 12 hours after installing the new drive I found the
> same error message (here's the whole pool):
>
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        DEGRADED     0     0     0
>           mirror    ONLINE       0     0     0
>             c3t0d0  ONLINE       0     0     0
>             c3t1d0  ONLINE       0     0     0
>           mirror    DEGRADED     0     0     0
>             c3t2d0  ONLINE       0     0     0
>             c3t3d0  REMOVED      0  370K     0
>           mirror    ONLINE       0     0     0
>             c4t0d0  ONLINE       0     0     0
>             c4t1d0  ONLINE       0     0     0
>           mirror    ONLINE       0     0     0
>             c4t2d0  ONLINE       0     0     0
>             c4t3d0  ONLINE       0     0     0
>
> errors: No known data errors
>
> This is where I am now. Either my new hard drive is bad (not impossible) or
> I am looking at some other hardware failure, possibly the AOC-SAT2-MV8
> controller card. I have a spare controller card (same make and model
> purchased at the same time we built the server) and plan to replace that
> tonight. Does that seem like the correct course of action? Are there any
> steps I can take beforehand to zero in on the problem? Any words of
> encouragement or wisdom?
>
What does `iostat -En` say?
My suggestion is to replace the cable that's connecting the c3t3d0 disk.
IMHO, the cable is much more likely to be faulty than a single port on the
disk controller.
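
If it helps to keep an eye on the pool while swapping parts, a listing like the one quoted above can be checked mechanically for disks that are not ONLINE or that carry non-zero error counters. The following is only a minimal sketch: the function names `to_count` and `suspect_devices` are made up for illustration, the sample text is a shortened copy of the listing above, and treating the `K` suffix as a factor of 1000 is an assumption, so the resulting counts are approximate.

```python
import re

# Shortened copy of the pool listing quoted above (two of the four mirrors).
SAMPLE = """\
NAME        STATE     READ WRITE CKSUM
tank        DEGRADED     0     0     0
  mirror    ONLINE       0     0     0
    c3t0d0  ONLINE       0     0     0
    c3t1d0  ONLINE       0     0     0
  mirror    DEGRADED     0     0     0
    c3t2d0  ONLINE       0     0     0
    c3t3d0  REMOVED      0  370K     0
"""

# Assumption: suffixes are treated as decimal multipliers, so counts are approximate.
SUFFIX = {"": 1, "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def to_count(field):
    """Convert a counter such as '0' or '370K' to an approximate integer."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]?)", field)
    if match is None:
        raise ValueError("unrecognised counter: %r" % field)
    return int(float(match.group(1)) * SUFFIX[match.group(2)])


def suspect_devices(status_text):
    """Return (name, state, read, write, cksum) for disks that look unhealthy."""
    suspects = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # not a NAME/STATE/READ/WRITE/CKSUM row
        name, state = fields[0], fields[1]
        # Only look at leaf disks named like c3t3d0; skip the pool, mirrors, header.
        if not re.fullmatch(r"c\d+t\d+d\d+(s\d+)?", name):
            continue
        read, write, cksum = (to_count(f) for f in fields[2:])
        if state != "ONLINE" or read or write or cksum:
            suspects.append((name, state, read, write, cksum))
    return suspects


if __name__ == "__main__":
    for device in suspect_devices(SAMPLE):
        print(device)
```

On a live system the text would come from the output of `zpool status tank` rather than from the embedded sample. A write counter that keeps climbing on the same device name after the disk itself has been replaced points at something between the disk and the controller, which is why the cable is the first thing to swap.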
--
Giovanni Tirloni
sysdroid.com