>>>>> "ph" == Peter Hawkins <peter.hawkins at
hos.horizon.ie> writes:
ph> Tried zpool replace. Unfortunately that takes me back into the
ph> cycle where as soon as the resilver starts the system hangs,
ph> not even CAPS Lock works. When I reset the system I have about
ph> a 10 second window to detach the device again to get the
ph> system back before it freezes.
I had problems like this with a firewire disk serving as a mirror
component on b44.
1) if half of the mirror went away unexpectedly without 'offline'ing it
first, zpool would later (after bringing the device back) show checksum
errors accumulating on the bounced part of the mirror for weeks after.
2) if I tried to fix this with 'zpool scrub', it would announce it
expected the scrub of ~200GB to take 7 hours, and immediately the
system was 1/10th speed or less for anything that used the disk, like a
web browser writing to cache, but xterm was still fine. After about an
hour the system stopped accessing the disk. It did not panic, and I
think the mouse pointer still moved, but windows couldn't be moved or
raised. I could never complete a scrub.
I think (2) might have had something to do with (1). My firewire case
had a bad Prolific chipset, and every two days or so the firewire case
crashed and needed to be rebooted. This is documented on the web as
happening with other operating systems, and does not happen under
Solaris with my Oxford 911 case. This problem is why (1) kept happening
to me. For (2), I bet the case was crashing during the scrub. I
replaced that firewire case with an iSCSI target (and continued using
another firewire case with an Oxford 911 chip in it), and (1) and (2)
both went away. Well, the system is still useless during a scrub, but
by "went away" I mean it doesn't lock up; it eventually finishes.
So, I would suggest exercising each of your devices. Maybe one of the
disks or cables is bad (and not necessarily the one you're replacing).
In my (highly anecdotal) experience, ZFS isn't robust to failures that
happen during a scrub, only to failures that happen outside of one.
Probably the actual situation is more tangled and not exactly what I
said.
I have two ways of "exercising" devices. One:

dd if=/dev/rdsk/cxtxdxs2 of=/dev/null bs=$(( 56 * 1024 ))   (SMI label)
dd if=/dev/rdsk/cxtxdx of=/dev/null bs=$(( 56 * 1024 ))     (EFI label)
This tests the disks, controllers, and cables. It should be safe to do
on a running system, and probably won't slow down your system as much
as a ZFS scrub. Watch for I/O errors reported by 'dd', and for more
detailed errors in dmesg.
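
If you want a second opinion while dd is running, the Solaris
per-device error counters are worth watching too. A rough sketch (the
device name here is just a placeholder for whatever 'format' shows
you):

iostat -En c1t0d0     (cumulative Soft/Hard/Transport error counts for
                       that disk; run it before and after the dd pass
                       and see whether anything moved)
iostat -xn 5          (live service times; a struggling disk tends to
                       show asvc_t in the hundreds of milliseconds)

In my anecdotal experience the transport-error counter is good at
fingering a bad cable rather than a bad disk, but take that with the
usual grain of salt.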
Another:
smartctl -t long /dev/rdsk/...
smartctl -a /dev/rdsk/...     (check that the test is running, has an
                               "in progress" row in the self-test log
                               at the bottom, and how long it should
                               take)
smartctl -a /dev/rdsk/...     (check that it does not say "aborted by
                               host command". the test is supposed to
                               run in the background, but with some old
                               or dumb disks, it doesn't)
[wait several hours]
smartctl -a /dev/rdsk/...     (check that a new row has appeared in the
                               self-test log at the bottom, and that it
                               says "Extended offline, Completed
                               without error")
This will test that the disk is good, regardless of the
controller/driver/cable. The two tests together can help isolate a
problem once you know that you have one. The '-t long' also should not
harm performance as much as a zpool scrub; the disks are supposed to be
smart about it.
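
As a small aside, if your smartmontools is recent enough you can also
pull just the logs instead of scrolling through the whole -a output; a
sketch, same placeholder device path as above:

smartctl -l selftest /dev/rdsk/...   (only the self-test log table)
smartctl -l error /dev/rdsk/...      (the drive's own error log)

Whether an old or dumb disk actually records anything useful in those
logs is, as above, another matter.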
I'm not sure what to do to test your last disk. You could try to
'zpool offline' the UNAVAIL disk and see if that stops ZFS from trying
to open it, but this hasn't worked perfectly for me. You could test the
disk in another machine, but that doesn't test the
driver/controller/cable. If you can't work anything else out, you can
boot your system with 'boot -m milestone=none' to prevent ZFS from
coming up, and do your testing there. This is what I have to do to
remove iSCSI targets for which ZFS is 'patiently waiting' forever.
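
In case it helps, what that looks like for me, roughly (check the
syntax against your own boot setup before relying on it):

boot -m milestone=none       (at the SPARC ok prompt; on x86 append
                              '-m milestone=none' to the kernel line
                              in GRUB instead)
[log in as root on the console, do the dd/smartctl testing]
svcadm milestone all         (bring the rest of the system up without
                              rebooting)

Nothing ZFS- or iSCSI-related starts until you raise the milestone, so
you are not fighting the 'patiently waiting' retries while you test.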
For problem (1) mentioned way at the top of my mail, I found I could
avoid the checksum errors by 'zpool offline'ing iSCSI and firewire
targets before I take them away. When I bring them back online in that
case, the brief resilver DOES do enough to avoid checksum errors
accumulating later. I would say the (1) problem is probably in Linux
IET rather than ZFS, because I'm highly suspicious of Linux developers'
ability to understand synchronize-cache or write barriers or anything
of that sort, except that (1) happened with firewire too, so I think it
is a real ZFS problem. I don't really find this acceptable, but at
least it's repeatable, so it's possible for me to do maintenance
without suffering a day of unavailability for scrubbing.
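
For the record, the sequence that avoids the checksum errors for me
looks roughly like this; the pool and device names are made up:

zpool offline tank c2t1d0     (tell ZFS the device is going away)
[power off / unplug / reboot the firewire or iSCSI box]
zpool online tank c2t1d0      (the short resilver catches it up)
zpool status -v tank          (confirm the CKSUM column stays at zero)

The important part is doing the offline BEFORE yanking the device;
skipping that step is what leaves the mirror accumulating checksum
errors for weeks afterward.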
However, I still have some problems, because ZFS seems to like to bring
iSCSI targets online all by itself when they reappear. That is good
sometimes, but it does this even if I've offlined them manually, which
I found surprising because the documentation makes it sound like
marking something offline is supposed to survive even reboots. And
because of some crappiness with the Linux iSCSI target, I sometimes
have to restart the target and add/remove the Solaris initiator's
"discovery address" to get the connection to come back up. The first
time the connection comes up, ZFS onlines the target, and then when I
later bounce it I run into problem (1) and get checksum errors later.
So I still have to do scrubs, which for 1TB with my slowass setup can
take more than a day of too-slow-to-play-video.
I have not tested it, though, with bouncing iSCSI targets _during_ the
scrub. I can try it if someone cares.