>>>>> "ph" == Peter Hawkins <peter.hawkins at
hos.horizon.ie> writes:
ph> Tried zpool replace. Unfortunately that takes me back into the
ph> cycle where as soon as the resilver starts the system hangs,
ph> not even CAPS Lock works. When I reset the system I have about
ph> a 10 second window to detach the device again to get the
ph> system back before it freezes.
I had problems like this with a firewire disk serving as a mirror
component on b44.
1) if half of the mirror went away unexpectedly without 'offline'ing it
first, zpool would later (after bringing the device back) show checksum
errors accumulating on the bounced part of the mirror for weeks after.
2) if I tried to fix this with 'zpool scrub', it would announce it
expected the scrub of ~200GB to take 7 hours, and immediately the
system was 1/10th speed or less for anything that used the disk, like a
web browser writing to cache, but xterm was still fine. After about an
hour the system stopped accessing the disk. It did not panic, and I
think the mouse pointer still moved, but windows couldn't be moved or
raised. I could never complete a scrub.
I think (2) might have had something to do with (1). My firewire case
had a bad Prolific chipset, and every two days or so the firewire case
crashed and needed to be rebooted. This is documented on the web as
happening with other operating systems, and does not happen under
Solaris with my Oxford 911 case. This problem is why (1) kept happening
to me. For (2), I bet the case was crashing during the scrub. I
replaced that firewire case with an iSCSI target (and continued using
another firewire case with an Oxford 911 chip in it), and (1) and (2)
both went away. Well, the system is still useless during a scrub, but
by "went away" I mean it doesn't lock up; it eventually finishes.
So, I would suggest exercising each of your devices. Maybe one of the
disks or cables is bad (and not necessarily the one you're replacing).
In my (highly anecdotal) experience, ZFS isn't robust to failures that
happen during a scrub, only to failures that happen outside of one.
Probably the actual situation is more tangled and not exactly what I
said.
I have two ways of "exercising" devices. One:

dd if=/dev/rdsk/cxtxdxs2 of=/dev/null bs=$(( 56 * 1024 ))   (SMI label)
dd if=/dev/rdsk/cxtxdx of=/dev/null bs=$(( 56 * 1024 ))     (EFI label)
This tests the disks, controllers, and cables. It should be safe to do
on a running system, and probably won't slow down your system as much
as a ZFS scrub. Watch for I/O errors reported by 'dd', and for more
detailed errors in dmesg.
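
If you want a second opinion while dd is running, the Solaris
per-device error counters are worth watching too. A rough sketch (the
device name here is just a placeholder for whatever 'format' shows
you):

iostat -En c1t0d0     (cumulative Soft/Hard/Transport error counts for
                       that disk; run it before and after the dd pass
                       and see whether anything moved)
iostat -xn 5          (live service times; a struggling disk tends to
                       show asvc_t in the hundreds of milliseconds)

In my anecdotal experience the transport-error counter is good at
fingering a bad cable rather than a bad disk, but take that with the
usual grain of salt.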
Another:
smartctl -t long /dev/rdsk/...
smartctl -a /dev/rdsk/...     (check that the test is running, has an
                               "in progress" row in the self-test log
                               at the bottom, and how long it should
                               take)
smartctl -a /dev/rdsk/...     (check that it does not say "aborted by
                               host command". the test is supposed to
                               run in the background, but with some old
                               or dumb disks, it doesn't)
[wait several hours]
smartctl -a /dev/rdsk/...     (check that a new row has appeared in the
                               self-test log at the bottom, and that it
                               says "Extended offline, Completed
                               without error")
This will test that the disk is good, regardless of the
controller/driver/cable. The two tests together can help isolate a
problem once you know that you have one. The '-t long' also should not
harm performance as much as a zpool scrub; the disks are supposed to be
smart about it.
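
As a small aside, if your smartmontools is recent enough you can also
pull just the logs instead of scrolling through the whole -a output; a
sketch, same placeholder device path as above:

smartctl -l selftest /dev/rdsk/...   (only the self-test log table)
smartctl -l error /dev/rdsk/...      (the drive's own error log)

Whether an old or dumb disk actually records anything useful in those
logs is, as above, another matter.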
I'm not sure what to do to test your last disk. You could try to
'zpool offline' the UNAVAIL disk and see if that stops ZFS from trying
to open it, but this hasn't worked perfectly for me. You could test the
disk in another machine, but that doesn't test the
driver/controller/cable. If you can't work anything else out, you can
boot your system with 'boot -m milestone=none' to prevent ZFS from
coming up, and do your testing there. This is what I have to do to
remove iSCSI targets for which ZFS is 'patiently waiting' forever.
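
In case it helps, what that looks like for me, roughly (check the
syntax against your own boot setup before relying on it):

boot -m milestone=none       (at the SPARC ok prompt; on x86 append
                              '-m milestone=none' to the kernel line
                              in GRUB instead)
[log in as root on the console, do the dd/smartctl testing]
svcadm milestone all         (bring the rest of the system up without
                              rebooting)

Nothing ZFS- or iSCSI-related starts until you raise the milestone, so
you are not fighting the 'patiently waiting' retries while you test.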
For problem (1) mentioned way at the top of my mail, I found I could
avoid the checksum errors by 'zpool offline'ing iSCSI and firewire
targets before I take them away. When I bring them back online in that
case, the brief resilver DOES do enough to avoid checksum errors
accumulating later. I would say the (1) problem is probably in Linux
IET rather than ZFS, because I'm highly suspicious of Linux developers'
ability to understand synchronize-cache or write barriers or anything
of that sort, except that (1) happened with firewire too, so I think it
is a real ZFS problem. I don't really find this acceptable, but at
least it's repeatable, so it's possible for me to do maintenance
without suffering a day of unavailability for scrubbing.
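
For the record, the sequence that avoids the checksum errors for me
looks roughly like this; the pool and device names are made up:

zpool offline tank c2t1d0     (tell ZFS the device is going away)
[power off / unplug / reboot the firewire or iSCSI box]
zpool online tank c2t1d0      (the short resilver catches it up)
zpool status -v tank          (confirm the CKSUM column stays at zero)

The important part is doing the offline BEFORE yanking the device;
skipping that step is what leaves the mirror accumulating checksum
errors for weeks afterward.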
However, I still have some problems, because ZFS seems to like to bring
iSCSI targets online all by itself when they reappear. That is good
sometimes, but it does this even if I've offlined them manually, which
I found surprising because the documentation makes it sound like
marking something offline is supposed to survive even reboots. And
because of some crappiness with the Linux iSCSI target, I sometimes
have to restart the target and add/remove the Solaris initiator's
"discovery address" to get the connection to come back up. The first
time the connection comes up, ZFS onlines the target, and then when I
later bounce it I run into problem (1) and get checksum errors later.
So I still have to do scrubs, which for 1TB with my slowass setup can
take more than a day of too-slow-to-play-video.
I have not tested it, though, with bouncing iSCSI targets _during_ the
scrub. I can try it if someone cares.