On Thu, Jan 21, 2010 at 03:55:59PM +0100, Matthias Appel wrote:
> I have a serious issue with my zpool.
Yes. You need to figure out what the root cause of the issue is.
> My zpool consists of 4 vdevs which are assembled into 2 mirrors.
>
> One of these mirrors got degraded because of too many errors on each
> vdev of the mirror.
> Yes, both vdevs of the mirror got degraded.
Yes, but note they're still active, just in a degraded state. You
should be able to read the data, with care and with some luck that the
errors don't align to take out both copies of a particular chunk of
data.
> According to Murphy's law I don't have a backup either (I have a
> backup, which was made several months ago, and some backups spread
> across several disks)
> Neither of these backups is ideal, so I want to access my data on
> the zpool so I can make a fresh backup and replace the OpenSolaris
> server.
If you have them, you may be able to combine them with the current
data in the pool to reconstruct what you need.
> As the two faulted vdevs are connected to different controllers, I
> assume that the problem is located on the server, not on the hard
> disks/controllers. One of the faulted hard disks was replaced some
> weeks ago due to CRC errors, so I assume the server is bad, not the
> disks/cables/controllers.
Yeah, especially since these are cksum errors. Prime suspects would
be bad memory and an inadequate power supply; bad motherboard or PCI
bus, dust, temperature and others follow after that.
> My state is as follows:
>
>         NAME          STATE     READ WRITE CKSUM
>         performance   DEGRADED     0     0     8
>           mirror      DEGRADED     0     0    16
>             c1t1d0    DEGRADED     0     0    23  too many errors
>             c2d0      DEGRADED     0     0    24  too many errors
>           mirror      ONLINE       0     0     0
>             c1t2d0    ONLINE       0     0     7
>             c3d0      ONLINE       0     0     7
Note that you have cksum errors on both mirror vdevs, even though the
second one has not yet been marked degraded. It's curious that the
mirror pairs are showing basically the same number of errors; this
suggests that the corruption may have occurred in the original data as
it was written, and both copies are indeed bad.
I would guess from this that c2/c3 are cmdk (IDE, or older SATA, or
not in AHCI mode), which are not hot-plug, and that's why you don't
see them in cfgadm. Those aren't ideal controllers for a
performance-oriented pool anyway; another reason to start thinking
about new hardware options.
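If you want to confirm that, cfgadm only lists attachment points on
controllers that support dynamic reconfiguration, so legacy cmdk
devices simply won't appear in its output:

    cfgadm -al    # AHCI/SAS ports show up here; cmdk devices do not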
> Is there a possibility to force the two degraded vdevs online, so I
> can fully access the zpool and do a backup?
They are online, but the rate of errors is a concern. I would stop
writing to this pool to avoid further damage, if you haven't already.
You probably want to set the pool's failmode property to "continue"
to maximise the amount of data you can get off.
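With your pool named "performance" as above, that's just:

    zpool set failmode=continue performance
    zpool get failmode performance     # confirm the setting took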
The nice thing about zfs is, in general, you know if you're getting a
good backup off the pool. If you have bad memory, though, things can
get corrupted once they're out of zfs's hands. I would recommend zfs
send | zfs recv as the method of making the backup, rather than some
other tool that might not notice corruption through memory.
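As a sketch ("goodhost" and "rescuepool" are placeholders for wherever
you're backing up to):

    # snapshot everything, then replicate the whole pool in one stream
    zfs snapshot -r performance@rescue
    zfs send -R performance@rescue | ssh goodhost zfs recv -d rescuepool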
If you get errors in the backup stream, don't panic - they may be
introduced after the data is read from disk, and a retry later might
not hit the same spot. If they're in the source data on disk, you
will need to switch to a file-by-file copy to read everything you
can around those errors, once you find where they are.
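Once the pool knows about permanent errors, zpool status will name the
affected files, and you can copy around them (paths here are made up):

    zpool status -v performance     # lists files with permanent errors
    # then copy file-by-file; rsync logs unreadable files and moves on
    rsync -a /performance/data/ /mnt/rescue/data/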
The first thing I would do is shut down the box and run memtest86+ for
24h or so. Look into your options for replacement parts or a
replacement box while that runs.
If you can get the disks out and into another known-good server, that
might be a good idea, but take care of them. Note that you don't
absolutely have to have all 4 disks online in that server - if you
have space to copy the disks with dd one at a time to some other
storage, that would work too.
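Something like this per disk, if you go that route (device and target
paths are examples for an x86 box; conv=noerror,sync keeps going past
bad sectors at the cost of zero-filling them):

    dd if=/dev/rdsk/c1t1d0p0 of=/somewhere/c1t1d0.img bs=1048576 conv=noerror,sync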
Look at your FMA logs for other clues and reports of errors.
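e.g.:

    fmdump -e      # summary of error telemetry
    fmdump -eV     # full detail, including device paths
    fmadm faulty   # any faults FMA has actually diagnosed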
You may want to scrub the pool, to see what data has been damaged, but
I wouldn't do that until you have resolved the root cause, lest you
"repair" good data with bad.
> I wanted to ask first, before doing any stupid things and losing the
> whole pool.
Wise, if you have the luxury of time to wait for advice. Let us know
what you find and we can make more suggestions.
--
Dan.