On Thu, Jan 21, 2010 at 03:55:59PM +0100, Matthias Appel wrote:
> I have a serious issue with my zpool.
Yes. You need to figure out what the root cause of the issue is.
> My zpool consists of 4 vdevs which are assembled into 2 mirrors.
>
> One of these mirrors got degraded because of too many errors on each
> vdev of the mirror.
> Yes, both vdevs of the mirror got degraded.
Yes, but note they're still active, just in a degraded state. You
should be able to read the data, with care and with some luck that the
errors don't align to take out both copies of a particular chunk of
data.
> According to Murphy's law I don't have a backup either (I have a
> backup, which was made several months ago, and some backups spread
> across several disks)
> Neither of these backups is ideal, so I want to access my data on
> the zpool so I can make a fresh backup and replace the OpenSolaris
> server.
If you have them, you may be able to combine them with the current
data in the pool to reconstruct what you need.
> As the two faulted vdevs are connected to different controllers, I
> assume that the problem is located on the server, not on the hard
> disks/controllers. One of the faulted hard disks was replaced some
> weeks ago due to CRC errors, so I assume the server is bad, not the
> disks/cables/controllers.
Yeah, especially since these are cksum errors. Prime suspects would
be bad memory and an inadequate power supply; bad motherboard or PCI
bus, dust, temperature and others follow after that.
> My state is as follows:
>
>         NAME          STATE     READ WRITE CKSUM
>         performance   DEGRADED     0     0     8
>           mirror      DEGRADED     0     0    16
>             c1t1d0    DEGRADED     0     0    23  too many errors
>             c2d0      DEGRADED     0     0    24  too many errors
>           mirror      ONLINE       0     0     0
>             c1t2d0    ONLINE       0     0     7
>             c3d0      ONLINE       0     0     7
Note that you have cksum errors on both mirror vdevs, even though the
second one has not yet been marked degraded. It's curious that the
mirror pairs are showing basically the same number of errors; this
suggests that the corruption may have occurred in the original data as
it was written, and both copies are indeed bad.
I would guess from this that c2/c3 are cmdk (IDE, or older SATA, or
not in AHCI mode), which are not hot-plug, and that's why you don't
see them in cfgadm. Those aren't ideal controllers for a
performance-oriented pool anyway; another reason to start thinking
about new hardware options.
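If you want to confirm that, cfgadm only lists attachment points on
controllers that support dynamic reconfiguration, so legacy cmdk
devices simply won't appear in its output:

    cfgadm -al    # AHCI/SAS ports show up here; cmdk devices do not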
> Is there a possibility to force the two degraded vdevs online, so I
> can fully access the zpool and do a backup?
They are online, but the rate of errors is a concern. I would stop
writing to this pool to avoid further damage, if you haven't already.
You probably want to set the pool's failmode property to "continue"
to maximise the amount of data you can get off.
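With your pool named "performance" as above, that's just:

    zpool set failmode=continue performance
    zpool get failmode performance     # confirm the setting took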
The nice thing about zfs is, in general, you know if you're getting a
good backup off the pool. If you have bad memory, though, things can
get corrupted once they're out of zfs's hands. I would recommend zfs
send | zfs recv as the method of making the backup, rather than some
other tool that might not notice corruption through memory.
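As a sketch ("goodhost" and "rescuepool" are placeholders for wherever
you're backing up to):

    # snapshot everything, then replicate the whole pool in one stream
    zfs snapshot -r performance@rescue
    zfs send -R performance@rescue | ssh goodhost zfs recv -d rescuepool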
If you get errors in the backup stream, don't panic - they may be
introduced after the data is read from disk, and a retry later might
not hit the same spot. If they're in the source data on disk, you
will need to switch to a file-by-file copy to read everything you
can around those errors, once you find where they are.
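Once the pool knows about permanent errors, zpool status will name the
affected files, and you can copy around them (paths here are made up):

    zpool status -v performance     # lists files with permanent errors
    # then copy file-by-file; rsync logs unreadable files and moves on
    rsync -a /performance/data/ /mnt/rescue/data/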
The first thing I would do is shut down the box and run memtest86+ for
24h or so. Look into your options for replacement parts or a
replacement box while that runs.
If you can get the disks out and into another known-good server, that
might be a good idea, but take care of them. Note that you don't
absolutely have to have all 4 disks online in that server - if you
have space to copy the disks with dd one at a time to some other
storage, that would work too.
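Something like this per disk, if you go that route (device and target
paths are examples for an x86 box; conv=noerror,sync keeps going past
bad sectors at the cost of zero-filling them):

    dd if=/dev/rdsk/c1t1d0p0 of=/somewhere/c1t1d0.img bs=1048576 conv=noerror,sync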
Look at your FMA logs for other clues and reports of errors.
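e.g.:

    fmdump -e      # summary of error telemetry
    fmdump -eV     # full detail, including device paths
    fmadm faulty   # any faults FMA has actually diagnosed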
You may want to scrub the pool, to see what data has been damaged, but
I wouldn't do that until you have resolved the root cause, lest you
"repair" good data with bad.
> I wanted to ask first, before doing any stupid things and losing the
> whole pool.
Wise, if you have the luxury of time to wait for advice. Let us know
what you find and we can make more suggestions.
--
Dan.