I have a zpool that shows the following from a zpool status -v <zpool name>:

brsnnfs0104 [/var/spool/cron/scripts]# zpool status -v ABC0101
  pool: ABC0101
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME                              STATE     READ WRITE CKSUM
        ABC0101                           ONLINE       0     0    10
          /dev/vx/dsk/ABC01dg/ABC0101_01  ONLINE       0     0     2
          /dev/vx/dsk/ABC01dg/ABC0101_02  ONLINE       0     0     8
          /dev/vx/dsk/ABC01dg/ABC0101_03  ONLINE       0     0    10

errors: Permanent errors have been detected in the following files:

        /clients/ABC0101/rep/local/bfm/web/htdocs/tmp/rscache/717b52282ea059452621587173561360
        /clients/ABC0101/rep/local/bfm/web/htdocs/tmp/rscache/6e6a9f37c4d13fdb3dcb8649272a2a49
        /clients/ABC0101/rep/d0/prod1/reports/ReutersCMOLoad/ReutersCMOLoad.ABCntss001.20110620.141330.26496.ROLLBACK_FOR_UPDATE_COUPONS.html
        /clients/ABC0101/rep/local/bfm/web/htdocs/tmp/G2_0.related_detail_loader.1308593666.54643.n5cpoli3355.data
        /clients/ABC0101/rep/d0/prod1/reports/gp_reports/ALLMNG/20110429/F_OLPO82_A.gp.ABCIM_GA.nlaf.xml.gz
        /clients/ABC0101/rep/d0/prod1/reports/gp_reports/ALLMNG/20110429/UNVLXCIAFI.gp.ABCIM_GA.nlaf.xml.gz
        /clients/ABC0101/rep/d0/prod1/reports/gp_reports/ALLMNG/20110429/UNIVLEXCIA.gp.BARCRATING_ABC.nlaf.xml.gz

I think that a scrub at least has the possibility to clear this up.  A
quick search suggests that others have had some good experience with using
scrub in similar circumstances.  I was wondering if anyone could share
some of their experiences, good and bad, so that I can assess the risk
and probability of success with this approach.  Also, any other ideas
would certainly be appreciated.

-----RTU
On 21 June, 2011 - Todd Urie sent me these 5,9K bytes:

> I have a zpool that shows the following from a zpool status -v <zpool name>
> [...]
> I think that a scrub at least has the possibility to clear this up.
> [...]

As you have no ZFS-based redundancy, ZFS can only detect that some blocks
delivered from the devices (SAN I guess?) were broken according to the
checksum.  If you had a raidz/mirror in ZFS, it would have corrected the
problems and written the correct data back to the malfunctioning device;
without that redundancy it cannot.  A scrub only reads the data and
verifies that it matches the checksums.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
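For reference, a non-redundant pool like this can be given ZFS-level
redundancy in place by attaching a mirror half to each existing top-level
device.  A minimal sketch, assuming three spare volumes of at least the
same size are available -- the "_01m" style names below are placeholders,
not real volumes on this system:

  # zpool attach ABC0101 /dev/vx/dsk/ABC01dg/ABC0101_01 /dev/vx/dsk/ABC01dg/ABC0101_01m
  # zpool attach ABC0101 /dev/vx/dsk/ABC01dg/ABC0101_02 /dev/vx/dsk/ABC01dg/ABC0101_02m
  # zpool attach ABC0101 /dev/vx/dsk/ABC01dg/ABC0101_03 /dev/vx/dsk/ABC01dg/ABC0101_03m
  # zpool status ABC0101

After the attach, each top-level vdev should show as a resilvering two-way
mirror.  Note this only protects data going forward: blocks that are
already corrupt on the existing devices fail their checksums during the
resilver and cannot be healed by the new mirror halves.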
Todd,

Is that ZFS on top of VxVM?  Are those volumes okay?  I wonder if this is
really a sensible combination?

..Remco

On 6/21/11 7:36 AM, Todd Urie wrote:
> I have a zpool that shows the following from a zpool status -v <zpool name>
> [...]
> I think that a scrub at least has the possibility to clear this up.
> [...]
The volumes sit on an HDS SAN.  The only reason for the volumes is to
prevent inadvertent import of the zpool on two nodes of a cluster
simultaneously.  Since we're on a SAN with RAID internally, it didn't seem
that we would need ZFS to provide that redundancy as well.

On Tue, Jun 21, 2011 at 4:17 AM, Remco Lengers <remco at lengers.com> wrote:
> Todd,
>
> Is that ZFS on top of VxVM?  Are those volumes okay?  I wonder if this
> is really a sensible combination?
> [...]

--
-----RTU
On Jun 21, 2011, at 2:54 PM, Todd Urie wrote:
> The volumes sit on an HDS SAN.  The only reason for the volumes is to
> prevent inadvertent import of the zpool on two nodes of a cluster
> simultaneously.  Since we're on a SAN with RAID internally, it didn't
> seem that we would need ZFS to provide that redundancy as well.

Not a wise way of building a pool.  Your HDS SAN gives you no protection
against data corruption, and without ZFS-level redundancy ZFS can only
report corruption, not correct it.

Also, VxVM gives you no more protection against importing the
LUNs/volumes/pools on two nodes than ZFS does.  Both warn the admin who is
about to shoot themselves in the foot, but let them do it if they use
force.

Time to rebuild your pool without VxVM involved and restore the data from
backups.

Sami
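For what it's worth, a rebuilt pool along the lines Sami suggests might
look like the following -- a sketch only, with hypothetical Solaris device
names standing in for the real SAN LUNs:

  # zpool create ABC0101 mirror c4t1d0 c5t1d0 mirror c4t2d0 c5t2d0 mirror c4t3d0 c5t3d0
  # zpool status ABC0101

On the dual-import concern: ZFS already refuses to import a pool that was
last in use on another host unless forced with zpool import -f, which is
the same warn-but-allow-force behavior Sami describes, without the VxVM
layer.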
On 21/06/11 7:54 AM, Todd Urie wrote:
> The volumes sit on an HDS SAN.  The only reason for the volumes is to
> prevent inadvertent import of the zpool on two nodes of a cluster
> simultaneously.  Since we're on a SAN with RAID internally, it didn't
> seem that we would need ZFS to provide that redundancy as well.

You do if you want self-healing, as Tomas points out.  A non-redundant
pool, even on mirrored or RAID storage, offers no ability to recover from
detected errors anywhere on the data path.  To gain this benefit of ZFS,
it needs to manage the redundancy.

On the upside, ZFS at least *detected* the errors, while other systems
would not.

--Toby
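One partial mitigation short of rebuilding, if extra devices are not an
option: the copies property keeps extra "ditto" copies of each data block,
giving ZFS something to heal from even in a non-redundant pool.  A sketch
only -- the dataset name here is hypothetical, the property affects only
data written after it is set, and it does not protect against losing a
whole device:

  # zfs set copies=2 ABC0101/clients
  # zfs get copies ABC0101/clients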
> didn't seem that we would need zfs to provide that redundancy as well.

There was a time when I fell for this line of reasoning too.  The problem
(if you want to call it that) with zfs is that it will show you, front and
center, the corruption taking place in your stack.

> Since we're on a SAN with RAID internally

Your situation would suggest that your RAID silently corrupted data and
didn't even know about it.  Until you can trust the volumes behind zfs
(and I don't trust any of them anymore, regardless of the brand name on
the cabinet), give zfs at least some redundancy so that it can pick up the
slack.

By the way, I used to trust storage because I didn't believe it was
corrupting data, but I had no proof one way or the other, so I gave it the
benefit of the doubt.  Since I have been using zfs, my standards have gone
up considerably.  Now I trust storage because I can *prove* it's correct.
If someone can't prove that a volume is returning correct data, don't
trust it.  Let zfs manage it.
--
This message posted from opensolaris.org
Hi Todd,

Yes, I have seen zpool scrub do some miracles, but I think it depends on
the amount of corruption.  A few suggestions are:

1. Identify and resolve the corruption problems on the underlying
hardware.  There is no point in trying to clear the pool errors if this
problem continues.  The fmdump command and the fmdump -eV command output
will tell you how long these errors have been occurring.

2. Run zpool scrub and zpool clear to attempt to clear the errors.

3. If the errors below don't clear, then manually remove the corrupted
files below, if possible, and restore from backup.  Depending on what
fmdump says, you might check your backups for corruption.

4. Run zpool scrub and zpool clear again as needed.

5. Consider replacing this configuration with a redundant ZFS storage
pool.  We can provide the recommended syntax.

Let us know how this turns out.

Thanks,

Cindy

On 06/20/11 23:36, Todd Urie wrote:
> I have a zpool that shows the following from a zpool status -v <zpool name>
> [...]
> I think that a scrub at least has the possibility to clear this up.
> [...]
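As a rough sketch, steps 1 through 4 above might look like this on the
pool in question (output will vary; the file removed below is one from the
pool's permanent error list):

  # fmdump
  # fmdump -eV | more
  # zpool scrub ABC0101
  # zpool status -v ABC0101
  # zpool clear ABC0101
  # rm /clients/ABC0101/rep/local/bfm/web/htdocs/tmp/rscache/717b52282ea059452621587173561360
  # zpool scrub ABC0101

The second scrub re-verifies the pool after the cleanup; once no permanent
errors remain, zpool status -v should stop listing affected files.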
I'll be doing this over the upcoming weekend, so I'll see how it goes.
Thanks for all of the suggestions.

Todd

On Jun 22, 2011, at 10:48 AM, Cindy Swearingen <cindy.swearingen at oracle.com> wrote:

> Hi Todd,
>
> Yes, I have seen zpool scrub do some miracles, but I think it depends
> on the amount of corruption.
> [...]