Lou Picciano
2012-Jan-28 17:52 UTC
[zfs-discuss] Oddly-persistent file error on ZFS root pool
Hello ZFS wizards,

Have an odd ZFS problem I'd like to run by you. The root pool on this machine is a 'simple' mirror - just two disks:

# zpool status
        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     3
          mirror-0    ONLINE       0     0     6
            c2t0d0s0  ONLINE       0     0     6
            c2t1d0s0  ONLINE       0     0     6

errors: Permanent errors have been detected in the following files:

        rpool/ROOT/openindiana-userland-154@zfs-auto-snap_monthly-2011-11-22-09h19:/etc/svc/repository-boot-tmpEdaGba

... or similar; the CKSUM counts have varied, but were always in that 1x-2x, 'symmetrical' pattern.

After working through the problems listed below, scrubbing, and zfs-destroying the snapshot with the 'permanent errors', the CKSUM counts clear up, but vestiges of the file remain as hex addresses:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c2t0d0s0  ONLINE       0     0     0
            c2t1d0s0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x18e73>:<0x78007>

I have no evidence that ZFS is itself the direct culprit here; it may just be on the receiving end of one of a couple of problems we've recently worked through on this machine:

1. a defective CPU, managed by the fault manager, but without a fully-configured crashdump (now rectified), then
2. the Sandy Bridge 'interrupt storm' problem, which we seem to have now worked around.

The storage pools are scrubbed pretty regularly, and we generally have no cksum errors at all. At one point, vmstat reported 7+ million interrupt faults over 5 seconds! I've attempted to clear stats on the pool as well (didn't expect this to work, but worth a try, right?)

Important to note that Memtest86+ had been run, most recently for ~14 hrs, with no errors reported. I don't think the storage controller is the culprit either, as _all_ drives are controlled by the P67A, and no other problems have been seen. No errors reported via smartctl, either.
Would welcome input from two perspectives:

1) Before I rebuild the pool/reinstall/whatever, is anyone here interested in any diagnostic output which might still be available? Is any of this useful as a bug report?
2) Then, I would love to hear ideas on a solution. Proposed solutions include:

1) Creating a new BE based on a snapshot of the root pool:
   - Snapshot the root pool
   - (zfs send to datapool for safekeeping)
   - Split rpool
   - zpool create newpool (on Drive 'B')
   - beadm create -p newpool NEWboot (being sure to use slice 0 of Drive 'B')

2) Simply deleting _all_ snapshots on the rpool.

3) A complete re-install.

Thanks for feedback,
Lou Picciano
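For the record, proposal 1 might look roughly like the following. This is only a sketch: the pool, snapshot, and BE names are illustrative, and it substitutes `zpool split` (which detaches the second disk of the mirror into a new single-disk pool) for the separate "split rpool / zpool create newpool" steps in the list above, so it should be adapted and verified before use.

```shell
# Sketch of proposal 1; names (pre-rebuild, datapool, NEWboot) are illustrative.
# Assumes Drive 'B' is c2t1d0s0, as in the zpool status output above.

# 1. Snapshot the root pool recursively
zfs snapshot -r rpool@pre-rebuild

# 2. Send a full replication stream to the data pool for safekeeping
zfs send -R rpool@pre-rebuild | zfs receive -d datapool/rpool-backup

# 3. Split the mirror: the second disk becomes a new single-disk pool
zpool split rpool newpool c2t1d0s0

# 4. Create a new boot environment on the new pool
beadm create -p newpool NEWboot

# 5. Activate the new BE and verify before touching the original pool
beadm activate NEWboot
beadm list
```

Note that after `zpool split`, rpool is no longer redundant until a disk is re-attached.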
Bayard G. Bell
2012-Jan-29 20:22 UTC
[zfs-discuss] [zfs] Oddly-persistent file error on ZFS root pool
Lou,

Tried to answer this when you asked on IRC. Try a zpool clear and scrub again to see if the errors persist.

Cheers,
Bayard

On Sat, 2012-01-28 at 17:52 +0000, Lou Picciano wrote:
> [...]
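The suggested cycle amounts to the following commands (pool name taken from the thread); `-v` on `zpool status` lists any files still flagged with permanent errors:

```shell
# Reset the error counters and the list of flagged files
zpool clear rpool

# Re-read and verify every block in the pool
zpool scrub rpool

# Wait for the scrub to complete, then check whether the errors persist
zpool status -v rpool
```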
Lou Picciano
2012-Jan-30 01:50 UTC
[zfs-discuss] [zfs] Oddly-persistent file error on ZFS root pool
Bayard,

Indeed, you did answer it - and thanks for getting back to me - your suggestion was spot ON!

However, the simple zpool clear/scrub cycle wouldn't work in our case - at least initially. In fact, after multiple 'rinse/repeat' cycles, the offending file - or its hex representation - would reappear, and the CKSUM error counts would often mount. Logically, this seems to make some sense: ZFS would attempt to reconstitute the damaged file with each scrub...(?)

In any case, after gathering the nerve to start deleting old snapshots - including the one with the offending file - the clear/scrub process worked a charm. Many thanks again!

Lou Picciano

----- Original Message -----
From: "Bayard G. Bell" <buffer.g.overflow at gmail.com>
To: zfs at lists.illumos.org
Cc: zfs-discuss at opensolaris.org
Sent: Sunday, January 29, 2012 3:22:39 PM
Subject: Re: [zfs] Oddly-persistent file error on ZFS root pool

Lou,

Tried to answer this when you asked on IRC. Try a zpool clear and scrub again to see if the errors persist.

Cheers,
Bayard

On Sat, 2012-01-28 at 17:52 +0000, Lou Picciano wrote:
> [...]
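The fix described here - deleting the snapshots that still reference the damaged file, then clearing and scrubbing - might be sketched as follows. The snapshot name is the one from the original error report; `zfs destroy` of a snapshot is irreversible, so listing first is worth the extra step:

```shell
# List all snapshots on the root pool first - destroy is irreversible
zfs list -t snapshot -r rpool

# Destroy the snapshot that references the damaged file
zfs destroy rpool/ROOT/openindiana-userland-154@zfs-auto-snap_monthly-2011-11-22-09h19

# Then clear the error list and re-scrub to confirm the errors are gone
zpool clear rpool
zpool scrub rpool
zpool status -v rpool
```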
Bayard G. Bell
2012-Jan-31 12:01 UTC
[zfs-discuss] [zfs] Oddly-persistent file error on ZFS root pool
On Mon, 2012-01-30 at 01:50 +0000, Lou Picciano wrote:
> However, the simple zpool clear/scrub cycle wouldn't work in our case - at least initially. In fact, after multiple 'rinse/repeats', the offending file - or its hex representation - would reappear. In fact, the CKSUM errors would often mount... Logically, this seems to make some sense; that zfs would attempt to reconstitute the damaged file with each scrub...(?)

As the truth is somewhere in between, I'll insert my comment accordingly. You should only see the errors continue if there's a dataset with a reference to the version of the file that creates those errors. I've seen this before: until all of those datasets are deleted, the errors will continue to be diagnosed, sometimes presented without dataset names, which might be considered a bug (it seems wrong that you don't get a dataset name for clones). You wouldn't happen to have preserved output that could be used to determine if/where there's a bug?

> In any case, after gathering the nerve to start deleting old snapshots - including the one with the offending file - the clear/scrub process worked a charm. Many thanks again!
>
> Lou Picciano
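For anyone chasing the same symptom: the hex pair in the error list, `<0x18e73>:<0x78007>`, is an objset (dataset) ID followed by an object ID. On illumos the objset ID can usually be mapped back to a dataset with zdb; this is a sketch, assuming a contemporary zdb whose output format may vary between releases:

```shell
# <0x18e73>:<0x78007> is <objset id>:<object id>.
# List the pool's datasets along with their objset IDs and look for
# the one matching 0x18e73 (102003 decimal):
zdb -d rpool

# If the dataset still exists, its objects can be inspected directly,
# e.g. (dataset name hypothetical):
# zdb -dddd rpool/ROOT/somebe 0x78007
```

When no dataset with that ID remains, the reference is stale, which matches the behaviour described above: the entry lingers until the referencing datasets are destroyed and the pool is cleared and re-scrubbed.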
Lou Picciano
2012-Jan-31 14:17 UTC
[zfs-discuss] [zfs] Oddly-persistent file error on ZFS root pool
Bayard,

> You wouldn't happen to have preserved output that could be used to determine if/where there's a bug?

Whoops - in fact, this was the gist of my email to the lists, really: that someone might be interested in diagnostic information. The short answer, after not hearing a lot of interest in that - apart from your own response - and now having deleted the datasets in question: I hope so! I certainly do have the last crash dump - intact, as the machine hasn't crashed since... (...sound of knocking on wood). Please tell me what debug info I can provide.

The fix itself turned out to be quite easy, as you've indicated. My first concern was in being a good 'Listizen' - identifying/reporting a bug, if one exists.

Thanks again for your help,
Lou Picciano

----- Original Message -----
From: "Bayard G. Bell" <buffer.g.overflow at gmail.com>
To: zfs at lists.illumos.org
Cc: zfs-discuss at opensolaris.org
Sent: Tuesday, January 31, 2012 7:01:53 AM
Subject: Re: [zfs] Oddly-persistent file error on ZFS root pool

> [...]