thr3ads.net - zfs discuss - [zfs-discuss] Cannot delete errored file [Jun 2008]

Ben Middleton

2008-Jun-03 11:27 UTC

[zfs-discuss] Cannot delete errored file

Hi,

I can't seem to delete a file in my zpool that has permanent errors:

zpool status -vx
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 2h10m with 1 errors on Tue Jun  3 11:36:49 2008
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /export/duke/test/Acoustic/3466/88832/09 - Check.mp3


rm "/export/duke/test/Acoustic/3466/88832/09 - Check.mp3"

rm: cannot remove `/export/duke/test/Acoustic/3466/88832/09 - Check.mp3': I/O error

Each time I try to do anything to the file, the checksum error count goes up on
the pool.

I also tried a mv and a cp over the top - but same I/O error.

I performed a "zpool scrub rpool" followed by a "zpool clear rpool" - but
still get the same error. Any ideas?

PS - I'm running snv_86 and using the sata driver on an Intel x86
architecture.

B
 
 
This message posted from opensolaris.org

Ben Middleton

2008-Jun-05 09:13 UTC

Hello again,

I'm not making progress on this.

Every time I run a zpool scrub rpool I see:

$ zpool status -vx
  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress for 0h0m, 0.01% done, 177h43m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     8
          raidz1    DEGRADED     0     0     8
            c0t0d0  DEGRADED     0     0     0  too many errors
            c0t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

/export/duke/test/Acoustic/3466/88832/09 - Check.mp3


I popped in a brand new disk of the same size and did a zpool replace to swap
the persistently degraded drive for the new drive, i.e.:

$ zpool replace rpool c0t0d0 c0t7d0

But that simply had the effect of transferring the issue to the new drive:

$ zpool status -xv rpool
  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 2h41m with 1 errors on Wed Jun  4 20:22:27 2008
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         DEGRADED     0     0     8
          raidz1      DEGRADED     0     0     8
            spare     DEGRADED     0     0     0
              c0t0d0  DEGRADED     0     0     0  too many errors
              c0t7d0  ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
            c0t2d0    ONLINE       0     0     0
        spares
          c0t7d0      INUSE     currently in use


$ zpool detach rpool c0t0d0

$ zpool status -vx rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 2h41m with 1 errors on Wed Jun  4 20:22:27 2008
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     8
          raidz1    ONLINE       0     0     8
            c0t7d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xc3>:<0x1c0>

$ zpool scrub rpool

...

$ zpool status -vx rpool

  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress for 0h0m, 0.00% done, 0h0m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     4
          raidz1    DEGRADED     0     0     4
            c0t7d0  DEGRADED     0     0     0  too many errors
            c0t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

/export/duke/test/Acoustic/3466/88832/09 - Check.mp3

$ rm -f "/export/duke/test/Acoustic/3466/88832/09 - Check.mp3"

rm: cannot remove `/export/duke/test/Acoustic/3466/88832/09 - Check.mp3': I/O error


I'm guessing this isn't a hardware fault but a glitch in ZFS - but I'm
hoping to be proved wrong.

Any ideas before I rebuild the pool from scratch? And if I do, is there anything
I can do to prevent this problem in the future?

B

Marc Bevand

2008-Jun-05 10:05 UTC

Ben Middleton <ben <at> drn.org> writes:
>
> [...]
> But that simply had the effect of transferring the issue to the new drive:

When you see this behavior, it most likely means that it's not your drive
that is failing; instead it indicates a bad SATA/SAS cable or a bad port
on the disk controller.

PS: have you tried ": >xxx.mp3" to truncate your corrupted file?
(The colon is a shell builtin that does nothing.) If I were you I
would also try removing the directory containing the corrupted file.

-marc
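
For readers unfamiliar with the ": >file" idiom: it opens the file for writing, which truncates it to zero bytes without unlinking it. The sketch below shows the equivalent operation in Python on a healthy filesystem; the temporary directory and file contents are hypothetical stand-ins, and nothing here suggests it would succeed on a pool that returns I/O errors.

```python
import os
import tempfile


def truncate_in_place(path):
    """Shrink a file to zero bytes without unlinking it, like ': > path'."""
    # Opening in "w" mode truncates the file; nothing is written afterwards.
    with open(path, "w"):
        pass


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for the damaged file, created on a healthy filesystem.
    path = os.path.join(tmp, "09 - Check.mp3")
    with open(path, "wb") as f:
        f.write(b"\x00" * 1024)

    truncate_in_place(path)
    size_after = os.path.getsize(path)
    still_exists = os.path.exists(path)
```

The file name contains spaces, so in a shell it would need quoting; passing the path as a single Python string avoids that concern.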

Ben Middleton

2008-Jun-05 10:34 UTC

Hi Marc,

$ : > 09 - Check.mp3
bash:  09 - Check.mp3: I/O error

$ cd ..
$ rm -rf BAD
rm: cannot remove `BAD/09 - Check.mp3': I/O error

I'll try shuffling the cables - but as you see above it occasionally
reports on a different disk - so I imagine the cables are OK. Also, the new disk I
added has a new cable too, and is on a different SATA port - which is also showing
up as degraded.

Is there any lower-level debugging that I can enable to try and work out what is
going on?

This machine has been running fine since last August.

I couldn't see anything in builds later than snv_86 that might help -
but I could try upgrading to the latest?

B

Marc Bevand

2008-Jun-07 09:17 UTC

Weird. I have no idea how you could remove that file (besides destroying the
entire filesystem)...

One other thing I noticed:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     8
          raidz1    ONLINE       0     0     8
            c0t7d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0

When you see non-zero CKSUM error counters at the pool or raidz1/z2 vdev
level, but no errors on the devices like this, it means that ZFS couldn't
correct the corruption after multiple attempts at reconstructing the
stripes, each time assuming a different device was corrupting data. IOW it
means that 2+ (in a raidz1) or 3+ (in a raidz2) devices returned corrupted
data in the same stripe. Since it is statistically improbable to have that
many silent data corruptions in the same stripe, this condition most likely
indicates a hardware problem. I suggest running memtest to stress-test your
cpu/mem/mobo.

-marc
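
The pattern Marc describes (checksum errors counted on the pool or raidz vdev while every leaf device shows zero) can be checked mechanically. The following is only a rough sketch: it assumes the config table layout shown in this thread, uses indentation to tell parent vdevs from leaf devices, and the function names are made up for illustration.

```python
import re

SAMPLE = """\
        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     8
          raidz1    ONLINE       0     0     8
            c0t7d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
"""

ROW = re.compile(r"^(\s*)(\S+)\s+\S+\s+\S+\s+\S+\s+(\S+)")


def parse_config(table):
    """Return (indent, name, cksum_is_nonzero) for each row below the header."""
    rows = []
    for line in table.splitlines():
        m = ROW.match(line)
        if not m or m.group(2) == "NAME":
            continue
        # Counters may carry a suffix such as "211K"; only zero vs non-zero matters here.
        rows.append((len(m.group(1)), m.group(2), m.group(3) != "0"))
    return rows


def unattributed_checksum_errors(table):
    """True if a parent vdev reports CKSUM errors but no leaf device does."""
    rows = parse_config(table)
    parents, leaves = [], []
    for i, (indent, _name, nonzero) in enumerate(rows):
        # A row is a parent if the next row is indented more deeply.
        has_child = i + 1 < len(rows) and rows[i + 1][0] > indent
        (parents if has_child else leaves).append(nonzero)
    return any(parents) and not any(leaves)


print(unattributed_checksum_errors(SAMPLE))
```

A table in which every counter is zero, like the very first status output in the thread, would not be flagged.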

Ben Middleton

2008-Jun-07 10:43 UTC

Thanks Marc - I'll run memtest on Monday, and re-seat memory/CPU/cards
etc. If that fails, I'll try moving the devices onto a different SATA
controller. Failing that I'll rebuild from scratch. Failing that,
I'll get a new motherboard!

Ben

Ben Middleton

2008-Jun-09 15:16 UTC

Hi,

Today's update:

- I ran a memtest a few times - no errors.
- I reseated, re-routed and switched all connectors/cables.
- I'm currently running a scrub, but it's showing vast numbers
of cksum errors now across all devices:

$ zpool status -v
  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress for 0h5m, 3.35% done, 2h26m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0  211K
          raidz1    DEGRADED     0     0  211K
            c0t7d0  DEGRADED     0     0     0  too many errors
            c0t1d0  DEGRADED     0     0     0  too many errors
            c0t2d0  DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

/export/duke/test/Acoustic/3466/88832/09 - Check.mp3

Once the scrub completes, I'll start moving each device over to a different
controller to see if that helps. Still getting I/O errors trying to delete
that file.

Ben

Marc Bevand

2008-Jun-10 06:10 UTC

Ben Middleton <ben <at> drn.org> writes:
>
> Today's update:
> - I ran a memtest a few times - no errors.

Just making sure you know about it: memtest should run for _at least_ a couple
of hours, and should complete at least 1 pass.

Also, after the scrub completes, any permanent errors you see (so far you only
seem to have 1 such error in this mp3 file) are by definition uncorrectable.
It means the data was corrupted while it was in memory or in transit, before
being written to the drives. At this point your only option is
to rewrite good, uncorrupted data (i.e. destroy & recreate the pool since you
can't remove the file...).

-marc

Ben Middleton

2008-Jun-10 08:09 UTC

Hi Marc,

Thanks for all of your suggestions.

I'll restart memtest when I'm next in the office and leave it
running overnight.

I can recreate the pool - but I guess the question is am I safe to do this on
the existing setup, or am I going to hit the same issue again sometime? Assuming
I don't find any obvious hardware issues - wouldn't this be
regarded as a flaw in ZFS (i.e. no way of clearing such an error without a
rebuild)?

Would I be safer rebuilding to a pair of mirrors rather than a 3-disk raidz +
hot spare?

Ben

Jeff Bonwick

2008-Jun-10 08:35 UTC

That's odd -- the only way the 'rm' should fail is if it can't
read the znode for that file.  The znode is metadata, and is
therefore stored in two distinct places using ditto blocks.
So even if you had one unlucky copy that was damaged on two
of your disks, you should still have another copy elsewhere.

Assuming you weren't so shockingly unlucky, the only way to
get a corrupted znode that I know of is flaky memory, such that
the znode is checksummed, then the DRAM flips a bit, then you
write the znode to disk.  The fact that you've seen so many
checksum errors makes me suspect hardware all the more.

Can you send me the output of fmdump -ev and fmdump -eV?
There should be some useful crumbs in there...

Jeff
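
The failure sequence Jeff describes (checksum first, bit flip in DRAM second, write third) can be illustrated with a toy model. This is not ZFS code: SHA-256 stands in for whatever checksum the pool uses, the block contents are invented, and the assumption that both ditto copies are written from the same already-damaged buffer is made only to show why redundancy would not help in that case.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Stand-in checksum; SHA-256 is used here purely for illustration."""
    return hashlib.sha256(block).hexdigest()


def flip_bit(block: bytes, bit_index: int) -> bytes:
    """Return a copy of block with a single bit inverted."""
    data = bytearray(block)
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(data)


def read_verified(copies, expected):
    """Return the first copy whose checksum matches, else raise an I/O error."""
    for copy in copies:
        if checksum(copy) == expected:
            return copy
    raise IOError("checksum mismatch on every copy")


original = b"hypothetical znode for '09 - Check.mp3'"
stored_checksum = checksum(original)   # 1. checksum taken while the data is good
written = flip_bit(original, 42)       # 2. one bit flips in memory
ditto_copies = [written, written]      # 3. every copy is written from the bad buffer

try:
    read_verified(ditto_copies, stored_checksum)
except IOError as err:
    print("read failed:", err)
```

Every copy fails verification because the damage happened before any copy reached disk, which matches the symptom of an error that survives scrubs and drive replacement.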

Ben Middleton

2008-Jun-10 15:01 UTC

Sent response by private message.

Today's findings are that the cksum errors appear on the new disk on
the other controller too - so I've ruled out controllers & cables.
It's probably as Jeff says - just got to figure out now how to prove
the memory is duff.

Ben

Brandon High

2008-Jun-10 15:57 UTC

On Tue, Jun 10, 2008 at 8:01 AM, Ben Middleton <ben at drn.org> wrote:
> Today's findings are that the cksum errors appear on the new disk
> on the other controller too - so I've ruled out controllers &
> cables. It's probably as Jeff says - just got to figure out now how to
> prove the memory is duff.

How much memory do you have and what chipset / controller are you using?

There are some controllers that claim to do DMA to 64-bit addresses
but don't actually support it, which causes errors on machines with >4GB
of memory. The SB600 is one example.

If it is bad memory that has somehow passed memtest, swapping the
memory for known good (preferably ECC) memory is one option to
diagnose it.

-B

-- 
Brandon High bhigh at freaks.com
"The good is the enemy of the best." - Nietzsche

Ben Middleton

2008-Jun-10 16:12 UTC

Hi,

It's an ASUS P5K-WS board with 2GB of Corsair TwinX DDR2 8500 1066MHz
non-ECC memory. The board uses the Intel P35 chipset - it also will not support
ECC RAM. TBH, this is probably the last time I'll get an ASUS, as this
is the second board I've got through - the first one died for no
particular reason. I've been recommended a Supermicro C2SBX mobo - this
will take my existing Supermicro PCI-X card as well as my current Intel CPU. The
only problem is that it's DDR3 - and ECC DDR3 RAM is pretty hard to
come by right now.

I'll still try a long memtest run, followed by a rebuild of the errored
pool. I'll have a read around to see if there's any way of
making the memory more stable on this mobo.

Ben

Brandon High

2008-Jun-10 17:20 UTC

On Tue, Jun 10, 2008 at 9:12 AM, Ben Middleton <ben at drn.org> wrote:
> I'll still try a long memtest run, followed by a rebuild of the
> errored pool. I'll have a read around to see if there's any way
> of making the memory more stable on this mobo.

Run it at 800MHz. I have an MSI P35 Platinum for my Windows gaming
system and after trying to get my 1066 memory to run stably at speed,
I gave up and run it at 800. You should try reducing the memory speed
and relaxing the timing to 5-5-5-15 to see if it helps.

-B

-- 
Brandon High bhigh at freaks.com
"The good is the enemy of the best." - Nietzsche

Ben Middleton

2008-Jun-13 09:16 UTC

Hi,

Quick update:

I left memtest running overnight - 39 passes, no errors.

I also attempted to force the BIOS to run the memory at 800MHz & 5-5-5-15 as
suggested - but the machine became very unstable - long boot times, PCI-Express
failure of the Yukon network card on booting, etc. I've switched it back to
Auto speed & timing for now. I'll just hope that it was a one-off
glitch that corrupted the pool.

I'm going to rebuild the pool this weekend.

Thanks for all the suggestions.

Ben

Jonathan Loran

2008-Jun-13 17:00 UTC

Ben Middleton wrote:
> Hi,
>
> Quick update:
>
> I left memtest running over night - 39 passes, no errors.
>
> I also attempted to force the BIOS to run the memory at 800MHz &
> 5-5-5-15 as suggested - but the machine became very unstable - long boot times;
> PCI-Express failure of Yukon network card on booting etc. I've switched
> it back to Auto speed&timing for now. I'll just hope that it was a
> one-off glitch that corrupted the pool.
>
> I'm going to rebuild the pool this weekend.
>
> Thanks for all the suggestions.
>

Ben,

Haven't read this whole thread, and this has been brought up before, but
make sure your power supply is running clean.  I can't tell you how many
times I've seen very strange and intermittent system errors occur from a
flaky power supply.

Jon

-- 


-     _____/     _____/      /           - Jonathan Loran -           -
-    /          /           /                IT Manager               -
-  _____  /   _____  /     /     Space Sciences Laboratory, UC Berkeley
-        /          /     /      (510) 643-5146 jloran at ssl.berkeley.edu
- ______/    ______/    ______/           AST:7731^29u18e3
