Hi,

I recently received reports of 2 users who experienced corrupted raid-z pools with zfs-fuse, and I'm having trouble reproducing the problem or even figuring out what the cause is.

One of the users experienced corruption merely by rebooting the system:

> # zpool status
>   pool: media
>  state: FAULTED
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         media       UNAVAIL      0     0     0  insufficient replicas
>           raidz1    UNAVAIL      0     0     0  corrupted data
>             sda     ONLINE       0     0     0
>             sdb     ONLINE       0     0     0
>             sdc     ONLINE       0     0     0
>             sdd     ONLINE       0     0     0

At first I thought it was a problem of the device names being renamed (caused by a different order of disk detection on boot), but I believe in that case ZFS would report the drive as UNAVAIL. In any case, exporting and re-importing didn't work:

> # zpool import
>   pool: media
>     id: 18446744072804078091
>  state: FAULTED
> action: The pool cannot be imported due to damaged devices or data.
> config:
>
>         media       UNAVAIL  insufficient replicas
>           raidz1    UNAVAIL  corrupted data
>             sda     ONLINE
>             sdb     ONLINE
>             sdc     ONLINE
>             sdd     ONLINE

Another user experienced a similar problem, but under different circumstances: he had a raid-z pool with 2 drives, and while the system was idle he removed one of the drives. zfs-fuse doesn't notice that a drive has been removed until it tries to read from or write to the device, so "zpool status" showed the drive as still online. After a slightly confusing sequence of events (replugging the drive, zfs-fuse crashing(?!), and some other weirdness), the end result was the same:

>   pool: pool
>  state: UNAVAIL
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         pool        UNAVAIL      0     0     0  insufficient replicas
>           raidz1    UNAVAIL      0     0     0  corrupted data
>             sdc2    ONLINE       0     0     0
>             sdd2    ONLINE       0     0     0

I tried to reproduce this but I can't. When I remove a USB drive from a raid-z pool, zfs-fuse correctly shows READ/WRITE failures. I also tried killing zfs-fuse, changing the order of the drives and then starting zfs-fuse again, but after exporting and importing it never corrupted the pool (although it found checksum errors on the drive that was unplugged, of course).

Something that might be useful to know: zfs-fuse uses each block device as if it were a normal file, and it calls fsync() on the file descriptor when necessary (like in vdev_file.c). However, this only guarantees that the kernel buffers are flushed; it doesn't actually send the flush command to the disk (unfortunately there's no DKIOCFLUSHWRITECACHE ioctl equivalent in Linux). Anyway, the possibility that this is the problem seems very remote to me (and it wouldn't explain the second case).

Do you have any idea what the problem could be, or how I can determine the cause? I'm stuck at this point, and the first user seems to have lost 280 GB of data (he didn't have a backup).

Regards,
Ricardo Correia
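P.S. The closest thing I can think of to a real drive-cache flush from userspace on Linux would be issuing the ATA FLUSH CACHE command through the HDIO_DRIVE_CMD ioctl from <linux/hdreg.h>. The sketch below is rough and untested, ATA-only, needs root, and is not something zfs-fuse actually does today, so please treat it as an assumption rather than a fix:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/hdreg.h>

/* Ask the drive itself to flush its write cache.
 * For HDIO_DRIVE_CMD: args[0] = ATA command, args[1] = sector,
 * args[2] = feature, args[3] = sector count (0 for non-data commands). */
static int drive_flush_cache(int fd)
{
        unsigned char args[4] = { WIN_FLUSH_CACHE, 0, 0, 0 };

        if (ioctl(fd, HDIO_DRIVE_CMD, args) == 0)
                return 0;

        /* Some drives may prefer the 48-bit EXT variant. */
        unsigned char args_ext[4] = { WIN_FLUSH_CACHE_EXT, 0, 0, 0 };
        return ioctl(fd, HDIO_DRIVE_CMD, args_ext);
}

int main(int argc, char **argv)
{
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd == -1) {
                perror("open");
                return 1;
        }
        if (fsync(fd) == -1)                    /* flush kernel buffers first */
                perror("fsync");
        if (drive_flush_cache(fd) == -1)        /* then ask the disk itself */
                perror("HDIO_DRIVE_CMD(FLUSH CACHE)");
        close(fd);
        return 0;
}

Whether this actually reaches the drive through libata/SCSI emulation is another question, which is part of why I consider this path unlikely to be the cause here.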
Ricardo Correia wrote:
>> # zpool status
>>   pool: media
>>  state: FAULTED
>>  scrub: none requested
>> config:
>>
>>         NAME        STATE     READ WRITE CKSUM
>>         media       UNAVAIL      0     0     0  insufficient replicas
>>           raidz1    UNAVAIL      0     0     0  corrupted data
>>             sda     ONLINE       0     0     0
>>             sdb     ONLINE       0     0     0
>>             sdc     ONLINE       0     0     0
>>             sdd     ONLINE       0     0     0

Another weird behaviour, after rebooting:

> # zpool import -f media
> cannot import 'media': one or more devices is currently unavailable
> # zpool status
> no pools available
> # cfdisk /dev/sda
>   Disk Drive: /dev/sda
>   Size: 500107862016 bytes, 500.1 GB
>   Pri/Log  Free Space  500105.25
> Same for sdb/c/d

After another reboot (and this is really strange):

> # zpool import
>   pool: media
>     id: 18446744072804078091
>  state: FAULTED
> action: The pool cannot be imported due to damaged devices or data.
> config:
>
>         media       UNAVAIL  insufficient replicas
>           raidz1    UNAVAIL  corrupted data
>             sda     ONLINE
>             sdb     ONLINE
>             sdc     ONLINE
>             sdd     ONLINE
> # zpool import media
> cannot import 'media': pool may be in use from other system
> # zpool import -f media
> cannot import 'media': one or more devices is currently unavailable

> When I do "less -f /dev/sda", I see "raidz", "/dev/sdb" etc. after lots of gibberish, and I can successfully do "grep someknownfilename -a | less", so it seems that things were indeed written, and to this disk.

Any ideas?
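P.S. To check whether the ZFS labels themselves are still readable (rather than just random strings somewhere on the disk), something like the quick, untested sketch below could help. It assumes the usual on-disk layout -- four 256 KB labels per vdev, with the XDR-encoded config nvlist starting 16 KB into each label -- and only scans the first label for the pool name. If the zfs-fuse build ships zdb, "zdb -l /dev/sda" would of course be the proper way to dump the labels.

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define VDEV_LABEL_SIZE  (256 * 1024)   /* each of the 4 labels */
#define VDEV_PHYS_OFFSET (16 * 1024)    /* config nvlist starts here in label 0 */
#define VDEV_PHYS_SIZE   (112 * 1024)   /* size of the config nvlist area */

int main(int argc, char **argv)
{
        unsigned char buf[VDEV_PHYS_SIZE];
        const char *needle;
        size_t i, len;
        int fd;

        if (argc != 3) {
                fprintf(stderr, "usage: %s /dev/sdX poolname\n", argv[0]);
                return 1;
        }
        needle = argv[2];
        len = strlen(needle);

        fd = open(argv[1], O_RDONLY);
        if (fd == -1) { perror("open"); return 1; }

        if (pread(fd, buf, sizeof(buf), VDEV_PHYS_OFFSET) != (ssize_t)sizeof(buf)) {
                perror("pread");
                return 1;
        }

        /* Crude scan: the XDR nvlist stores names as plain ASCII, so the pool
         * name and the child device paths should show up verbatim. */
        for (i = 0; i + len <= sizeof(buf); i++) {
                if (memcmp(buf + i, needle, len) == 0) {
                        printf("found \"%s\" at device offset %zu\n",
                            needle, (size_t)VDEV_PHYS_OFFSET + i);
                        close(fd);
                        return 0;
                }
        }
        printf("\"%s\" not found in the first label's config area\n", needle);
        close(fd);
        return 1;
}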
On Wed, Jun 20, 2007 at 09:43:01PM +0000, Ricardo Correia wrote:
> Something that might be useful to know: zfs-fuse uses each block device as if it were a normal file, and it calls fsync() on the file descriptor when necessary (like in
> vdev_file.c). However, this only guarantees that the kernel buffers are flushed; it doesn't actually send the flush command to the disk (unfortunately there's no
> DKIOCFLUSHWRITECACHE ioctl equivalent in Linux). Anyway, the possibility that this is the problem seems very remote to me (and it wouldn't explain the second case).

Can't help with the other problems, but let me clarify this one.

Flushing the disk's write cache is only important in the event of a power failure. Disks like to reorder and cache requests, so you may end up in a situation where a pointer in the uberblock was updated, but the new block it points to wasn't yet written to the disk. But as long as you have power, the lack of DKIOCFLUSHWRITECACHE shouldn't cause any problems.

PS. In FreeBSD we use BIO_FLUSH for flushing the write cache, which I implemented as part of a different project (gjournal).

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
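To make the ordering concrete, the failure mode described above comes down to the cache flush acting as a barrier between the new blocks and the uberblock that points at them. A minimal sketch follows; it is not ZFS or zfs-fuse code, the offsets and the "uberblock" contents are invented for illustration, and fsync() is only a stand-in marking where a true drive-cache flush (DKIOCFLUSHWRITECACHE on Solaris, BIO_FLUSH in the FreeBSD kernel) would have to happen:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define UBERBLOCK_OFFSET 0              /* hypothetical: where the "uberblock" lives */
#define DATA_OFFSET      4096           /* hypothetical: where the new block goes */

/* Stand-in for a real cache flush.  fsync() only drains the kernel's buffers;
 * a power-safe implementation would also have to flush the drive cache here. */
static int flush_barrier(int fd)
{
        return fsync(fd);
}

int main(void)
{
        const char newblock[512] = "new data block";
        const char uber[512]     = "uberblock -> points at offset 4096";
        int fd = open("fake-vdev.img", O_RDWR | O_CREAT, 0644);

        if (fd == -1) { perror("open"); return 1; }

        /* 1. Write the new block(s). */
        if (pwrite(fd, newblock, sizeof(newblock), DATA_OFFSET) != (ssize_t)sizeof(newblock)) {
                perror("pwrite");
                return 1;
        }

        /* 2. Barrier: the new block must be on stable storage *before* any
         *    pointer to it becomes visible.  Without a real flush the drive
         *    may reorder steps 1 and 3, which is exactly the "pointer
         *    updated, block missing" state after a power cut. */
        if (flush_barrier(fd) == -1) { perror("flush"); return 1; }

        /* 3. Only now publish the new uberblock that references the block. */
        if (pwrite(fd, uber, sizeof(uber), UBERBLOCK_OFFSET) != (ssize_t)sizeof(uber)) {
                perror("pwrite");
                return 1;
        }

        /* 4. Flush again so the uberblock itself is durable. */
        if (flush_barrier(fd) == -1) { perror("flush"); return 1; }

        close(fd);
        return 0;
}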