Hi,

I recently received reports of 2 users who experienced corrupted raid-z pools with zfs-fuse, and I'm having trouble reproducing the problem or even figuring out what the cause is.

One of the users experienced corruption merely by rebooting the system:

> # zpool status
>   pool: media
>  state: FAULTED
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         media       UNAVAIL      0     0     0  insufficient replicas
>           raidz1    UNAVAIL      0     0     0  corrupted data
>             sda     ONLINE       0     0     0
>             sdb     ONLINE       0     0     0
>             sdc     ONLINE       0     0     0
>             sdd     ONLINE       0     0     0

At first I thought it was a problem of the device names being renamed (caused by a different order of disk detection on boot), but I believe in that case ZFS would report the drive as UNAVAIL. In any case, exporting and re-importing didn't work:

> # zpool import
>   pool: media
>     id: 18446744072804078091
>  state: FAULTED
> action: The pool cannot be imported due to damaged devices or data.
> config:
>
>         media       UNAVAIL  insufficient replicas
>           raidz1    UNAVAIL  corrupted data
>             sda     ONLINE
>             sdb     ONLINE
>             sdc     ONLINE
>             sdd     ONLINE

Another user experienced a similar problem, but under different circumstances: he had a raid-z pool with 2 drives, and while the system was idle he removed one of the drives. zfs-fuse doesn't notice that a drive has been removed until it tries to read from or write to the device, so "zpool status" showed the drive as still online. After a slightly confusing sequence of events (replugging the drive, zfs-fuse crashing(?!), and some other weirdness), the end result was the same:

>   pool: pool
>  state: UNAVAIL
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         pool        UNAVAIL      0     0     0  insufficient replicas
>           raidz1    UNAVAIL      0     0     0  corrupted data
>             sdc2    ONLINE       0     0     0
>             sdd2    ONLINE       0     0     0

I tried to reproduce this but I can't. When I remove a USB drive from a raid-z pool, zfs-fuse correctly shows READ/WRITE failures. I also tried killing zfs-fuse, changing the order of the drives and then starting zfs-fuse again, but after exporting and importing it never corrupted the pool (although it found checksum errors on the drive that was unplugged, of course).

Something that might be useful to know: zfs-fuse uses each block device as if it were a normal file, and it calls fsync() on the file descriptor when necessary (like in vdev_file.c). However, this only guarantees that the kernel buffers are flushed; it doesn't actually send the flush command to the disk (unfortunately there's no DKIOCFLUSHWRITECACHE ioctl equivalent in Linux). Anyway, the possibility that this is the problem seems very remote to me (and it wouldn't explain the second case).

Do you have any idea what the problem could be, or how I can determine the cause? I'm stuck at this point, and the first user seems to have lost 280 GB of data (he didn't have a backup).

Regards,
Ricardo Correia
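P.S. The closest thing I can think of to a real drive-cache flush from userspace on Linux would be issuing the ATA FLUSH CACHE command through the HDIO_DRIVE_CMD ioctl from <linux/hdreg.h>. The sketch below is rough and untested, ATA-only, needs root, and is not something zfs-fuse actually does today, so please treat it as an assumption rather than a fix:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/hdreg.h>

/* Ask the drive itself to flush its write cache.
 * For HDIO_DRIVE_CMD: args[0] = ATA command, args[1] = sector,
 * args[2] = feature, args[3] = sector count (0 for non-data commands). */
static int drive_flush_cache(int fd)
{
        unsigned char args[4] = { WIN_FLUSH_CACHE, 0, 0, 0 };

        if (ioctl(fd, HDIO_DRIVE_CMD, args) == 0)
                return 0;

        /* Some drives may prefer the 48-bit EXT variant. */
        unsigned char args_ext[4] = { WIN_FLUSH_CACHE_EXT, 0, 0, 0 };
        return ioctl(fd, HDIO_DRIVE_CMD, args_ext);
}

int main(int argc, char **argv)
{
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd == -1) {
                perror("open");
                return 1;
        }
        if (fsync(fd) == -1)                    /* flush kernel buffers first */
                perror("fsync");
        if (drive_flush_cache(fd) == -1)        /* then ask the disk itself */
                perror("HDIO_DRIVE_CMD(FLUSH CACHE)");
        close(fd);
        return 0;
}

Whether this actually reaches the drive through libata/SCSI emulation is another question, which is part of why I consider this path unlikely to be the cause here.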
Ricardo Correia wrote:
>> # zpool status
>>   pool: media
>>  state: FAULTED
>>  scrub: none requested
>> config:
>>
>>         NAME        STATE     READ WRITE CKSUM
>>         media       UNAVAIL      0     0     0  insufficient replicas
>>           raidz1    UNAVAIL      0     0     0  corrupted data
>>             sda     ONLINE       0     0     0
>>             sdb     ONLINE       0     0     0
>>             sdc     ONLINE       0     0     0
>>             sdd     ONLINE       0     0     0

Another weird behaviour, after rebooting:

> # zpool import -f media
> cannot import 'media': one or more devices is currently unavailable
> # zpool status
> no pools available
> # cfdisk /dev/sda
>   Disk Drive: /dev/sda
>   Size: 500107862016 bytes, 500.1 GB
>   Pri/Log  Free Space  500105.25
> Same for sdb/c/d

After another reboot (and this is really strange):

> # zpool import
>   pool: media
>     id: 18446744072804078091
>  state: FAULTED
> action: The pool cannot be imported due to damaged devices or data.
> config:
>
>         media       UNAVAIL  insufficient replicas
>           raidz1    UNAVAIL  corrupted data
>             sda     ONLINE
>             sdb     ONLINE
>             sdc     ONLINE
>             sdd     ONLINE
> # zpool import media
> cannot import 'media': pool may be in use from other system
> # zpool import -f media
> cannot import 'media': one or more devices is currently unavailable

> When I do "less -f /dev/sda", I see "raidz", "/dev/sdb" etc. after lots of gibberish, and I can successfully do "grep someknownfilename -a | less", so it seems that things were indeed written, and to this disk.

Any ideas?
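P.S. To check whether the ZFS labels themselves are still readable (rather than just random strings somewhere on the disk), something like the quick, untested sketch below could help. It assumes the usual on-disk layout -- four 256 KB labels per vdev, with the XDR-encoded config nvlist starting 16 KB into each label -- and only scans the first label for the pool name. If the zfs-fuse build ships zdb, "zdb -l /dev/sda" would of course be the proper way to dump the labels.

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define VDEV_LABEL_SIZE  (256 * 1024)   /* each of the 4 labels */
#define VDEV_PHYS_OFFSET (16 * 1024)    /* config nvlist starts here in label 0 */
#define VDEV_PHYS_SIZE   (112 * 1024)   /* size of the config nvlist area */

int main(int argc, char **argv)
{
        unsigned char buf[VDEV_PHYS_SIZE];
        const char *needle;
        size_t i, len;
        int fd;

        if (argc != 3) {
                fprintf(stderr, "usage: %s /dev/sdX poolname\n", argv[0]);
                return 1;
        }
        needle = argv[2];
        len = strlen(needle);

        fd = open(argv[1], O_RDONLY);
        if (fd == -1) { perror("open"); return 1; }

        if (pread(fd, buf, sizeof(buf), VDEV_PHYS_OFFSET) != (ssize_t)sizeof(buf)) {
                perror("pread");
                return 1;
        }

        /* Crude scan: the XDR nvlist stores names as plain ASCII, so the pool
         * name and the child device paths should show up verbatim. */
        for (i = 0; i + len <= sizeof(buf); i++) {
                if (memcmp(buf + i, needle, len) == 0) {
                        printf("found \"%s\" at device offset %zu\n",
                            needle, (size_t)VDEV_PHYS_OFFSET + i);
                        close(fd);
                        return 0;
                }
        }
        printf("\"%s\" not found in the first label's config area\n", needle);
        close(fd);
        return 1;
}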
On Wed, Jun 20, 2007 at 09:43:01PM +0000, Ricardo Correia wrote:
> Something that might be useful to know: zfs-fuse uses each block device as if it were a normal file, and it calls fsync() on the file descriptor when necessary (like in
> vdev_file.c). However, this only guarantees that the kernel buffers are flushed; it doesn't actually send the flush command to the disk (unfortunately there's no
> DKIOCFLUSHWRITECACHE ioctl equivalent in Linux). Anyway, the possibility that this is the problem seems very remote to me (and it wouldn't explain the second case).

Can't help with the other problems, but let me clarify this one.

Flushing the disk's write cache is only important in the event of a power failure. Disks like to reorder and cache requests, so you may end up in a situation where a pointer in the uberblock was updated, but the new block it points to wasn't yet written to the disk. But as long as you have power, the lack of DKIOCFLUSHWRITECACHE shouldn't cause any problems.

PS. In FreeBSD we use BIO_FLUSH for flushing the write cache, which I implemented as part of a different project (gjournal).

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                         http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
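To make the ordering concrete, the failure mode described above comes down to the cache flush acting as a barrier between the new blocks and the uberblock that points at them. A minimal sketch follows; it is not ZFS or zfs-fuse code, the offsets and the "uberblock" contents are invented for illustration, and fsync() is only a stand-in marking where a true drive-cache flush (DKIOCFLUSHWRITECACHE on Solaris, BIO_FLUSH in the FreeBSD kernel) would have to happen:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define UBERBLOCK_OFFSET 0              /* hypothetical: where the "uberblock" lives */
#define DATA_OFFSET      4096           /* hypothetical: where the new block goes */

/* Stand-in for a real cache flush.  fsync() only drains the kernel's buffers;
 * a power-safe implementation would also have to flush the drive cache here. */
static int flush_barrier(int fd)
{
        return fsync(fd);
}

int main(void)
{
        const char newblock[512] = "new data block";
        const char uber[512]     = "uberblock -> points at offset 4096";
        int fd = open("fake-vdev.img", O_RDWR | O_CREAT, 0644);

        if (fd == -1) { perror("open"); return 1; }

        /* 1. Write the new block(s). */
        if (pwrite(fd, newblock, sizeof(newblock), DATA_OFFSET) != (ssize_t)sizeof(newblock)) {
                perror("pwrite");
                return 1;
        }

        /* 2. Barrier: the new block must be on stable storage *before* any
         *    pointer to it becomes visible.  Without a real flush the drive
         *    may reorder steps 1 and 3, which is exactly the "pointer
         *    updated, block missing" state after a power cut. */
        if (flush_barrier(fd) == -1) { perror("flush"); return 1; }

        /* 3. Only now publish the new uberblock that references the block. */
        if (pwrite(fd, uber, sizeof(uber), UBERBLOCK_OFFSET) != (ssize_t)sizeof(uber)) {
                perror("pwrite");
                return 1;
        }

        /* 4. Flush again so the uberblock itself is durable. */
        if (flush_barrier(fd) == -1) { perror("flush"); return 1; }

        close(fd);
        return 0;
}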