I am new to ZFS. I searched around for this problem and did not find it.

# cd /foo
# mkfile 64m 1
# mkfile 64m 2
# mkfile 64m 3
# mkfile 64m 4
# mkfile 64m 5
# dd if=/dev/urandom of=afile bs=1024 count=102400
# zpool create tank2 raidz /foo/1 /foo/2 /foo/3 /foo/4 /foo/5
# cp afile /tank2
# zpool list
NAME     SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
tank    22.5G  3.58M  22.5G   0%  ONLINE  -
tank2    296M   125M   171M  42%  ONLINE  -
# echo junk1 > 1
# echo junk2 > 2
# umount /tank2
# zfs mount /tank2
# cat /tank2/afile > /dev/null

This causes a system restart. Is this the intended result?

The exact result varies, but after putting junk in those files (which ZFS is treating as devices), the OS resets suddenly at some point. I think it may be because the files are suddenly a different size than ZFS is expecting.

This message posted from opensolaris.org
You are corrupting two copies in a RAID-Z group, which has only single-fault tolerance. While we can survive uncorrectable read errors, write errors will result in a panic. By cat'ing that file, you are going to push the atime for the file, which will blow up.

- Eric

On Wed, Apr 19, 2006 at 04:55:47PM -0700, Jeff Davis wrote:
> I am new to ZFS. I searched around for this problem and did not find it.
> [...]
> causes system restart. Is this the intended result?
> The result varies somewhat, but after putting junk in those files (which
> zfs is treating as a device), it will do a sudden OS reset at some point.
>
> I think it may be because the files are all of a sudden a different size
> than zfs is expecting.

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
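[Editor's aside: if the panic here really is triggered by the atime update on read, one way to narrow the window in a test like this is to stop reads from dirtying metadata at all, by disabling atime updates on the dataset. A minimal sketch, assuming the `tank2` pool from the example above; `atime` is a standard ZFS dataset property:]

```shell
# Disable access-time updates so a plain read of a file does not
# generate a metadata write against the (corrupted) vdevs.
zfs set atime=off tank2

# Confirm the property took effect.
zfs get atime tank2
```

[This only avoids the read-triggered write; any other write to the pool can still hit the corrupted devices and panic.]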
The system is probably panicking because the ZFS checksums of the data read back by the "cat /tank2/afile > /dev/null" are incorrect, since the pool was open at the time you corrupted the vdevs. Does your machine write a panic string out to /var/adm/messages at all?

If you export the pool first, things behave much more gracefully...

bash-3.00# cd /mnt/
bash-3.00# mkfile 64m 1
bash-3.00# mkfile 64m 2
bash-3.00# mkfile 64m 3
bash-3.00# mkfile 64m 4
bash-3.00# mkfile 64m 5
bash-3.00# dd if=/dev/urandom of=afile bs=1024 count=1024
bash-3.00# zpool create tank2 raidz /mnt/1 /mnt/2 /mnt/3 /mnt/4 /mnt/5
bash-3.00# cp afile /tank2/
bash-3.00# zpool list
NAME     SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
tank2    296M  1.31M   295M   0%  ONLINE  -
bash-3.00# zpool export tank2
bash-3.00# echo blahblahblah > 1
bash-3.00# zpool import -d /mnt tank2
bash-3.00# zpool list
NAME     SIZE   USED  AVAIL  CAP  HEALTH    ALTROOT
tank2    296M  1.36M   295M   0%  DEGRADED  -
bash-3.00# zpool status
  pool: tank2
 state: DEGRADED
status: One or more devices could not be used because the label is
        missing or invalid.  Sufficient replicas exist for the pool to
        continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Thu Apr 20 09:50:59 2006
config:

        NAME                      STATE     READ WRITE CKSUM
        tank2                     DEGRADED     0     0     0
          raidz                   DEGRADED     0     0     0
            15027832298606416063  UNAVAIL      0     0     0  was /mnt/1
            /mnt/2                ONLINE       0     0     0
            /mnt/3                ONLINE       0     0     0
            /mnt/4                ONLINE       0     0     0
            /mnt/5                ONLINE       0     0     0

errors: No known data errors
bash-3.00# dd if=/tank2/afile bs=1024k >/dev/null
1+0 records in
1+0 records out
bash-3.00# cksum /tank2/afile
2746568689      1048576 /tank2/afile
bash-3.00# mkfile 64m 6
bash-3.00# zpool replace tank2 /mnt/1 /mnt/6
bash-3.00# zpool status
  pool: tank2
 state: ONLINE
 scrub: resilver completed with 0 errors on Thu Apr 20 09:52:21 2006
config:

        NAME        STATE     READ WRITE CKSUM
        tank2       ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            /mnt/6  ONLINE       0     0     0  267K resilvered
            /mnt/2  ONLINE       0     0     0
            /mnt/3  ONLINE       0     0     0
            /mnt/4  ONLINE       0     0     0
            /mnt/5  ONLINE       0     0     0

errors: No known data errors
bash-3.00# dd if=/tank2/afile bs=1024k >/dev/null
1+0 records in
1+0 records out
bash-3.00# cksum /tank2/afile
2746568689      1048576 /tank2/afile

That's the replacement of one corrupt vdev. For two or more...

bash-3.00# zpool export tank2
bash-3.00# echo blahblahblah > 2
bash-3.00# echo blahblahblah > 3
bash-3.00# zpool import -d /mnt tank2
cannot import 'tank2': one or more devices is unavailable
bash-3.00# zpool import -d /mnt
  pool: tank2
    id: 11628506508750100983
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

        tank2       UNAVAIL   insufficient replicas
          raidz     UNAVAIL   insufficient replicas
            /mnt/6  ONLINE
            /mnt/2  UNAVAIL   corrupted data
            /mnt/3  UNAVAIL   corrupted data
            /mnt/4  ONLINE
            /mnt/5  ONLINE

As expected, the data is gone because we have lost too many vdevs, but there's no panic. :)

Cheers,
Alan
Hello Eric,

Thursday, April 20, 2006, 2:34:49 AM, you wrote:

ES> You are corrupting two copies in a RAID-Z group, which has only
ES> single-fault tolerance. While we can survive uncorrectable read errors,
ES> write errors will result in a panic. By cat'ing that file, you are
ES> going to push the atime for the file, which will blow up.

IMHO there should be a property which controls what to do in such situations, similar to UFS onerror=panic|lock.

If I have a few hundred TB in different pools, I do not really want to panic the whole system just because one of the many pools is inconsistent.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
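[Editor's aside: for reference, the UFS behaviour Robert is pointing at is chosen at mount time via the `onerror` option. A minimal sketch; the device path and mount point below are placeholders:]

```shell
# Mount a UFS filesystem so that an internal inconsistency locks the
# filesystem (applying it read-only) rather than panicking the system.
# /dev/dsk/c0t0d0s6 and /data are placeholder names for illustration.
mount -F ufs -o onerror=lock /dev/dsk/c0t0d0s6 /data
```

[The alternatives are onerror=panic (the default) and onerror=umount; a per-pool ZFS equivalent is what Robert is asking for.]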
On Wed, Apr 19, 2006 at 08:27:00PM -0700, jeffrey davis wrote:
>
> Is it possible to cause an I/O error instead of resetting the system?

Theoretically, yes. But it's very difficult. We would need the ability to abort an entire transaction group mid-stride, as well as the ability to propagate those errors up the stack in a meaningful way. Something we'd like to do, but decidedly non-trivial.

> My computer actually reboots. Is there a way after the reboot to tell
> what happened?

Yes, what is actually happening is your machine is panicking. Most likely you are running on a desktop, which means the console message indicating the panic isn't visible since X has control of the screen. You can find out what happened after a reboot by doing:

# cd /var/crash/<hostname>
# mdb *.0       (or whatever the highest number is)

Then "::status", "$C", and "::msgbuf" are useful commands.

> What happens when Solaris encounters an unrecoverable error on a
> non-ZFS drive?

It depends on the subsystem that's using it. The driver will return EIO, but whatever happens after that is up to the consumer.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
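[Editor's aside: Eric's post-mortem steps assume a crash dump was actually saved after the panic. A hedged sketch of checking that first; the paths shown are the usual Solaris defaults and may differ on your machine:]

```shell
# Verify crash-dump configuration; the "Savecore enabled: yes" line
# is what makes dumps land in /var/crash/<hostname> after a panic.
dumpadm

# After the reboot, open the most recent saved dump pair with mdb.
cd /var/crash/$(hostname)
mdb unix.0 vmcore.0
```

[Inside mdb, the ::status, $C, and ::msgbuf commands Eric mentions show the panic summary, the panicking stack, and the last console messages respectively.]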
On Thu, Apr 20, 2006 at 11:21:34AM +0200, Robert Milkowski wrote:
>
> IMHO there should be a property which controls what to do in such
> situations - similar to UFS onerror=panic|lock.
>
> If I have a few hundred TB in different pools I do not really want to
> panic the whole system just because one of the many pools is inconsistent.

This is not a case of "if (flag) do something else". Recovering from a write error requires some fundamental changes to the architecture at multiple levels, as we are in the middle of syncing a transaction group and have long since lost any correlation with a filesystem-level request by the time the failure occurs. And due to the tree-like nature of ZFS, aborting an entire transaction group could cause a large number of I/O failures from a single block failure.

Once we figure out how to handle this, then certainly a pool-wide property would seem reasonable. However, there is a ton of work that needs to be done before this. Some first steps include reallocating writes, which will let us retry writes on other drives/locations mid-sync.

It's important to note, however, that as of build 36, we will survive any read errors, and provide you a complete list (via 'zpool status -v') of any such errors encountered.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
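[Editor's aside: to exercise the read-error path Eric describes, a scrub forces every block in the pool to be read and checksummed, and 'zpool status -v' then reports any unrecoverable errors found. A minimal sketch, reusing the tank2 pool name from the earlier examples; the exact report format varies by build:]

```shell
# Read and verify every allocated block in the pool.
zpool scrub tank2

# Show per-device READ/WRITE/CKSUM counters; with -v, also list
# anything affected by unrecoverable errors the scrub encountered.
zpool status -v tank2
```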