I am new to ZFS. I searched around for this problem and did not find it.

# cd /foo
# mkfile 64m 1
# mkfile 64m 2
# mkfile 64m 3
# mkfile 64m 4
# mkfile 64m 5
# dd if=/dev/urandom of=afile bs=1024 count=102400
# zpool create tank2 raidz /foo/1 /foo/2 /foo/3 /foo/4 /foo/5
# cp afile /tank2
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
tank 22.5G 3.58M 22.5G 0% ONLINE -
tank2 296M 125M 171M 42% ONLINE -
# echo junk1 > 1
# echo junk2 > 2
# umount /tank2
# zfs mount /tank2
# cat /tank2/afile > /dev/null

This causes a system restart. Is this the intended result? The result varies somewhat, but after putting junk in those files (which ZFS is treating as devices), the OS resets suddenly at some point.

I think it may be because the files are suddenly a different size than ZFS expects.

This message posted from opensolaris.org
You are corrupting two copies in a RAID-Z group, which has only
single fault tolerance. While we can survive uncorrectable read errors,
write errors will result in a panic. By cat'ing that file, you are
going to push the atime for the file, which will blow up.

- Eric

On Wed, Apr 19, 2006 at 04:55:47PM -0700, Jeff Davis wrote:
> I am new to ZFS. I searched around for this problem and did not find it.
> [...]

--
Eric Schrock, Solaris Kernel Development
http://blogs.sun.com/eschrock
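Eric's point that a plain read can trigger an on-disk write is easy to demonstrate outside ZFS, since default atime semantics make every read update the file's access timestamp. A minimal sketch, assuming GNU coreutils `stat`/`touch`; on a `noatime` mount the update will not happen:

```shell
# Demonstrate that reading a file updates its atime (i.e. a read causes
# a metadata write). Assumes GNU coreutils; noatime mounts suppress this.
f=$(mktemp)
echo data > "$f"
touch -a -t 200001010000 "$f"   # push atime into the past
before=$(stat -c %X "$f")
cat "$f" > /dev/null            # the read Eric describes
after=$(stat -c %X "$f")
if [ "$after" -gt "$before" ]; then
    echo "atime advanced: the read caused a write"
else
    echo "atime unchanged (noatime mount?)"
fi
rm -f "$f"
```

On a pool whose backing devices have just been corrupted, that small atime write is exactly the write that fails and panics the box.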
The system is probably panicking because the ZFS checksums of the data read back
by the cat /tank2/afile > /dev/null are incorrect, as the pool was open at the
time you corrupted the vdevs. Does your machine write a panic string out to
/var/adm/messages at all?
If you export the pool first, things behave much more gracefully:
bash-3.00# cd /mnt/
bash-3.00# mkfile 64m 1
bash-3.00# mkfile 64m 2
bash-3.00# mkfile 64m 3
bash-3.00# mkfile 64m 4
bash-3.00# mkfile 64m 5
bash-3.00# dd if=/dev/urandom of=afile bs=1024 count=1024
bash-3.00# zpool create tank2 raidz /mnt/1 /mnt/2 /mnt/3 /mnt/4 /mnt/5
bash-3.00# cp afile /tank2/
bash-3.00# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
tank2 296M 1.31M 295M 0% ONLINE -
bash-3.00# zpool export tank2
bash-3.00# echo blahblahblah > 1
bash-3.00# zpool import -d /mnt tank2
bash-3.00# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
tank2 296M 1.36M 295M 0% DEGRADED -
bash-3.00# zpool status
pool: tank2
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-4J
scrub: resilver completed with 0 errors on Thu Apr 20 09:50:59 2006
config:
NAME STATE READ WRITE CKSUM
tank2 DEGRADED 0 0 0
raidz DEGRADED 0 0 0
15027832298606416063 UNAVAIL 0 0 0 was /mnt/1
/mnt/2 ONLINE 0 0 0
/mnt/3 ONLINE 0 0 0
/mnt/4 ONLINE 0 0 0
/mnt/5 ONLINE 0 0 0
errors: No known data errors
bash-3.00# dd if=/tank2/afile bs=1024k >/dev/null
1+0 records in
1+0 records out
bash-3.00# cksum /tank2/afile
2746568689 1048576 /tank2/afile
bash-3.00# mkfile 64m 6
bash-3.00# zpool replace tank2 /mnt/1 /mnt/6
bash-3.00# zpool status
pool: tank2
state: ONLINE
scrub: resilver completed with 0 errors on Thu Apr 20 09:52:21 2006
config:
NAME STATE READ WRITE CKSUM
tank2 ONLINE 0 0 0
raidz ONLINE 0 0 0
/mnt/6 ONLINE 0 0 0 267K resilvered
/mnt/2 ONLINE 0 0 0
/mnt/3 ONLINE 0 0 0
/mnt/4 ONLINE 0 0 0
/mnt/5 ONLINE 0 0 0
errors: No known data errors
bash-3.00# dd if=/tank2/afile bs=1024k >/dev/null
1+0 records in
1+0 records out
bash-3.00# cksum /tank2/afile
2746568689 1048576 /tank2/afile
That's the replacement of one corrupt vdev.
For two or more...
bash-3.00# zpool export tank2
bash-3.00# echo blahblahblah > 2
bash-3.00# echo blahblahblah > 3
bash-3.00# zpool import -d /mnt tank2
cannot import 'tank2': one or more devices is unavailable
bash-3.00# zpool import -d /mnt
pool: tank2
id: 11628506508750100983
state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
see: http://www.sun.com/msg/ZFS-8000-5E
config:
tank2 UNAVAIL insufficient replicas
raidz UNAVAIL insufficient replicas
/mnt/6 ONLINE
/mnt/2 UNAVAIL corrupted data
/mnt/3 UNAVAIL corrupted data
/mnt/4 ONLINE
/mnt/5 ONLINE
As expected, the data is gone because we have lost too many vdevs, but
there's no panic. :)
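The "insufficient replicas" arithmetic follows from single parity: one parity value lets you rebuild any one missing member, but not two. A toy sketch with plain integers standing in for device blocks (XOR parity, conceptually what single-parity RAID-Z computes; not ZFS code):

```shell
# Toy single-parity reconstruction: one lost "device" is recoverable,
# two are not.
d1=23; d2=42; d3=7            # data members
p=$(( d1 ^ d2 ^ d3 ))         # parity member
rebuilt=$(( p ^ d1 ^ d3 ))    # lose d2: XOR parity with the survivors
[ "$rebuilt" -eq "$d2" ] && echo "one loss: reconstructed d2=$rebuilt"
# lose d2 AND d3: p ^ d1 equals d2 ^ d3, which determines neither value
echo "two losses: only the combined value $(( p ^ d1 )) survives"
```

With two of the five members overwritten, the pool above has exactly this problem, so the import is refused rather than attempted.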
Cheers,
Alan
Hello Eric,
Thursday, April 20, 2006, 2:34:49 AM, you wrote:
ES> You are corrupting two copies in a RAID-Z group, which has only
ES> single fault tolerance. While we can survive uncorrectable read errors,
ES> write errors will result in a panic. By cat'ing that file, you are
ES> going to push the atime for the file, which will blow up.
IMHO there should be a property which controls what to do in such
situations - similar to UFS onerr=panic|lock.
If I have a few hundred TBs in different pools, I do not really want to
panic the whole system just because one of the many pools is inconsistent.
--
Best regards,
Robert mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
On Wed, Apr 19, 2006 at 08:27:00PM -0700, jeffrey davis wrote:
> Is it possible to cause an I/O error instead of resetting the system?

Theoretically, yes. But it's very difficult. We would need the ability to
abort an entire transaction group mid-stride, as well as the ability to
propagate those errors up the stack in a meaningful way. Something we'd
like to do, but decidedly non-trivial.

> My computer actually reboots. Is there a way after the reboot to tell
> what happened?

Yes, what is actually happening is that your machine is panicking. Most
likely you are running on a desktop, which means the console message
indicating the panic isn't visible since X has control of the screen. You
can find out what happened after a reboot by doing:

# cd /var/crash/<hostname>
# mdb *.0 (or whatever the highest number is)

Then "::status", "$C", and "::msgbuf" are useful commands.

> What happens when Solaris encounters an unrecoverable error on a
> non-ZFS drive?

It depends on the subsystem that's using it. The driver will return EIO,
but whatever happens after that is up to the consumer.

- Eric

--
Eric Schrock, Solaris Kernel Development
http://blogs.sun.com/eschrock
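Those triage steps can be collected into one small script. The dump directory and the `*.0` numbering are taken from Eric's message (Solaris conventions); the guard is an addition so the sketch runs harmlessly on a machine with no saved dump:

```shell
# Sketch of the post-panic triage Eric describes (Solaris mdb; paths assumed).
dir=/var/crash/$(hostname)
if [ -d "$dir" ] && ls "$dir"/*.0 >/dev/null 2>&1; then
    found=yes
    cd "$dir"
    # feed mdb the three commands Eric suggests
    mdb *.0 <<'MDB'
::status
$C
::msgbuf
MDB
else
    found=no   # nothing to inspect on this machine
    echo "no crash dump under $dir"
fi
```

If several dumps have accumulated, pick the highest-numbered set rather than `*.0`, as Eric notes.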
On Thu, Apr 20, 2006 at 11:21:34AM +0200, Robert Milkowski wrote:
> IMHO there should be a property which controls what to do in such
> situations - similar to UFS onerr=panic|lock.
>
> If I have a few hundred TBs in different pools I do not really want to
> panic the whole system just because one of the many pools is inconsistent.

This is not a case of "if (flag) do something else". Recovering from a
write error requires some fundamental changes to the architecture at
multiple levels, as we are in the middle of syncing a transaction group
and have long since lost any correlation with a filesystem-level request
by the time the failure occurs. And due to the tree-like nature of ZFS,
aborting an entire transaction group could cause a large number of I/O
failures from a single block failure.

Once we figure out how to handle this, then certainly a pool-wide property
would seem reasonable. However, there is a ton of work that needs to be
done before this. Some first steps include reallocating writes, which will
let us retry writes on other drives/locations mid-sync.

It's important to note, however, that as of build 36, we will survive any
read errors, and provide you a complete list (via 'zpool status -v') of
any such errors encountered.

- Eric

--
Eric Schrock, Solaris Kernel Development
http://blogs.sun.com/eschrock