Peter Buckingham
2007-Jan-22 22:12 UTC
[zfs-discuss] file not persistent after node bounce when there is a bad disk?
Hi All,

I noticed a behavior on a ZFS filesystem that was confusing to me and was hoping someone can shed some light on it. The summary is that I created two files, waited one minute, bounced the node, and noticed the files weren't there when the node came back. There was a bad disk at the time, which I believe is contributing to this problem. Details below.

thanks,
peter

--

Our platform is a modified x2100 system with 4 disks. We are running this version of Solaris:

$ more /etc/release
                       Solaris 10 11/06 s10x_u3wos_05a X86
           Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                           Assembled 13 September 2006

One of my 4 disks is a flaky disk (/dev/dsk/c1t0d0) that is emitting these sorts of errors:

Jan 19 00:32:55 somehost scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci108e,5348@8/disk@0,0 (sd1):
Jan 19 00:32:55 somehost        Error for Command: read(10)    Error Level: Retryable
Jan 19 00:32:55 somehost scsi: [ID 107833 kern.notice]  Requested Block: 23676213    Error Block: 1761607680
Jan 19 00:32:55 somehost scsi: [ID 107833 kern.notice]  Vendor: ATA    Serial Number:
Jan 19 00:32:55 somehost scsi: [ID 107833 kern.notice]  Sense Key: Media Error
Jan 19 00:32:55 somehost scsi: [ID 107833 kern.notice]  ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0

This disk participates in a pool:

$ zpool list
NAME     SIZE    USED   AVAIL    CAP  HEALTH  ALTROOT
tank    20.5G   1.11G   19.4G     5%  ONLINE  -

$ zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s3  ONLINE       0     0     0
            c0t1d0s3  ONLINE       0     0     0
            c1t0d0s3  ONLINE       0     0     0
            c1t1d0s3  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0
            c1t0d0s5  ONLINE       0     0     0
            c1t1d0s5  ONLINE       0     0     0

errors: No known data errors

The filesystem is mounted like this:

$ mount
...
/config on tank/config read/write/setuid/devices/exec/atime/dev=2d50003 on Fri Jan 19 00:39:31 2007
...

I created two files and waited 60 seconds, thinking this would be enough time for the data to sync to disk before bouncing the node.

$ echo hi > /config/file
$ cat /config/file
hi
$ ls -l /config/file
-rw-r--r--   1 root     root           3 Jan 19 00:35 /config/file
$ echo bye > /config/otherfile
$ ls -l /config/otherfile
-rw-r--r--   1 root     root           4 Jan 19 00:35 /config/otherfile
$ more /config/otherfile
bye
$ date
Fri Jan 19 00:36:06 GMT 2007
$ sleep 60
$ date
Fri Jan 19 00:37:13 GMT 2007
$ cat /config/file
hi
$ cat /config/otherfile
bye

I caused the system to reboot abruptly (using remote power control, so no sync happened during the reboot). What I noticed is that the files were not there after the node bounce:

$ Read from remote host somehost: Connection reset by peer
Connection to somehost closed.
$ ssh somehost
Sun Microsystems Inc.   SunOS 5.10      Generic January 2005

$ ls -l /config/file
/config/file: No such file or directory
$ ls -l /config/otherfile
/config/otherfile: No such file or directory
$ zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s3  ONLINE       0     0     0
            c0t1d0s3  ONLINE       0     0     0
            c1t0d0s3  ONLINE       0     0     0
            c1t1d0s3  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0
            c1t0d0s5  ONLINE       0     0     0
            c1t1d0s5  ONLINE       0     0     0

errors: No known data errors

$ zpool list
NAME     SIZE    USED   AVAIL    CAP  HEALTH  ALTROOT
tank    20.5G   1.11G   19.4G     5%  ONLINE  -

Note that the bad disk on the node caused a normal reboot to hang. I also verified that sync from the command line hung.
I don't know how ZFS (or Solaris) handles situations involving bad disks... does a bad disk block proper ZFS/OS handling of all I/O, even to the other healthy disks?

Is it reasonable to have assumed that after 60 seconds the data would have been on persistent disk even without an explicit sync? I confess I don't know how the underlying layers are implemented. Are there mount options or other config parameters we should tweak to get more reliable behavior in this case?

So far as I've seen, this behavior is reproducible, if someone on the ZFS team wishes to take a closer look at this scenario.
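A quick way to check whether those writes ever reached the disks, before pulling power, would be to watch per-vdev activity during the wait. A minimal sketch, assuming the pool name tank from above:

$ echo hi > /config/file
$ echo bye > /config/otherfile
$ zpool iostat -v tank 5     # per-vdev I/O every 5 seconds; the periodic txg flush
                             # should show writes on the mirror slices within ~5-10s
$ iostat -En                 # per-device soft/hard/transport error counters, to see
                             # whether c1t0d0 accumulates errors during the window

If no write activity ever appears on any vdev during the 60-second wait, the new data is still only in memory, and losing it across an abrupt power cycle would be the expected outcome.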
Tomas Ögren
2007-Jan-23 17:18 UTC
[zfs-discuss] file not persistent after node bounce when there is a bad disk?
On 22 January, 2007 - Peter Buckingham sent me these 5,2K bytes:

> $ zpool status
>   pool: tank
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         tank          ONLINE       0     0     0
>           mirror      ONLINE       0     0     0
>             c0t0d0s3  ONLINE       0     0     0
>             c0t1d0s3  ONLINE       0     0     0
>             c1t0d0s3  ONLINE       0     0     0
>             c1t1d0s3  ONLINE       0     0     0
>           mirror      ONLINE       0     0     0
>             c0t0d0s5  ONLINE       0     0     0
>             c0t1d0s5  ONLINE       0     0     0
>             c1t0d0s5  ONLINE       0     0     0
>             c1t1d0s5  ONLINE       0     0     0
>
> errors: No known data errors

You know that this is a stripe over two 4-way mirrors, right? A more common use is mirroring disks in groups of 2 and a stripe over 4 such mirrors. More like this:

        tank          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s3  ONLINE       0     0     0
            c1t0d0s3  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t1d0s3  ONLINE       0     0     0
            c1t1d0s3  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c1t0d0s5  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0
            c1t1d0s5  ONLINE       0     0     0

/Tomas
--
Tomas Ögren, stric@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
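For comparison, the two layouts could be created roughly like this (hypothetical commands, shown only to illustrate the difference; recreating the pool this way would of course destroy its current contents):

# current layout: one stripe over two 4-way mirrors
$ zpool create tank \
    mirror c0t0d0s3 c0t1d0s3 c1t0d0s3 c1t1d0s3 \
    mirror c0t0d0s5 c0t1d0s5 c1t0d0s5 c1t1d0s5

# suggested layout: a stripe over four 2-way mirrors
$ zpool create tank \
    mirror c0t0d0s3 c1t0d0s3 \
    mirror c0t1d0s3 c1t1d0s3 \
    mirror c0t0d0s5 c1t0d0s5 \
    mirror c0t1d0s5 c1t1d0s5

The 4-way mirror trades capacity and write throughput for redundancy: each mirror vdev survives the loss of three of its four slices, while a 2-way mirror survives only one.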
Peter Buckingham
2007-Jan-23 18:57 UTC
[zfs-discuss] file not persistent after node bounce when there is a bad disk?
Tomas Ögren wrote:
> You know that this is a stripe over two 4-way mirrors, right?

Yes. Performance isn't really a concern for us in this setup; persistence is. We want to still have access to files when disks fail, and we need to be able to handle up to three disk failures. The slice layout is unfortunately something we have to live with.

peter
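One way to exercise that requirement without waiting for a real failure might be to offline slices by hand and confirm the files stay readable; a rough sketch, assuming the pool above and an existing test file:

$ zpool offline tank c1t0d0s3
$ zpool offline tank c1t0d0s5           # remove the flaky disk's slice from each mirror
$ zpool status tank                     # the mirrors and the pool report DEGRADED but remain usable
$ cat /config/file                      # reads are served from the remaining slices
$ zpool online tank c1t0d0s3 c1t0d0s5   # bring the slices back and let them resilver

The difference from the failure in this thread is that an offlined device is one ZFS knows not to wait for; a half-dead disk that accepts I/O but never completes it is a different and nastier case.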
eric kustarz
2007-Jan-23 23:01 UTC
[zfs-discuss] file not persistent after node bounce when there is a bad disk?
> Note that the bad disk on the node caused a normal reboot to hang.
> I also verified that sync from the command line hung. I don't know
> how ZFS (or Solaris) handles situations involving bad disks... does
> a bad disk block proper ZFS/OS handling of all I/O, even to the
> other healthy disks?
>
> Is it reasonable to have assumed that after 60 seconds the data
> would have been on persistent disk even without an explicit sync?
> I confess I don't know how the underlying layers are implemented.
> Are there mount options or other config parameters we should tweak
> to get more reliable behavior in this case?

Hey Peter,

The first thing I would do is see if any I/O is happening ('zpool iostat 1'). If there's none, then perhaps the machine is hung (in which case you would want to grab a couple of '::threadlist -v 10's from mdb to figure out if there are hung threads).

60 seconds should be plenty of time for the async write(s) to complete. We try to push out txgs (transaction groups) every 5 seconds. However, if the system is overloaded, the txgs could take longer.

The 'sync' hanging is intriguing. Perhaps the system is just overloaded and the sync command is making it worse. Seeing what 'fsync' would do would be interesting.

> So far as I've seen, this behavior is reproducible, if someone on
> the ZFS team wishes to take a closer look at this scenario.

What else is the machine doing?

eric
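Concretely, the checks suggested above might look something like this (a sketch; the mdb pipeline runs against the live kernel and needs root):

$ zpool iostat 1                # is there any read/write activity on the pool at all?
$ zpool iostat -v tank 1        # per-vdev view, to see whether a single slice is stalled
# echo "::threadlist -v 10" | mdb -k > /var/tmp/threads.1
# sleep 30
# echo "::threadlist -v 10" | mdb -k > /var/tmp/threads.2

Comparing two threadlist snapshots taken some time apart shows which kernel threads are genuinely stuck (for example, sitting in zio_wait or txg_wait_synced across both snapshots) rather than merely busy.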
Peter Buckingham
2007-Jan-24 00:57 UTC
[zfs-discuss] file not persistent after node bounce when there is a bad disk?
Hi Eric,

eric kustarz wrote:
> The first thing I would do is see if any I/O is happening ('zpool iostat
> 1'). If there's none, then perhaps the machine is hung (in which case
> you would want to grab a couple of '::threadlist -v 10's from mdb to
> figure out if there are hung threads).

There seems to be no I/O after the initial I/O, according to zpool iostat. When we run zpool status it hangs:

HON hcb116 ~ $ zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
<hang>

I'll send you the mdb output privately since it's quite big.

> 60 seconds should be plenty of time for the async write(s) to complete.
> We try to push out txgs (transaction groups) every 5 seconds. However,
> if the system is overloaded, the txgs could take longer.

That's what I would have thought.

> The 'sync' hanging is intriguing. Perhaps the system is just
> overloaded and the sync command is making it worse. Seeing what 'fsync'
> would do would be interesting.

I've not tried this yet.

> What else is the machine doing?

We are running the honeycomb environment (you'll see when I send you the mdb output).

Is there some issue for the zpool mirrors if one of the slices disappears or is unresponsive after the pool has been brought online?

thanks,
peter
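Since the 'zpool status' command itself hangs, its user and kernel stacks might show exactly what it is blocked on; a small sketch using the standard Solaris proc tools, run as root from a second shell while the command is hung:

# pid=`pgrep -f "zpool status"`          # find the hung process
# pstack $pid                            # userland stack; typically blocked in an ioctl on /dev/zfs
# echo "0t${pid}::pid2proc | ::walk thread | ::findstack -v" | mdb -k
                                         # kernel-side stack of the same process

If the kernel stack bottoms out in something like spa_config_enter or txg_wait_synced, that points at a pool-wide lock or the sync thread being held up, presumably by the bad disk.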
Mark Maybee
2007-Feb-01 00:19 UTC
[zfs-discuss] file not persistent after node bounce when there is a bad disk?
Peter Buckingham wrote:
> Hi Eric,
>
> eric kustarz wrote:
>> The first thing I would do is see if any I/O is happening ('zpool
>> iostat 1'). If there's none, then perhaps the machine is hung (in which
>> case you would want to grab a couple of '::threadlist -v 10's from mdb
>> to figure out if there are hung threads).
>
> There seems to be no I/O after the initial I/O, according to zpool iostat.
> When we run zpool status it hangs:
>
> HON hcb116 ~ $ zpool status
>   pool: tank
>  state: ONLINE
>  scrub: none requested
> <hang>
>
> I'll send you the mdb output privately since it's quite big.
>
>> 60 seconds should be plenty of time for the async write(s) to
>> complete. We try to push out txgs (transaction groups) every 5
>> seconds. However, if the system is overloaded, the txgs could
>> take longer.
>
> That's what I would have thought.
>
>> The 'sync' hanging is intriguing. Perhaps the system is just
>> overloaded and the sync command is making it worse. Seeing what 'fsync'
>> would do would be interesting.
>
> I've not tried this yet.
>
>> What else is the machine doing?
>
> We are running the honeycomb environment (you'll see when I send you
> the mdb output).
>
> Is there some issue for the zpool mirrors if one of the slices
> disappears or is unresponsive after the pool has been brought online?
>

This can be a problem if an I/O issued to the device never completes (i.e., hangs). This can hang up the pool. A well-behaved device/driver should eventually time out the I/O, but we have seen instances where this never seems to happen.

-Mark
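If a device is holding I/O that never completes, it should be visible from the per-device queues while the pool appears hung; a hedged sketch using standard Solaris iostat (column names as printed by iostat -xn):

$ iostat -xnz 1       # -x extended stats, -n descriptive device names, -z hide idle devices
#    r/s  w/s  kr/s  kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
# A device that sits with actv > 0 and %b at 100 interval after interval,
# while completing no reads or writes, is stuck on I/O that never finishes.
$ iostat -En          # cumulative soft/hard/transport error counters per device

On the driver side, the sd target driver's command timeout is governed by the sd_io_time tunable (60 seconds by default, adjustable via /etc/system), but as noted above, some failure modes apparently never trip the timeout at all.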