Peter Buckingham
2007-Jan-22 22:12 UTC
[zfs-discuss] file not persistent after node bounce when there is a bad disk?
Hi All,

I noticed a behavior on a ZFS filesystem that was confusing to me and was hoping someone can shed some light on it. The summary is that I created two files, waited one minute, bounced the node, and noticed the files weren't there when the node came back. There was a bad disk at the time, which I believe is contributing to this problem. Details below.

thanks,
peter

--

Our platform is a modified x2100 system with 4 disks. We are running this version of Solaris:

$ more /etc/release
                       Solaris 10 11/06 s10x_u3wos_05a X86
           Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                           Assembled 13 September 2006

One of my 4 disks is a flaky disk (/dev/dsk/c1t0d0) that is emitting these sorts of errors:

Jan 19 00:32:55 somehost scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci108e,5348@8/disk@0,0 (sd1):
Jan 19 00:32:55 somehost        Error for Command: read(10)    Error Level: Retryable
Jan 19 00:32:55 somehost scsi: [ID 107833 kern.notice]  Requested Block: 23676213    Error Block: 1761607680
Jan 19 00:32:55 somehost scsi: [ID 107833 kern.notice]  Vendor: ATA    Serial Number:
Jan 19 00:32:55 somehost scsi: [ID 107833 kern.notice]  Sense Key: Media Error
Jan 19 00:32:55 somehost scsi: [ID 107833 kern.notice]  ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0

This disk participates in a pool:

$ zpool list
NAME     SIZE    USED   AVAIL    CAP  HEALTH  ALTROOT
tank    20.5G   1.11G   19.4G     5%  ONLINE  -

$ zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s3  ONLINE       0     0     0
            c0t1d0s3  ONLINE       0     0     0
            c1t0d0s3  ONLINE       0     0     0
            c1t1d0s3  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0
            c1t0d0s5  ONLINE       0     0     0
            c1t1d0s5  ONLINE       0     0     0

errors: No known data errors

The filesystem is mounted like this:

$ mount
...
/config on tank/config read/write/setuid/devices/exec/atime/dev=2d50003 on Fri Jan 19 00:39:31 2007
...

I created two files and waited 60 seconds, thinking this would be enough time for the data to sync to disk before bouncing the node.

$ echo hi > /config/file
$ cat /config/file
hi
$ ls -l /config/file
-rw-r--r--   1 root     root           3 Jan 19 00:35 /config/file
$ echo bye > /config/otherfile
$ ls -l /config/otherfile
-rw-r--r--   1 root     root           4 Jan 19 00:35 /config/otherfile
$ more /config/otherfile
bye
$ date
Fri Jan 19 00:36:06 GMT 2007
$ sleep 60
$ date
Fri Jan 19 00:37:13 GMT 2007
$ cat /config/file
hi
$ cat /config/otherfile
bye

I caused the system to reboot abruptly (using remote power control, so no sync happened during the reboot). What I noticed is that the files were not there after the node bounce:

$ Read from remote host somehost: Connection reset by peer
Connection to somehost closed.
$ ssh somehost
Sun Microsystems Inc.   SunOS 5.10      Generic January 2005

$ ls -l /config/file
/config/file: No such file or directory
$ ls -l /config/otherfile
/config/otherfile: No such file or directory
$ zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s3  ONLINE       0     0     0
            c0t1d0s3  ONLINE       0     0     0
            c1t0d0s3  ONLINE       0     0     0
            c1t1d0s3  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0
            c1t0d0s5  ONLINE       0     0     0
            c1t1d0s5  ONLINE       0     0     0

errors: No known data errors

$ zpool list
NAME     SIZE    USED   AVAIL    CAP  HEALTH  ALTROOT
tank    20.5G   1.11G   19.4G     5%  ONLINE  -

Note that the bad disk on the node caused a normal reboot to hang. I also verified that sync from the command line hung.
I don't know how ZFS (or Solaris) handles situations involving bad disks... does a bad disk block proper ZFS/OS handling of all I/O, even to the other healthy disks?

Is it reasonable to have assumed that after 60 seconds the data would have been on persistent disk even without an explicit sync? I confess I don't know how the underlying layers are implemented. Are there mount options or other config parameters we should tweak to get more reliable behavior in this case?

So far as I've seen, this behavior is reproducible, if someone on the ZFS team wishes to take a closer look at this scenario.
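A quick way to check whether those writes ever reached the disks, before pulling power, would be to watch per-vdev activity during the wait. A minimal sketch, assuming the pool name tank from above:

$ echo hi > /config/file
$ echo bye > /config/otherfile
$ zpool iostat -v tank 5     # per-vdev I/O every 5 seconds; the periodic txg flush
                             # should show writes on the mirror slices within ~5-10s
$ iostat -En                 # per-device soft/hard/transport error counters, to see
                             # whether c1t0d0 accumulates errors during the window

If no write activity ever appears on any vdev during the 60-second wait, the new data is still only in memory, and losing it across an abrupt power cycle would be the expected outcome.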
Tomas Ögren
2007-Jan-23 17:18 UTC
[zfs-discuss] file not persistent after node bounce when there is a bad disk?
On 22 January, 2007 - Peter Buckingham sent me these 5,2K bytes:

> $ zpool status
>   pool: tank
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         tank          ONLINE       0     0     0
>           mirror      ONLINE       0     0     0
>             c0t0d0s3  ONLINE       0     0     0
>             c0t1d0s3  ONLINE       0     0     0
>             c1t0d0s3  ONLINE       0     0     0
>             c1t1d0s3  ONLINE       0     0     0
>           mirror      ONLINE       0     0     0
>             c0t0d0s5  ONLINE       0     0     0
>             c0t1d0s5  ONLINE       0     0     0
>             c1t0d0s5  ONLINE       0     0     0
>             c1t1d0s5  ONLINE       0     0     0
>
> errors: No known data errors

You know that this is a stripe over two 4-way mirrors, right? A more common use is mirroring disks in groups of 2 and a stripe over 4 such mirrors. More like this:

        tank          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s3  ONLINE       0     0     0
            c1t0d0s3  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t1d0s3  ONLINE       0     0     0
            c1t1d0s3  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c1t0d0s5  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0
            c1t1d0s5  ONLINE       0     0     0

/Tomas
--
Tomas Ögren, stric@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
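For comparison, the two layouts could be created roughly like this (hypothetical commands, shown only to illustrate the difference; recreating the pool this way would of course destroy its current contents):

# current layout: one stripe over two 4-way mirrors
$ zpool create tank \
    mirror c0t0d0s3 c0t1d0s3 c1t0d0s3 c1t1d0s3 \
    mirror c0t0d0s5 c0t1d0s5 c1t0d0s5 c1t1d0s5

# suggested layout: a stripe over four 2-way mirrors
$ zpool create tank \
    mirror c0t0d0s3 c1t0d0s3 \
    mirror c0t1d0s3 c1t1d0s3 \
    mirror c0t0d0s5 c1t0d0s5 \
    mirror c0t1d0s5 c1t1d0s5

The 4-way mirror trades capacity and write throughput for redundancy: each mirror vdev survives the loss of three of its four slices, while a 2-way mirror survives only one.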
Peter Buckingham
2007-Jan-23 18:57 UTC
[zfs-discuss] file not persistent after node bounce when there is a bad disk?
Tomas Ögren wrote:
> You know that this is a stripe over two 4-way mirrors, right?

Yes. Performance isn't really a concern for us in this setup; persistence is. We want to still have access to files when disks fail, and we need to be able to handle up to three disk failures. The slice layout is unfortunately something we have to live with.

peter
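One way to exercise that requirement without waiting for a real failure might be to offline slices by hand and confirm the files stay readable; a rough sketch, assuming the pool above and an existing test file:

$ zpool offline tank c1t0d0s3
$ zpool offline tank c1t0d0s5           # remove the flaky disk's slice from each mirror
$ zpool status tank                     # the mirrors and the pool report DEGRADED but remain usable
$ cat /config/file                      # reads are served from the remaining slices
$ zpool online tank c1t0d0s3 c1t0d0s5   # bring the slices back and let them resilver

The difference from the failure in this thread is that an offlined device is one ZFS knows not to wait for; a half-dead disk that accepts I/O but never completes it is a different and nastier case.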
eric kustarz
2007-Jan-23 23:01 UTC
[zfs-discuss] file not persistent after node bounce when there is a bad disk?
> Note that the bad disk on the node caused a normal reboot to hang.
> I also verified that sync from the command line hung. I don't know
> how ZFS (or Solaris) handles situations involving bad disks... does
> a bad disk block proper ZFS/OS handling of all I/O, even to the
> other healthy disks?
>
> Is it reasonable to have assumed that after 60 seconds the data
> would have been on persistent disk even without an explicit sync?
> I confess I don't know how the underlying layers are implemented.
> Are there mount options or other config parameters we should tweak
> to get more reliable behavior in this case?

Hey Peter,

The first thing I would do is see if any I/O is happening ('zpool iostat 1'). If there's none, then perhaps the machine is hung (in which case you would want to grab a couple of '::threadlist -v 10's from mdb to figure out if there are hung threads).

60 seconds should be plenty of time for the async write(s) to complete. We try to push out txgs (transaction groups) every 5 seconds. However, if the system is overloaded, the txgs could take longer.

The 'sync' hanging is intriguing. Perhaps the system is just overloaded and the sync command is making it worse. Seeing what 'fsync' would do would be interesting.

> So far as I've seen, this behavior is reproducible, if someone on
> the ZFS team wishes to take a closer look at this scenario.

What else is the machine doing?

eric
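Concretely, the checks suggested above might look something like this (a sketch; the mdb pipeline runs against the live kernel and needs root):

$ zpool iostat 1                # is there any read/write activity on the pool at all?
$ zpool iostat -v tank 1        # per-vdev view, to see whether a single slice is stalled
# echo "::threadlist -v 10" | mdb -k > /var/tmp/threads.1
# sleep 30
# echo "::threadlist -v 10" | mdb -k > /var/tmp/threads.2

Comparing two threadlist snapshots taken some time apart shows which kernel threads are genuinely stuck (for example, sitting in zio_wait or txg_wait_synced across both snapshots) rather than merely busy.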
Peter Buckingham
2007-Jan-24 00:57 UTC
[zfs-discuss] file not persistent after node bounce when there is a bad disk?
Hi Eric,

eric kustarz wrote:
> The first thing I would do is see if any I/O is happening ('zpool iostat
> 1'). If there's none, then perhaps the machine is hung (in which case
> you would want to grab a couple of '::threadlist -v 10's from mdb to
> figure out if there are hung threads).

There seems to be no I/O after the initial I/O, according to zpool iostat. When we run zpool status it hangs:

HON hcb116 ~ $ zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
<hang>

I'll send you the mdb output privately since it's quite big.

> 60 seconds should be plenty of time for the async write(s) to complete.
> We try to push out txgs (transaction groups) every 5 seconds. However,
> if the system is overloaded, the txgs could take longer.

That's what I would have thought.

> The 'sync' hanging is intriguing. Perhaps the system is just
> overloaded and the sync command is making it worse. Seeing what 'fsync'
> would do would be interesting.

I've not tried this yet.

> What else is the machine doing?

We are running the honeycomb environment (you'll see when I send you the mdb output).

Is there some issue for the zpool mirrors if one of the slices disappears or is unresponsive after the pool has been brought online?

thanks,
peter
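Since the 'zpool status' command itself hangs, its user and kernel stacks might show exactly what it is blocked on; a small sketch using the standard Solaris proc tools, run as root from a second shell while the command is hung:

# pid=`pgrep -f "zpool status"`          # find the hung process
# pstack $pid                            # userland stack; typically blocked in an ioctl on /dev/zfs
# echo "0t${pid}::pid2proc | ::walk thread | ::findstack -v" | mdb -k
                                         # kernel-side stack of the same process

If the kernel stack bottoms out in something like spa_config_enter or txg_wait_synced, that points at a pool-wide lock or the sync thread being held up, presumably by the bad disk.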
Mark Maybee
2007-Feb-01 00:19 UTC
[zfs-discuss] file not persistent after node bounce when there is a bad disk?
Peter Buckingham wrote:
> Hi Eric,
>
> eric kustarz wrote:
>> The first thing I would do is see if any I/O is happening ('zpool
>> iostat 1'). If there's none, then perhaps the machine is hung (in which
>> case you would want to grab a couple of '::threadlist -v 10's from mdb
>> to figure out if there are hung threads).
>
> There seems to be no I/O after the initial I/O, according to zpool iostat.
> When we run zpool status it hangs:
>
> HON hcb116 ~ $ zpool status
>   pool: tank
>  state: ONLINE
>  scrub: none requested
> <hang>
>
> I'll send you the mdb output privately since it's quite big.
>
>> 60 seconds should be plenty of time for the async write(s) to
>> complete. We try to push out txgs (transaction groups) every 5
>> seconds. However, if the system is overloaded, the txgs could
>> take longer.
>
> That's what I would have thought.
>
>> The 'sync' hanging is intriguing. Perhaps the system is just
>> overloaded and the sync command is making it worse. Seeing what 'fsync'
>> would do would be interesting.
>
> I've not tried this yet.
>
>> What else is the machine doing?
>
> We are running the honeycomb environment (you'll see when I send you
> the mdb output).
>
> Is there some issue for the zpool mirrors if one of the slices
> disappears or is unresponsive after the pool has been brought online?
>

This can be a problem if an I/O issued to the device never completes (i.e., hangs). This can hang up the pool. A well-behaved device/driver should eventually time out the I/O, but we have seen instances where this never seems to happen.

-Mark
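If a device is holding I/O that never completes, it should be visible from the per-device queues while the pool appears hung; a hedged sketch using standard Solaris iostat (column names as printed by iostat -xn):

$ iostat -xnz 1       # -x extended stats, -n descriptive device names, -z hide idle devices
#    r/s  w/s  kr/s  kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
# A device that sits with actv > 0 and %b at 100 interval after interval,
# while completing no reads or writes, is stuck on I/O that never finishes.
$ iostat -En          # cumulative soft/hard/transport error counters per device

On the driver side, the sd target driver's command timeout is governed by the sd_io_time tunable (60 seconds by default, adjustable via /etc/system), but as noted above, some failure modes apparently never trip the timeout at all.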