Hi, I'd appreciate it if anyone can point me to how to identify poorly performing disks that might have dragged down the performance of the pool. The system also logged the following error about one of the drives. Does it show the disk was having a problem?

Aug 17 13:45:56 zfs1.domain.com scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,25f9@6/pci1000,3140@0 (mpt1):
Aug 17 13:45:56 zfs1.domain.com 	Disconnected command timeout for Target 10
Aug 17 13:45:56 zfs1.domain.com scsi: [ID 365881 kern.info] /pci@0,0/pci8086,25f9@6/pci1000,3140@0 (mpt1):
Aug 17 13:45:56 zfs1.domain.com 	Log info 31140000 received for target 10.
Aug 17 13:45:56 zfs1.domain.com 	scsi_status=0, ioc_status=8048, scsi_state=c
Aug 17 13:45:56 zfs1.domain.com scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,25f9@6/pci1000,3140@0/sd@a,0 (sd15):
Aug 17 13:45:56 zfs1.domain.com 	SCSI transport failed: reason 'reset': retrying command
Aug 17 13:45:59 zfs1.domain.com scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,25f9@6/pci1000,3140@0/sd@a,0 (sd15):
Aug 17 13:45:59 zfs1.domain.com 	Error for Command: read(10)    Error Level: Retryable
Aug 17 13:45:59 zfs1.domain.com scsi: [ID 107833 kern.notice] 	Requested Block: 715872929    Error Block: 715872929
Aug 17 13:45:59 zfs1.domain.com scsi: [ID 107833 kern.notice] 	Vendor: ATA    Serial Number: WD-WCAP
Aug 17 13:45:59 zfs1.domain.com scsi: [ID 107833 kern.notice] 	Sense Key: Unit Attention
Aug 17 13:45:59 zfs1.domain.com scsi: [ID 107833 kern.notice] 	ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

-- This message posted from opensolaris.org
You can try:

zpool iostat pool_name -v 1

This will show you I/O on each vdev at one-second intervals. Perhaps you will see different I/O behavior on a suspect drive.

-Scott
Also you may wish to look at the output of 'iostat -xnce 1' as well. You can post that to the list if you have a specific problem.

You want to be looking for error counts increasing, and specifically at 'asvc_t' for the service times on the disks. A higher asvc_t number may help to isolate poorly performing individual disks.

Scott Meilicke wrote:
> You can try:
>
> zpool iostat pool_name -v 1
>
> This will show you IO on each vdev at one second intervals. Perhaps you will see different IO behavior on any suspect drive.
>
> -Scott
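If you want to watch for that from a script rather than by eye, something like this works (a sketch; `flag_slow_disks` is a hypothetical helper, and the column positions assume `iostat -xn` / `iostat -xnce` output, where asvc_t is field 8 and the device name is the last field):

```shell
# Print "device asvc_t" for any disk whose average service time (ms)
# exceeds a threshold. Feed it `iostat -xn 1` or `iostat -xnce 1`.
flag_slow_disks() {
  awk -v limit="${1:-50}" '
    # device rows start with digits; headers and cpu blocks are skipped
    /^ *[0-9]/ && $8 + 0 > limit { print $NF, $8 }
  '
}
# usage: iostat -xnce 1 | flag_slow_disks 30
```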
Maybe you can run a DTrace probe using Chime?

http://blogs.sun.com/observatory/entry/chime

Initial Traces -> Device IO
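If you'd rather not set up Chime, roughly the same per-device view can be had with a plain dtrace one-liner (a sketch; requires root, and uses the io provider's `start` probe):

```shell
# Tally bytes issued per device node; the aggregation prints on Ctrl-C.
dtrace -n 'io:::start { @bytes[args[1]->dev_statname] = sum(args[0]->b_bcount); }'
```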
Running "iostat -nxce 1", I saw write sizes alternate between the two raidz groups in the same pool. At one point, the drives on controller 1 have larger writes (3-10 times) than the ones on controller 2:

                    extended device statistics       ---- errors ---
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 fd0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   2   0   0   2 c1t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   1   0   0   1 c0t10d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   1   0   0   1 c0t11d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   2   0   0   2 c3t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   2   0   0   2 c4t0d0
    0.0    9.0    0.0    4.0  0.0  0.0    0.0    0.5   0   0   1   0   0   1 c0t12d0
    0.0    9.0    0.0    4.0  0.0  0.0    0.0    0.1   0   0   1   0   0   1 c0t13d0
    0.0    9.0    0.0    4.5  0.0  0.0    0.0    0.1   0   0   1   0   0   1 c0t14d0
    0.0    8.0    0.0    4.5  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c0t15d0
    0.0    9.0    0.0    3.5  0.0  0.0    0.0    0.1   0   0   1   0   0   1 c0t16d0
    0.0    9.0    0.0    3.5  0.0  0.0    0.0    0.1   0   0   1   0   0   1 c0t17d0
    0.0   20.0    0.0   56.5  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c2t6d0
    0.0   20.0    0.0   55.0  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c2t7d0
    0.0   20.0    0.0   53.5  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c2t8d0
    0.0   20.0    0.0   53.0  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c2t9d0
    0.0   20.0    0.0   55.5  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c2t10d0
    0.0   20.0    0.0   55.0  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c2t11d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   1   0   0   1 c2t12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   1   0   0   1 c2t13d0
     cpu
 us sy wt id
  0 47  0 53
                    extended device statistics       ---- errors ---
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 fd0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   2   0   0   2 c1t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   1   0   0   1 c0t10d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   1   0   0   1 c0t11d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   2   0   0   2 c3t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   2   0   0   2 c4t0d0
    0.0    8.0    0.0   18.5  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c0t12d0
    0.0    8.0    0.0   18.5  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c0t13d0
    0.0   11.0    0.0   20.5  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c0t14d0
    0.0   12.0    0.0   20.5  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c0t15d0
    0.0    8.0    0.0   19.0  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c0t16d0
    0.0    8.0    0.0   18.5  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c0t17d0
    0.0   21.0    0.0   66.0  0.0  0.0    0.0    0.4   0   1   1   0   0   1 c2t6d0
    0.0   21.0    0.0   66.0  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c2t7d0
    0.0   21.0    0.0   65.5  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c2t8d0
    0.0   20.0    0.0   64.0  0.0  0.0    0.0    0.4   0   0   1   0   0   1 c2t9d0
    0.0   21.0    0.0   65.0  0.0  0.0    0.0    0.4   0   0   1   0   0   1 c2t10d0
    0.0   21.0    0.0   64.0  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c2t11d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   1   0   0   1 c2t12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   1   0   0   1 c2t13d0
     cpu
 us sy wt id
  0 23  0 77
....

At other times, the drives on controller 2 have larger writes (3-10 times) than the ones on controller 1:

                    extended device statistics       ---- errors ---
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 fd0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   2   0   0   2 c1t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   1   0   0   1 c0t10d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   1   0   0   1 c0t11d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   2   0   0   2 c3t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   2   0   0   2 c4t0d0
    0.0   24.0    0.0   65.5  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c0t12d0
    0.0   24.0    0.0   64.0  0.0  0.0    0.0    0.4   0   0   1   0   0   1 c0t13d0
    0.0   25.0    0.0   67.0  0.0  0.0    0.0    0.5   0   0   1   0   0   1 c0t14d0
    0.0   25.0    0.0   66.5  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c0t15d0
    0.0   26.0    0.0   69.0  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c0t16d0
    0.0   26.0    0.0   69.0  0.0  0.0    0.0    0.5   0   0   1   0   0   1 c0t17d0
    0.0   12.0    0.0   20.5  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c2t6d0
    0.0   12.0    0.0   20.0  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c2t7d0
    0.0   13.0    0.0   20.5  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c2t8d0
    0.0   13.0    0.0   20.0  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c2t9d0
    0.0   14.0    0.0   22.0  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c2t10d0
    0.0   14.0    0.0   22.0  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c2t11d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   1   0   0   1 c2t12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   1   0   0   1 c2t13d0
     cpu
 us sy wt id
  0 42  0 58
                    extended device statistics       ---- errors ---
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 fd0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   2   0   0   2 c1t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   1   0   0   1 c0t10d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   1   0   0   1 c0t11d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   2   0   0   2 c3t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   2   0   0   2 c4t0d0
    0.0   20.0    0.0   56.0  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c0t12d0
    0.0   20.0    0.0   55.5  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c0t13d0
    0.0   19.0    0.0   54.0  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c0t14d0
    0.0   19.0    0.0   53.0  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c0t15d0
    0.0   18.0    0.0   54.5  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c0t16d0
    0.0   18.0    0.0   54.5  0.0  0.0    0.0    0.4   0   0   1   0   0   1 c0t17d0
    0.0   14.0    0.0   28.5  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c2t6d0
    0.0   14.0    0.0   28.5  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c2t7d0
    0.0   14.0    0.0   30.5  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c2t8d0
    0.0   14.0    0.0   30.0  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c2t9d0
    0.0   14.0    0.0   30.0  0.0  0.0    0.0    0.2   0   0   1   0   0   1 c2t10d0
    0.0   14.0    0.0   29.0  0.0  0.0    0.0    0.3   0   0   1   0   0   1 c2t11d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   1   0   0   1 c2t12d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   1   0   0   1 c2t13d0

Is this expected behavior? Shouldn't writes be spread out evenly across all the drives all the time? Does this only apply to drives on the same RAID controller? Here is the pool structure:

  pool: zpool
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        zpool        ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c0t12d0  ONLINE       0     0     0
            c0t13d0  ONLINE       0     0     0
            c0t14d0  ONLINE       0     0     0
            c0t15d0  ONLINE       0     0     0
            c0t16d0  ONLINE       0     0     0
            c0t17d0  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t6d0   ONLINE       0     0     0
            c2t7d0   ONLINE       0     0     0
            c2t8d0   ONLINE       0     0     0
            c2t9d0   ONLINE       0     0     0
            c2t10d0  ONLINE       0     0     0
            c2t11d0  ONLINE       0     0     0
        spares
          c2t12d0    AVAIL
          c2t13d0    AVAIL
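For what it's worth, here is the rough helper I used to total the writes per controller from a captured interval (a sketch; `sum_writes_by_controller` is a made-up name, and field 4 is kw/s in this iostat output):

```shell
# Sum kw/s per controller prefix (c0, c2, ...) across one iostat interval.
sum_writes_by_controller() {
  awk '
    /^ *[0-9]/ {
      dev = $NF
      if (dev ~ /^c[0-9]+t/) {           # skip fd0 and the cpu block
        ctrl = dev; sub(/t.*/, "", ctrl) # c0t12d0 -> c0
        kw[ctrl] += $4                   # kw/s column
      }
    }
    END { for (c in kw) printf "%s %.1f\n", c, kw[c] }
  '
}
# usage: save one interval of `iostat -xnce 1` to a file, then:
#   sum_writes_by_controller < interval.txt
```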
Scott Lawson writes:
> Also you may wish to look at the output of 'iostat -xnce 1' as well.
>
> You can post those to the list if you have a specific problem.
>
> You want to be looking for error counts increasing, and specifically 'asvc_t'
> for the service times on the disks. A higher number for asvc_t may help to
> isolate poorly performing individual disks.

I blast the pool with dd and look for drives that are *always* active while others in the same group have completed their transaction group and get no more activity. Within a group, drives should be getting the same amount of data per 5 seconds (zfs_txg_synctime), and the ones that are always active are the ones slowing you down.

If whole groups are unbalanced, that's a sign that they have different amounts of free space, and the expectation is that you will be gated by the speed of the group that needs to catch up.

-r

> Scott Meilicke wrote:
> > You can try:
> >
> > zpool iostat pool_name -v 1
> >
> > This will show you IO on each vdev at one second intervals. Perhaps you will see different IO behavior on any suspect drive.
> >
> > -Scott

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
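A minimal version of that blast test might look like the following (a sketch; POOL_FILE and the sizes are placeholders to adjust for your pool, e.g. a file under /zpool):

```shell
# Point POOL_FILE at a scratch file on the pool under test.
POOL_FILE=${POOL_FILE:-/tmp/dd.test.$$}

# Write a burst of zeros. While this runs, watch `iostat -xn 1` in
# another terminal: drives in a raidz group that stay busy (%b) after
# their siblings go idle are the ones gating the txg sync.
dd if=/dev/zero of="$POOL_FILE" bs=1024k count=16 2>/dev/null
```

Remember to remove the scratch file afterwards.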