I've got a Thumper doing nothing but serving NFS. It's using B43 with zil_disabled. The system is being consumed in waves, but by what I don't know. Notice vmstat:

 3 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 926 91 703 0 25 75
21 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 13 14 1720 21 1105 0 92 8
20 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 17 18 2538 70 834 0 100 0
25 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 745 18 179 0 100 0
37 0 0 25693552 2586240 0 0 0 0 0 0 0 0 0 7 7 1152 52 313 0 100 0
16 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 15 13 1543 52 767 0 100 0
17 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 2 2 890 72 192 0 100 0
27 0 0 25693572 2586260 0 0 0 0 0 0 0 0 0 15 15 3271 19 3103 0 98 2
 0 0 0 25693456 2586144 0 11 0 0 0 0 0 0 0 281 249 34335 242 37289 0 46 54
 0 0 0 25693448 2586136 0 2 0 0 0 0 0 0 0 0 0 2470 103 2900 0 27 73
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1062 105 822 0 26 74
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1076 91 857 0 25 75
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 917 126 674 0 25 75

These spikes of sys load come in waves like this. While there are close to a hundred systems mounting NFS shares on the Thumper, the amount of traffic is really low. Nothing to justify this. We're talking less than 10MB/s.

NFS is pathetically slow. We're using NFSv3 over TCP, shared via ZFS sharenfs, on a 3Gbps aggregation (3*1Gbps).

I've been slamming my head against this problem for days and can't make headway. I'll post some of my notes below. Any thoughts or ideas are welcome!

benr.

==

Step 1 was to disable any ZFS features that might consume large amounts of CPU:

# zfs set compression=off joyous
# zfs set atime=off joyous
# zfs set checksum=off joyous

These changes had no effect.

Next was to consider that perhaps NFS was doing name lookups when it shouldn't. Indeed "dns" was specified in /etc/nsswitch.conf, which won't work given that no DNS servers are accessible from the storage or private networks, but again, no improvement. In this process I removed dns from nsswitch.conf, deleted /etc/resolv.conf, and disabled the dns/client service in SMF.

Turning back to CPU usage, we can see the activity is almost all system (sys) time and comes in waves:

[private:/tmp] root# sar 1 100

SunOS private.thumper1 5.11 snv_43 i86pc 12/07/2006

10:38:05  %usr %sys %wio %idle
10:38:06     0   27    0    73
10:38:07     0   27    0    73
10:38:09     0   27    0    73
10:38:10     1   26    0    73
10:38:11     0   26    0    74
10:38:12     0   26    0    74
10:38:13     0   24    0    76
10:38:14     0    6    0    94
10:38:15     0    7    0    93
10:38:22     0   99    0     1  <--
10:38:23     0   94    0     6  <--
10:38:24     0   28    0    72
10:38:25     0   27    0    73
10:38:26     0   27    0    73
10:38:27     0   27    0    73
10:38:28     0   27    0    73
10:38:29     1   30    0    69
10:38:30     0   27    0    73

And so we consider whether or not there is a pattern to the frequency.
The following is sar output from any lines in which sys is above 90%:

10:40:04  %usr %sys %wio %idle   Delta
10:40:11     0   97    0     3
10:40:45     0   98    0     2   34 seconds
10:41:02     0   94    0     6   17 seconds
10:41:26     0  100    0     0   24 seconds
10:42:00     0  100    0     0   34 seconds
10:42:25  (end of sample)        >25 seconds

Looking at the congestion in the run queue:

[private:/tmp] root# sar -q 5 100

10:45:43  runq-sz %runocc swpq-sz %swpocc
10:45:51     27.0      85     0.0       0
10:45:57      1.0      20     0.0       0
10:46:02      2.0      60     0.0       0
10:46:13     19.8      99     0.0       0
10:46:23     17.7      99     0.0       0
10:46:34     24.4      99     0.0       0
10:46:41     22.1      97     0.0       0
10:46:48     13.0      96     0.0       0
10:46:55     25.3     102     0.0       0

Looking at the per-CPU breakdown (mpstat):

CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 324 224 0 0 0 154 0 0 0 100 0 0
1 0 0 0 114 0 226 0 10 13 0 86 0 1 0 99
2 0 0 0 162 138 149 0 5 4 0 0 0 1 0 99
3 0 0 0 55 6 46 0 4 3 0 0 0 1 0 99
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 310 210 34 0 17 171 7 5 0 100 0 0
1 0 0 0 152 1 200 0 17 26 5 59 1 65 0 34
2 0 0 0 271 197 175 1 13 20 2 0 0 66 0 34
3 0 0 0 120 6 76 0 6 14 2 0 0 66 0 34
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 1 318 204 46 1 19 22 5 18 0 100 0 0
1 0 0 0 112 0 40 0 12 15 3 4 0 100 0 0
2 0 0 0 195 93 31 1 11 27 3 0 0 100 0 0
3 0 0 0 108 1 20 0 9 31 3 2 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 569 204 809 3 133 432 1 21 0 99 0 1
1 0 0 0 502 0 1159 1 201 453 1 8 0 97 0 3
2 0 0 0 2305 2197 147 6 51 279 0 0 0 100 0 0
3 0 0 0 242 2 433 10 75 301 1 8 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 371 227 110 0 31 1470 0 80 0 99 0 1
1 0 0 3 1034 0 2079 0 102 241 0 9 0 10 0 90
2 0 0 0 1892 1718 1558 3 129 214 0 36 0 11 0 89
3 0 0 0 303 6 549 0 50 329 0 5 0 8 0 92
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 1 312 205 24 0 10 39 4 4 0 100 0 0
1 0 0 1 134 0 99 0 10 20 3 6 0 90 0 10
2 0 0 0 187 90 40 0 8 15 4 10 0 90 0 10
3 0 0 1 100 1 19 0 8 9 4 6 0 90 0 10
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 2 528 206 916 8 161 312 1 39 0 99 0 1
1 0 0 1 206 1 350 6 120 147 1 68 0 100 0 0
2 83 0 0 1929 1816 192 9 87 176 0 50 0 100 0 0
3 0 0 0 142 4 66 21 12 460 1 17 1 99 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 360 228 163 11 25 743 0 13 0 100 0 0
1 0 0 0 1587 0 4126 13 520 1435 0 122 0 24 0 76
2 3 0 0 7785 7669 1499 29 225 933 0 48 0 27 0 73
3 1 0 0 1856 6 4683 0 467 1912 0 37 0 23 0 77
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 323 223 0 0 0 244 0 0 0 100 0 0
1 0 0 0 141 0 277 0 14 11 0 1 0 0 0 100
2 0 0 0 307 259 312 0 16 15 0 3 0 1 0 99
3 0 0 0 113 6 165 0 5 5 0 91 0 0 0 100
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 324 224 0 0 0 203 0 0 0 100 0 0
1 0 0 0 143 0 283 0 11 2 0 13 0 1 0 99
2 0 0 0 312 278 298 0 8 11 0 3 0 1 0 99
3 0 0 0 119 6 175 0 3 5 0 69 0 1 0 99

Looking at prstat we can see the number of nfsd threads is low, much lower than the 1024 limit:

[private:/tmp] root# prstat
  PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
22643 daemon   2228K 1776K sleep   60  -20   2:55:52  55% nfsd/18

Interestingly, using prstat -mL to monitor thread latency, we see that a handful of threads are the culprits consuming most of the CPU:

[private:/tmp] root# prstat -mL
  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
22643 daemon   0.0  75 0.0 0.0 0.0 0.0  25 0.0 416   1   0   0 nfsd/1506
22643 daemon   0.0  75 0.0 0.0 0.0 0.0  25 0.0 415   0   0   0 nfsd/1563
22643 daemon   0.0  74 0.0 0.0 0.0 0.0  26 0.0 417   0   0   0 nfsd/1554
22643 daemon   0.0  74 0.0 0.0 0.0 0.0  26 0.0 419   0   0   0 nfsd/1551
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0  26  74 418   0   0   0 nfsd/1553
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1536
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1555
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418   0   0   0 nfsd/1539
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1562
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418   0   0   0 nfsd/1545
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1559
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 419   1   0   0 nfsd/1541
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1546
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1543
22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418   0   0   0 nfsd/1560
Total: 33 processes, 218 lwps, load averages: 4.64, 6.20, 5.86

Above, SLP is "the percentage of time the process has spent sleeping", LAT is the "percentage of time the process has spent waiting for CPU", VCX is the "number of voluntary context switches", and ICX is the number of involuntary context switches.

Using pstack to try and determine what these threads are doing, we see:

[private:/tmp] root# pstack 22643/1693
22643:  /usr/lib/nfs/nfsd
----------------- lwp# 1693 / thread# 1693 --------------------
 fee589b7 nfssys   (e, feba0fc8)
 fee5839f _thr_setup (fed4d400) + 51
 fee585f0 _lwp_start (fed4d400, 0, 0, 0, 0, 0)
[private:/tmp] root# pstack 22643/1708
22643:  /usr/lib/nfs/nfsd
----------------- lwp# 1708 --------------------------------
 fee589b7 nfssys   (e, febb0fc8)
 fee5839f _thr_setup (fed44c00) + 51
 fee585f0 _lwp_start (fed44c00, 0, 0, 0, 0, 0)
[private:/tmp] root# pstack 22643/1664
22643:  /usr/lib/nfs/nfsd
----------------- lwp# 1664 / thread# 1664 --------------------
 fee589b7 nfssys   (e, fec90fc8)
 fee5839f _thr_setup (fed43800) + 51
 fee585f0 _lwp_start (fed43800, 0, 0, 0, 0, 0)

Running zvop_times.d (http://blogs.sun.com/erickustarz/resource/zvop_times.d) I get an idea of what ZFS is doing:

[private:/] root# dtrace -s zvop_times.d
dtrace: script 'zvop_times.d' matched 66 probes
^C
CPU     ID                    FUNCTION:NAME
  1      2                             :END
ZFS COUNT

zfs_getsecattr 4
zfs_space 70
zfs_rename 111
zfs_readdir 284
zfs_read 367
zfs_mkdir 670
zfs_setattr 1054
zfs_frlock 1562
zfs_putpage 3916
zfs_write 4110
zfs_create 4178
zfs_remove 7794
zfs_fid 14960
zfs_inactive 17637
zfs_access 20809
zfs_fsync 29668
zfs_lookup 31273
zfs_getattr 175457

ZFS AVG TIME

zfs_fsync 2337
zfs_getattr 2735
zfs_access 2774
zfs_fid 2948
zfs_inactive 4081
zfs_lookup 6067
zfs_frlock 6750
zfs_putpage 6840
zfs_mkdir 9259
zfs_getsecattr 10940
zfs_setattr 15364
zfs_space 25214
zfs_readdir 25726
zfs_remove 26117
zfs_write 31942
zfs_rename 60504
zfs_read 144449
zfs_create 71215587

ZFS SUM TIME

zfs_getsecattr 43762
zfs_space 1765026
zfs_mkdir 6204101
zfs_rename 6715987
zfs_readdir 7306233
zfs_frlock 10544902
zfs_setattr 16194261
zfs_putpage 26787465
zfs_fid 44111892
zfs_read 53013073
zfs_access 57740468
zfs_fsync 69354886
zfs_inactive 71990292
zfs_write 131282275
zfs_lookup 189748320
zfs_remove 203558835
zfs_getattr 479998059
zfs_create 297538724997

[private:/] root# zfs get all joyous
NAME    PROPERTY       VALUE                  SOURCE
joyous  type           filesystem             -
joyous  creation       Fri Jul 28 17:08 2006  -
joyous  used           255G                   -
joyous  available      15.7T                  -
joyous  referenced     81.7K                  -
joyous  compressratio  1.21x                  -
joyous  mounted        yes                    -
joyous  quota          none                   default
joyous  reservation    none                   default
joyous  recordsize     128K                   default
joyous  mountpoint     /joyous                default
joyous  sharenfs       off                    default
joyous  checksum       off                    local
joyous  compression    off                    local
joyous  atime          off                    local
joyous  devices        on                     default
joyous  exec           on                     default
joyous  setuid         on                     default
joyous  readonly       off                    default
joyous  zoned          off                    default
joyous  snapdir        hidden                 default
joyous  aclmode        groupmask              default
joyous  aclinherit     secure                 default

So what exactly is nfsd doing???

[private:/] root# time dtrace -n 'syscall:::entry /execname == "nfsd"/ { @[probefunc] = count(); }'
dtrace: description 'syscall:::entry ' matched 225 probes
^C

open 3
close 5
lwp_mutex_timedlock 7
lwp_mutex_wakeup 7
pollsys 7
putmsg 8
getmsg 11
fcntl 12
lwp_park 30
mmap 41
munmap 42
fxstat 47
ioctl 58
lwp_continue 100
lwp_create 100
setcontext 100
lwp_kill 101
lwp_exit 110
schedctl 110
lwp_sigmask 138
nfs 203

real    3m22.464s
user    0m0.206s
sys     0m0.222s

And the answer is: not much of anything. Where is this CPU disappearing to?! Looking at the stacks:

[private:/] root# time dtrace -n 'syscall::nfs:entry /execname == "nfsd"/ { @[ustack()] = count(); }'
dtrace: description 'syscall::nfs:entry ' matched 1 probe
^C

libc.so.1`_nfssys+0x7
libc.so.1`_thr_setup+0x51
libc.so.1`_lwp_start
22

Not very helpful... let's try this:

[private:/] root# truss -c -p 22643
^C
syscall        seconds   calls  errors
close             .000       1
ioctl             .000       2
lwp_park          .000       1
lwp_unpark        .000       1
getmsg            .000       1
putmsg            .000       1
setustack         .000      30
nfssys         106.146      56
mmap              .000      20
munmap            .000       3
fxstat            .000       4
lwp_create        .001      30
lwp_exit          .000      12
lwp_continue      .000      30
lwp_kill          .000      13      13
lwp_sigmask       .000      14
pollsys           .000       1
schedctl          .000      12
               --------  ------   ----
sys totals:     106.149     232     13
usr time:          .001
elapsed:         47.230

Better. Clearly nfssys is a problem, but why?

[private:/] root# dtrace -n 'io:genunix::start { @[pid, execname] = count(); }'
dtrace: description 'io:genunix::start ' matched 3 probes
^C

    3  fsflush      1
  344  syslogd      2
22643  nfsd        59
    0  sched    30346

The fucking scheduler is doing more IO than nfsd!

Let's try iotop (http://www.brendangregg.com/DTrace/iotop). Below are 10 second intervals; DISKTIME is in microseconds (1000000 = 1s):

2006 Dec  7 12:06:47,  load: 8.08,  disk_r:   1408 KB,  disk_w:  14482 KB

UID   PID  PPID CMD      DEVICE  MAJ  MIN D  DISKTIME
  0     3     0 fsflush  sd6      54  389 W       242
  0     3     0 fsflush  sd6      54  384 W       251
  1 22643     1 nfsd     sd17     54 1088 R      2733
  1 22643     1 nfsd     sd14     54  896 R      4420
  1 22643     1 nfsd     sd13     54  832 R      5296
  1 22643     1 nfsd     sd18     54 1152 R      7103
  1 22643     1 nfsd     sd47     54 3008 R      7240
  1 22643     1 nfsd     sd31     54 1984 R     14827
  1 22643     1 nfsd     sd32     54 2048 R     16653
  1 22643     1 nfsd     sd45     54 2880 R     17768
  1 22643     1 nfsd     sd30     54 1920 R     17962
  1 22643     1 nfsd     sd22     54 1408 R     18687
  1 22643     1 nfsd     sd8      54  512 R     19693
  1 22643     1 nfsd     sd33     54 2112 R     20659
  1 22643     1 nfsd     sd19     54 1216 R     22933
  1 22643     1 nfsd     sd43     54 2752 R     23602
  1 22643     1 nfsd     sd7      54  448 R     27458
  1 22643     1 nfsd     sd10     54  640 R     28715
  1 22643     1 nfsd     sd49     54 3136 R     29640
  1 22643     1 nfsd     sd5      54  320 R     46201
  1 22643     1 nfsd     sd11     54  704 R     46775
  1 22643     1 nfsd     sd24     54 1536 R     60011  <-- 0.06s
  0     0     0 sched    sd16     54 1024 W    140682
  0     0     0 sched    sd47     54 3008 W    142082
  0     0     0 sched    sd18     54 1152 W    151720
  0     0     0 sched    sd13     54  832 W    157107
  0     0     0 sched    sd14     54  896 W    159044
  0     0     0 sched    sd17     54 1088 W    164386
  0     0     0 sched    sd35     54 2240 W    167119
  0     0     0 sched    sd34     54 2176 W    168552
  0     0     0 sched    sd37     54 2368 W    171872
  0     0     0 sched    sd40     54 2560 W    193907
  0     0     0 sched    sd25     54 1600 W    195662
  0     0     0 sched    sd33     54 2112 W    196480

Disk response times seem really good actually.
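(Illustrative sketch, not a script from the original thread.) One way to confirm that the writes attributed to sched are the ZFS transaction-group flush rather than something else is to sum I/O bytes by process and direction with the io provider; the b_flags/b_bcount fields come from the standard io provider bufinfo_t:

#!/usr/sbin/dtrace -s
/* Sum I/O bytes by process and direction, reported every 10 seconds. */
io:::start
{
        @bytes[execname, args[0]->b_flags & B_READ ? "R" : "W"] =
            sum(args[0]->b_bcount);
}

tick-10s
{
        printa("  %-12s %s %@d bytes\n", @bytes);
        trunc(@bytes);
}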
iosnoop (http://www.brendangregg.com/DTrace/iosnoop) shows that sched is doing a lot of writing, mostly 1 KB writes. Testing on aeon suggests this is normal.

truss -D also points to this nfssys slowness:

[private:/tmp] root# truss -D -p `pgrep nfsd`
...
/2:      0.0001 lwp_create(0xFED71D20, LWP_DETACHED|LWP_SUSPENDED, 0xFED71F48) = 3266
/2:      0.0001 lwp_continue(3266) = 0
/3266:  260128.5978     lwp_create()    (returning as new lwp ...) = 0
/3266:   0.0034 setustack(0xFED4BC60)
/3266:          nfssys(14, 0xFEC40FC8)  (sleeping...)
/3266:  43.1653 nfssys(14, 0xFEC40FC8) = 0
/3266:   0.0074 schedctl() = 0xFEF9F020
/3266:   0.0002 lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7) = 0xFFBFFEFF [0x0000FFFF]
/3266:   0.0002 lwp_kill(3268, SIG#0) Err#3 ESRCH
/3266:   0.0002 munmap(0xFEC20000, 4096) = 0
/3266:   0.0002 lwp_exit()

Locks...

[private:/tmp] root# lockstat -P sleep 10

Adaptive mutex spin: 20406 events in 10.026 seconds (2035 events/sec)

Count indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
   32 100% 100% 0.00 59963092 0xffffffffb282a1c8     dmu_object_alloc+0x75
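(zvop_times.d itself is linked above; the following is only a rough sketch of the technique it uses, trimmed to three VOPs, not the actual script.) The idea is to pair fbt entry/return probes and aggregate per-call latency:

#!/usr/sbin/dtrace -s
/* Per-VOP call counts and average latency, in the spirit of zvop_times.d. */
fbt::zfs_create:entry,
fbt::zfs_lookup:entry,
fbt::zfs_getattr:entry
{
        self->ts[probefunc] = timestamp;
}

fbt::zfs_create:return,
fbt::zfs_lookup:return,
fbt::zfs_getattr:return
/self->ts[probefunc]/
{
        @calls[probefunc]  = count();
        @avg_ns[probefunc] = avg(timestamp - self->ts[probefunc]);
        self->ts[probefunc] = 0;
}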
Hey Ben - I need more time to look at this and connect some dots, but real quick...

Some nfsstat data that we could use to correlate with the local server activity would be interesting. zfs_create() seems to be the heavy hitter, but a periodic kernel profile (especially if we can catch a 97% sys period) would help:

# lockstat -i997 -Ik -s 10 sleep 60

Alternatively:

# dtrace -n 'profile-997hz / arg0 != 0 / { @s[stack()]=count(); }'

It would also be interesting to see what the zfs_create()'s are doing. Perhaps a quick:

# dtrace -n 'zfs_create:entry { printf("ZFS Create: %s\n", stringof(args[0]->v_path)); }'

It would also be interesting to see the network stats. Grab Brendan's nicstat and collect some samples...

Your reference to low traffic is in bandwidth, which, as you indicate, is really, really low. But the data, at least up to this point, suggests the workload is not data/bandwidth intensive but rather attribute intensive. Note again that zfs_create() is the heavy ZFS function, along with zfs_getattr. Perhaps it's the attribute-intensive nature of the load that is at the root of this.

I can spend more time on this tomorrow (traveling today).

Thanks,
/jim

Ben Rockwood wrote:
> [...]
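(A sketch building on Jim's zfs_create one-liner, not from the original thread; the args[0]->v_path / args[1] layout is assumed from Jim's example, and v_path can be NULL for some vnodes.) Aggregating instead of printing each call keeps the output manageable:

#!/usr/sbin/dtrace -s
/* Count zfs_create() calls by parent directory path and new file name. */
fbt::zfs_create:entry
/args[0]->v_path != NULL/
{
        @creates[stringof(args[0]->v_path), stringof(args[1])] = count();
}

END
{
        trunc(@creates, 20);    /* keep only the 20 busiest directory/name pairs */
        printa(@creates);
}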
Hi Ben,

Your sar output shows one core pegged pretty much constantly! From the Solaris Performance and Tools book, the SLP state value covers "the remainder of important events such as disk and network waits... along with other kernel wait events... kernel locks or condition variables also accumulate time in this state."

ZFS COUNT     zfs_create 4178
ZFS AVG TIME  zfs_create 71215587
ZFS SUM TIME  zfs_create 297538724997

I think it looks like the system must be spinning in zfs_create(). Looking in usr/src/uts/common/fs/zfs/zfs_vnops.c, there are a couple of places it could loop:

1129    /*
1130     * Create a new file object and update the directory
1131     * to reference it.
1132     */
....
1154    error = dmu_tx_assign(tx, zfsvfs->z_assign);
1155    if (error) {
1156            zfs_dirent_unlock(dl);
1157            if (error == ERESTART &&
1158                zfsvfs->z_assign == TXG_NOWAIT) {
1159                    dmu_tx_wait(tx);
1160                    dmu_tx_abort(tx);
1161                    goto top;
1162            }

and

1201    /*
1202     * Truncate regular files if requested.
1203     */
1204    if ((ZTOV(zp)->v_type == VREG) &&
1205        (zp->z_phys->zp_size != 0) &&
1206        (vap->va_mask & AT_SIZE) && (vap->va_size == 0)) {
1207            error = zfs_freesp(zp, 0, 0, mode, TRUE);
1208            if (error == ERESTART &&
1209                zfsvfs->z_assign == TXG_NOWAIT) {
1210                    /* NB: we already did dmu_tx_wait() */
1211                    zfs_dirent_unlock(dl);
1212                    VN_RELE(ZTOV(zp));
1213                    goto top;

I think the snoop would be very useful to pore over.

Cheers,

Alan
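(A sketch, not from the thread, of one way to test this retry-loop theory.) On fbt return probes arg1 is the function's return value, so watching dmu_tx_assign() return codes and dmu_tx_wait() callers would show whether zfs_create() is going around the goto-top loop:

#!/usr/sbin/dtrace -s
/* Distribution of dmu_tx_assign() return codes and dmu_tx_wait() callers. */
fbt::dmu_tx_assign:return
{
        @rc[arg1] = count();            /* non-zero (e.g. ERESTART) means the tx was not assigned */
}

fbt::dmu_tx_wait:entry
{
        @waiters[stack(10)] = count();
}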
Ben Rockwood wrote:
> I've got a Thumper doing nothing but serving NFS. It's using B43 with
> zil_disabled. The system is being consumed in waves, but by what I
> don't know. Notice vmstat:

We made several performance fixes in the NFS/ZFS area in recent builds, so if possible it would be great to upgrade you from snv_43. That said, there might be something else going on that we haven't accounted for.

> Step 1 was to disable any ZFS features that might consume large amounts of CPU:
>
> # zfs set compression=off joyous
> # zfs set atime=off joyous
> # zfs set checksum=off joyous

In our performance testing, we haven't found checksums to be anywhere near a large consumer of CPU, so I would recommend leaving that on (due to its benefits). I suspect your apps/clients don't depend on atime, so I think it's a good idea to turn that off; we've gotten better NFS performance with it off. More of a heads up, as it sounds like compression on/off isn't your problem. If you are not getting good I/O bandwidth with compression turned on, it's most likely due to:

6460622 zio_nowait() doesn't live up to its name

As Jim mentioned, using lockstat to figure out where your CPU is being spent is the first step. I've been using 'lockstat -kgIW -D 60 sleep 60'. That collects data for the top 60 callers over a 1 minute period. If you see 'mutex_enter' high up in the results, then we have at least mutex lock contention.

> Interestingly, using prstat -mL to monitor thread latency, we see that a handful of threads are the culprits consuming most of the CPU:
> [...]

The high SYS times being charged to the userland nfsd threads are representative of what the kernel threads are doing (which is most likely going to be NFS and ZFS).

> Running zvop_times.d (http://blogs.sun.com/erickustarz/resource/zvop_times.d) I get an idea of what ZFS is doing:
> [...]
> ZFS AVG TIME
> [...]
> zfs_create 71215587

An average of 71 msecs per create seems high. Doing a quantize instead of just an average on zfs_create() would be interesting (should have included that in my script)...

> [private:/tmp] root# lockstat -P sleep 10
>
> Adaptive mutex spin: 20406 events in 10.026 seconds (2035 events/sec)
>
> Count indv cuml rcnt     spin Lock                   Caller
> -------------------------------------------------------------------------------
>    32 100% 100% 0.00 59963092 0xffffffffb282a1c8     dmu_object_alloc+0x75

That's interesting. This is build snv_43, right? It doesn't exactly line up with current bits, but I would guess this is "mutex_enter(&osi->os_obj_lock);". Can you show us the output for a longer sampling period, such as 'lockstat -P -D 30 sleep 60'? The "Adaptive mutex spin" category is usually the most/only interesting one in my experience.

I would be curious to know your memory usage. What is the physical memory on the machine? And can you run:

# echo 'arc::print -d' | mdb -k
# echo '::kmastat' | mdb -k | sort -n -k 5 | tail -40

Matt has a nice little dscript that figures out how long txgs are taking (I'll attach that). I believe it will work with old bits (I defined "OLD" for that to work). Can you try running that?

Can you run 'zpool list' and 'zpool status'? I'd like to know how full your pool is.

Let's start with that info,
eric

[attachment: zfs_sync.d - http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20061207/9b11f986/attachment.ksh]
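(A sketch of the quantize Eric suggests, plus a rough stand-in for the txg-timing idea; the actual zfs_sync.d attachment is not reproduced in the archive, and spa_sync() is assumed here as a proxy for txg sync time.)

#!/usr/sbin/dtrace -s
/* Latency distribution for zfs_create() and time spent in each txg sync. */
fbt::zfs_create:entry  { self->cts = timestamp; }
fbt::zfs_create:return
/self->cts/
{
        @["zfs_create (ns)"] = quantize(timestamp - self->cts);
        self->cts = 0;
}

fbt::spa_sync:entry    { self->sts = timestamp; }
fbt::spa_sync:return
/self->sts/
{
        @["spa_sync, i.e. txg sync (ns)"] = quantize(timestamp - self->sts);
        self->sts = 0;
}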
Ben,

The attached dscript might help in determining the zfs_create issue. It prints:
- a count of all functions called from zfs_create
- average wall clock time of the 30 highest functions
- average cpu time of the 30 highest functions

Note, please ignore warnings of the following type:

dtrace: 1346 dynamic variable drops with non-empty dirty list

Neil.

Ben Rockwood wrote on 12/07/06 06:01:
> [...]
[attachment: zfs_create.d - http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20061207/34b9112f/attachment.ksh]
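(The attachment itself was scrubbed by the archive; the following is only a guess at the idea behind a script producing the three outputs Neil describes, not his actual code. The thread-local arrays are also the kind of thing that triggers the "dynamic variable drops" warnings he mentions.)

#!/usr/sbin/dtrace -s
/* Count, average wall time, and average CPU time of zfs functions called under zfs_create(). */

/* Clause order keeps zfs_create itself out of the per-callee stats. */
fbt:zfs::entry
/self->in/
{
        @calls[probefunc] = count();
        self->wts[probefunc] = timestamp;
        self->cts[probefunc] = vtimestamp;
}

fbt::zfs_create:entry  { self->in = 1; }

fbt:zfs::return
/self->in && self->wts[probefunc]/
{
        @wall[probefunc] = avg(timestamp - self->wts[probefunc]);
        @cpu[probefunc]  = avg(vtimestamp - self->cts[probefunc]);
        self->wts[probefunc] = 0;
        self->cts[probefunc] = 0;
}

fbt::zfs_create:return { self->in = 0; }

END
{
        trunc(@wall, 30);
        trunc(@cpu, 30);
        printa(@calls);
        printa(@wall);
        printa(@cpu);
}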
0.4 seconds of CPU on average to do a dmu_object_alloc is a wee sluggish! I suspect, however, it''s contention on osi->os_obj_lock as we don''t seem to be looping in dmu_object_alloc. I''m still confused though, I believe that locking an adaptive mutex will spin for a short period then context switch and so they shouldn''t be burning CPU - at least not .4s worth! EricK suggested running "lockstat -kgIW -D 60 sleep 60" which I think is the next step. Looking back at your original email, I see you had observed the same contention on os_obj_lock. I don''t enough much about the dmu to know if anything has been fixed in that area. Neil. Ben Rockwood wrote On 12/07/06 13:53,:> Neil Perrin wrote: > >> Ben, >> >> The attached dscript might help determining the zfs_create issue. >> It prints: >> - a count of all functions called from zfs_create >> - average wall count time of the 30 highest functions >> - average cpu time of the 30 highest functions >> >> Note, please ignore warnings of the following type: >> >> dtrace: 1346 dynamic variable drops with non-empty dirty list > > > The results are below. I keep bumping into dmu_object_alloc, but I''m > uncertain what is or isn''t normal in that reguard. > > benr. > > > [private:/tmp] root# dtrace -s zfs_create.d > dtrace: script ''zfs_create.d'' matched 2524 probes > dtrace: 508 dynamic variable drops with non-empty dirty list > ^C > CPU ID FUNCTION:NAME > 1 2 :END > counts > arc_read 1 > arc_set_callback 1 > buf_hash_find 1 > dbuf_read_done 1 > dmu_object_next 2 > dmu_zfetch_dofetch 2 > dmu_zfetch_fetch 2 > dmu_zfetch_fetchsz 2 > dnode_next_offset 2 > dnode_next_offset_level 2 > dbuf_hash_insert 3 > dbuf_set_data 3 > dmu_zfetch 3 > dmu_zfetch_find 3 > arc_buf_alloc 6 > zfs_v4_to_unix 6 > dbuf_prefetch 10 > mzap_addent 11 > mze_find 11 > mze_find_unused_cd 11 > mze_insert 11 > dbuf_findbp 13 > dmu_tx_check_ioerr 13 > dnode_block_freed 13 > spa_get_dsl 13 > buf_hash_remove 36 > arc_change_state 37 > arc_release 40 > dbuf_fix_old_data 40 > fzap_add 41 > fzap_add_cd 41 > zap_entry_create 41 > zap_increment_num_entries 41 > zap_put_leaf_maybe_grow_ptrtbl 41 > arc_tempreserve_clear 52 > dmu_tx_commit 52 > dsl_dir_tempreserve_clear 52 > txg_rele_to_sync 52 > zap_add 52 > zfs_dirent_unlock 52 > zfs_link_create 52 > zfs_log_create 52 > zil_commit 52 > zfs_mknode 55 > zfs_acl_chmod 56 > zfs_acl_fixup_canonical_six 56 > zfs_aclset_common 56 > zfs_mode_compute 56 > zfs_perm_init 56 > zfs_znode_dmu_init 56 > dbuf_create_bonus 57 > dmu_bonus_hold 57 > dmu_object_alloc 57 > zfs_acl_append 57 > zfs_znode_alloc 57 > dmu_tx_add_new_object 58 > dnode_allocate 58 > dmu_buf_set_user 59 > dmu_buf_update_user 59 > dbuf_create 60 > dbuf_read_impl 60 > dmu_zfetch_init 61 > dnode_create 61 > dnode_new_blkid 61 > dnode_setdblksz 61 > dnode_clear_range 62 > zio_buf_alloc 67 > zap_table_load 68 > dsl_dataset_dirty 69 > dbuf_fix_old_bonus_data 71 > fzap_lookup 74 > zap_lookup 74 > zap_leaf_array_create 82 > dmu_tx_hold_zap 85 > arc_tempreserve_space 86 > dmu_tx_assign 86 > dmu_tx_count_write 86 > dmu_tx_get_txg 86 > dmu_tx_try_assign 86 > dsl_dir_tempreserve_impl 86 > dsl_dir_tempreserve_space 86 > dsl_pool_adjustedsize 86 > spa_get_dspace 86 > txg_hold_open 86 > txg_rele_to_quiesce 86 > dmu_tx_create 95 > dmu_tx_create_dd 95 > zfs_acl_node_read 95 > zfs_acl_node_read_internal 95 > zfs_dirent_lock 95 > zfs_zaccess 95 > zfs_zaccess_common 95 > zfs_time_stamper_locked 109 > dbuf_hold_level 112 > bp_get_dasize 115 > dbuf_new_block 115 > dmu_tx_is_syncing 115 > fzap_checksize 115 > 
zap_deref_leaf 115 > zap_get_leaf_byblk 115 > zap_idx_to_blk 115 > zap_leaf_lookup 115 > zap_put_leaf 115 > dnode_cons 122 > zap_hash 126 > zap_lockdir 126 > zap_unlockdir 126 > dnode_add_ref 130 > zio_next_stage_async 133 > zio_pop_transform 133 > zio_ready 133 > zio_wait 133 > zio_wait_children_ready 133 > spa_config_enter 134 > spa_config_exit 134 > zio_clear_transform_stack 134 > zio_create 134 > zio_done 134 > zio_null 134 > zio_push_transform 134 > zio_root 134 > zio_wait_children_done 134 > zfs_acl_free 151 > zfs_acl_alloc 152 > zap_leaf_chunk_alloc 165 > mze_compare 168 > arc_buf_remove_ref 170 > remove_reference 170 > arc_released 173 > zfs_ace_access 184 > dmu_tx_hold_bonus 187 > arc_buf_add_ref 206 > add_reference 207 > arc_access_and_exit 207 > dmu_tx_willuse_space 229 > dnode_willuse_space 229 > dsl_dir_willuse_space 229 > dnode_setdirty 233 > dbuf_add_ref 256 > zio_wait_for_children 267 > dmu_tx_count_dnode 276 > spa_get_asize 287 > dsl_dataset_block_freeable 296 > dmu_buf_hold 309 > dmu_tx_hold_object_impl 333 > zfs_set_ace 336 > vdev_lookup_top 345 > zio_next_stage 401 > dbuf_will_dirty 424 > buf_hash 453 > dsl_dataset_prev_snap_txg 477 > dnode_rele 580 > dbuf_dirty 588 > dnode_hold 604 > dnode_hold_impl 664 > dsl_dir_dirty 817 > txg_list_add 886 > dmu_buf_get_user 907 > dbuf_hold 975 > dbuf_whichblock 978 > dbuf_rele 1099 > dbuf_hold_impl 1103 > dbuf_find 1113 > dbuf_hash 1116 > parent_delta 1119 > dsl_dir_estimated_space 1120 > dbuf_update_data 1222 > ldv 1301 > dbuf_read 1487 > > av wall clock > dnode_allocate 185006 > zap_table_load 193232 > dbuf_will_dirty 231433 > zap_entry_create 255027 > zap_lockdir 258613 > dmu_zfetch_find 259608 > dmu_zfetch 269971 > dnode_setdirty 309299 > dmu_zfetch_fetch 367224 > dmu_buf_hold 368589 > dmu_zfetch_dofetch 376191 > dbuf_dirty 386128 > dnode_hold_impl 535814 > dnode_hold 566143 > dmu_tx_hold_bonus 617734 > dmu_tx_hold_object_impl 663582 > dmu_tx_add_new_object 673663 > dmu_tx_hold_zap 706378 > dnode_new_blkid 761775 > fzap_lookup 782656 > dmu_bonus_hold 791209 > zap_get_leaf_byblk 841362 > zap_deref_leaf 973677 > zap_lookup 1085355 > zap_add 1837000 > fzap_add_cd 1916822 > fzap_add 1932422 > zfs_link_create 2353198 > dmu_object_alloc 403070558 > zfs_mknode 417633571 > > av cpu > dmu_tx_add_new_object 14083 > dbuf_will_dirty 14796 > dmu_buf_hold 17393 > dmu_tx_count_write 19129 > zfs_acl_chmod 19416 > dbuf_dirty 19525 > dmu_tx_try_assign 20067 > zap_lockdir 20951 > zap_table_load 21232 > dmu_tx_assign 21946 > zfs_perm_init 23304 > zap_increment_num_entries 25355 > dnode_allocate 25526 > zap_entry_create 28102 > dmu_zfetch_find 28629 > dmu_zfetch 30387 > dmu_bonus_hold 37614 > dmu_zfetch_fetch 38222 > dmu_zfetch_dofetch 39842 > zap_get_leaf_byblk 40255 > fzap_lookup 40518 > zap_deref_leaf 54850 > zap_lookup 65315 > dmu_tx_hold_zap 86567 > zap_add 157876 > fzap_add_cd 158310 > fzap_add 160400 > zfs_link_create 191034 > dmu_object_alloc 401750326 > zfs_mknode 416439036 >
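Neil's numbers above are averages; a distribution would show whether every dmu_object_alloc() is uniformly slow or whether a few badly contended calls are dragging the mean up (Eric makes the same point about zfs_create() further down in the thread). A minimal quantizing sketch - untested here, and assuming snv_43 still exposes entry/return probes for these functions under these names - might look like:

#!/usr/sbin/dtrace -s

zfs_create:entry,
dmu_object_alloc:entry
{
        self->ts[probefunc] = timestamp;
}

zfs_create:return,
dmu_object_alloc:return
/self->ts[probefunc]/
{
        @lat[probefunc] = quantize(timestamp - self->ts[probefunc]);
        self->ts[probefunc] = 0;
}

tick-60s
{
        exit(0);
}

A histogram with most calls in the tens of microseconds plus a long tail out to hundreds of milliseconds would point at lock contention rather than a uniformly slow allocator.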
Anton B. Rang
2006-Dec-07 23:11 UTC
[zfs-discuss] Re: A Plea for Help: Thumper/ZFS/NFS/B43
> I'm still confused though, I believe that locking an adaptive mutex will spin for a short
> period then context switch and so they shouldn't be burning CPU - at least not .4s worth!

An adaptive mutex will spin as long as the thread which holds the mutex is on CPU. If the lock is moderately contended, you can wind up with threads spinning for quite a while as ownership of the lock passes from thread to thread across CPUs.

Mutexes in Solaris tend to be most useful when they're held for very short periods of time; they also work pretty well if the owning thread blocks. If somebody is computing for quite a while while holding them (e.g. if dnode_next_offset is repeatedly called and is slow), they can waste a lot of time on other CPUs. In this case an rwlock usually works better.

Anton
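lockstat -P already names the hottest caller, but the lockstat DTrace provider can show the same spinning with full stacks, which helps distinguish Anton's scenario (ownership bouncing between CPUs) from a single long hold. A rough sketch - it assumes only that the adaptive-spin probe is available on this build; a predicate on the specific lock address would have to come from a fresh lockstat run, since 0xffffffffb282a1c8 will change across boots:

#!/usr/sbin/dtrace -s

lockstat:::adaptive-spin
{
        /*
         * Optionally add a predicate on arg0 (the lock address) to watch
         * just the suspected os_obj_lock reported by lockstat -P.
         */
        @spins[stack(5)] = count();
}

tick-60s
{
        trunc(@spins, 10);
        exit(0);
}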
Ben Rockwood wrote:> Eric Kustarz wrote: > >> Ben Rockwood wrote: >> > I''ve got a Thumper doing nothing but serving NFS. Its using B43 with >> > zil_disabled. The system is being consumed in waves, but by what I >> > don''t know. Notice vmstat: >> >> We made several performance fixes in the NFS/ZFS area in recent >> builds, so if possible it would be great to upgrade you from snv_43. >> That said, there might be something else going on that we haven''t >> accounted for. >> ... >> >>> >>> Step 1 was to disable any ZFS features that might consume large >>> amounts of CPU: >>> >>> # zfs set compression=off joyous >>> # zfs set atime=off joyous >>> # zfs set checksum=off joyous >> >> >> In our performance testing, we haven''t found checksums to be anywhere >> near a large consumer of CPU, so i would recommend leaving that on >> (due to its benefits). >> >> I suspect your apps/clients don''t depend on atime, so i think its a >> good idea to turn that off. We''ve gotten better NFS performance with >> this off. >> >> More of a heads up as it sounds like compression on/off isn''t your >> problem. If you are not getting good I/O BW with compression turned >> on, its most likely due to: >> 6460622 zio_nowait() doesn''t live up to its name >> >> As Jim, mentioned, using lockstat to figure out where your CPU is >> being spent is the first step. I''ve been using ''lockstat -kgIW -D 60 >> sleep 60''. That collects data for the top 60 callers for a 1 minute >> period. If you see ''mutex_enter'' high up in the results, then we have >> at least mutex lock contention. >> >> >> ... >> >> >>> Interestingly, using prstat -mL to monitor thread latency, we see >>> that a handful of threads are the culprates for consuming mass CPU: >>> >>> [private:/tmp] root# prstat -mL >>> PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG >>> PROCESS/LWPID 22643 daemon 0.0 75 0.0 0.0 0.0 0.0 25 0.0 416 >>> 1 0 0 nfsd/1506 >>> 22643 daemon 0.0 75 0.0 0.0 0.0 0.0 25 0.0 415 0 0 0 >>> nfsd/1563 >>> 22643 daemon 0.0 74 0.0 0.0 0.0 0.0 26 0.0 417 0 0 0 >>> nfsd/1554 >>> 22643 daemon 0.0 74 0.0 0.0 0.0 0.0 26 0.0 419 0 0 0 >>> nfsd/1551 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 26 74 418 0 0 0 >>> nfsd/1553 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 >>> nfsd/1536 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 >>> nfsd/1555 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 >>> nfsd/1539 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 >>> nfsd/1562 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 >>> nfsd/1545 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 >>> nfsd/1559 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 419 1 0 0 >>> nfsd/1541 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 >>> nfsd/1546 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 >>> nfsd/1543 >>> 22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 >>> nfsd/1560 >>> Total: 33 processes, 218 lwps, load averages: 4.64, 6.20, 5.86 >> >> >> The high SYS times being charged to the userland nfsd threads is >> representative of what the kernel threads are doing (which is most >> likely going to be NFS and ZFS). 
>> >> >>> Running zvop_times.d >>> (http://blogs.sun.com/erickustarz/resource/zvop_times.d) I get an >>> idea of what ZFS is doing: >>> >>> [private:/] root# dtrace -s zvop_times.d >>> dtrace: script ''zvop_times.d'' matched 66 probes >>> ^C >>> CPU ID FUNCTION:NAME >>> 1 2 :END ZFS COUNT >>> >>> zfs_getsecattr 4 >>> zfs_space 70 >>> zfs_rename 111 >>> zfs_readdir 284 >>> zfs_read 367 >>> zfs_mkdir 670 >>> zfs_setattr 1054 >>> zfs_frlock 1562 >>> zfs_putpage 3916 >>> zfs_write 4110 >>> zfs_create 4178 >>> zfs_remove 7794 >>> zfs_fid 14960 >>> zfs_inactive 17637 >>> zfs_access 20809 >>> zfs_fsync 29668 >>> zfs_lookup 31273 >>> zfs_getattr 175457 >>> >>> >>> >>> ZFS AVG TIME >>> >>> zfs_fsync 2337 >>> zfs_getattr 2735 >>> zfs_access 2774 >>> zfs_fid 2948 >>> zfs_inactive 4081 >>> zfs_lookup 6067 >>> zfs_frlock 6750 >>> zfs_putpage 6840 >>> zfs_mkdir 9259 >>> zfs_getsecattr 10940 >>> zfs_setattr 15364 >>> zfs_space 25214 >>> zfs_readdir 25726 >>> zfs_remove 26117 >>> zfs_write 31942 >>> zfs_rename 60504 >>> zfs_read 144449 >>> zfs_create 71215587 >> >> >> An average of 71 msecs per create seems high. Doing a quantize >> instead of just an average on zfs_create() would be interesting >> (should have included that in my script)... >> >> ... >> >> >>> Locks... >>> >>> [private:/tmp] root# lockstat -P sleep 10 >>> >>> Adaptive mutex spin: 20406 events in 10.026 seconds (2035 events/sec) >>> >>> Count indv cuml rcnt spin Lock >>> Caller >>> ------------------------------------------------------------------------------- >>> >>> 32 100% 100% 0.00 59963092 0xffffffffb282a1c8 >>> dmu_object_alloc+0x75 \ >> >> >> That''s interesting. This is build snv_43, right? Doesn''t exactly >> line up with current bits, but i would guess this is " >> mutex_enter(&osi->os_obj_lock);". >> >> Can you show us the output for a longer sampling period, such as >> ''lockstat -P -D 30 sleep 60''? The "Adaptive mutex spin" category is >> usually the most/only interesting one in my experience. >> >> I would be curious to know your memory usage. What is the physical >> memory on the machine? >> >> And can you run: >> # echo ''arc::print -d'' | mdb -k >> # echo ''::kmastat'' | mdb -k | sort -n -k 5 | tail -40 >> >> Matt has a nice little dscript that figures out how long txgs are >> taking (i''ll attach that). I believe it will work with old bits (i >> defined "OLD" the for that to work). Can you try running that? >> >> Can you run ''zpool list'' and ''zpool status''? I''d like to know how >> full your pool is. >> >> Let''s start with that info, >> eric > > > > Here is a bunch of the data you asked for. I can''t tell you how much I > appreciate the help from everyone. > > Lockstat: > > [private:/tmp] root# lockstat -kgIW -D 60 sleep 60Hmm, looks like we caught the wrong 60 seconds :/ ... from the below, we have 78% idle CPU, 20% at and "below" nfssys. Though we still have good data here to go on. Most of the time of what the NFS server is doing CPU-wise is in zfs_mknode()/dmu_object_alloc() - 17%. Also, seeing 8% of the CPU in mutex_vector_enter() suggests lock contention (which from your previous posts is probably ''osi->os_obj_lock''). dmu_object_alloc() is being a pig at 17% because each of the three main functions called under the ''osi->os_obj_lock'' are being mini-pigs: each at 4%. Those three functions are: dnode_next_offset(), dnode_hold_impl(), dmu_object_next(). So i''m guessing there''s lots of files being created over NFS in one particular dataset? 
We should figure out how many creates/second you are doing over NFS (i should have put a timeout on the script). Here''s a real simple one (from your snoop it looked like you''re only doing NFSv3, so i''m not tracking NFSv4): " #!/usr/sbin/dtrace -s rfs3_create:entry, zfs_create:entry { @creates[probefunc] = count(); } tick-60s { exit(0); } " If i''m reading the code right and if you are just creating a lot of files in just one dataset, perhaps you could create another dataset or two in place of just the one - not sure if that is possible for you. It would be interesting data to see if this alleviates pain.> > Profiling interrupt: 23296 events in 60.046 seconds (388 events/sec) > > Count genr cuml rcnt nsec Hottest CPU+PIL > Caller > ------------------------------------------------------------------------------- > > 18290 79% ---- 0.00 21994 cpu[3] > thread_start 18102 78% ---- 0.00 21988 > cpu[3] idle 18024 77% ---- 0.00 > 21983 cpu[3] cpu_halt 4626 20% ---- > 0.00 25530 cpu[2] sys_syscall32 4626 20% > ---- 0.00 25530 cpu[2] nfssys 4626 > 20% ---- 0.00 25530 cpu[2] svc_do_run > 4624 20% ---- 0.00 25539 cpu[2] > svc_run 4522 19% ---- 0.00 25642 > cpu[2] svc_getreq 4502 19% ---- 0.00 > 25662 cpu[2] rfs_dispatch 4495 19% ---- > 0.00 25660 cpu[2] common_dispatch 3951 17% > ---- 0.00 26143 cpu[2] rfs3_create 3942 > 17% ---- 0.00 26156 cpu[2] fop_create > 3941 17% ---- 0.00 26154 cpu[2] > zfs_create 3922 17% ---- 0.00 26150 > cpu[2] zfs_mknode 3913 17% ---- 0.00 > 26163 cpu[2] dmu_object_alloc 1845 8% ---- > 0.00 20288 cpu[0] mutex_vector_enter 970 4% > ---- 0.00 31591 cpu[2] dmu_object_next 917 > 4% ---- 0.00 31673 cpu[2] dnode_next_offset > 899 4% ---- 0.00 30573 cpu[2] > dnode_hold_impl 883 4% ---- 0.00 31639 > cpu[2] dnode_next_offset_level > 725 3% ---- 0.00 30896 cpu[2] > dbuf_hold_impl 627 3% ---- 0.00 28908 > cpu[2] mutex_enter 361 2% ---- 0.00 > 31135 cpu[2] dmu_buf_rele 350 2% ---- > 0.00 29945 cpu[2] dbuf_hold 350 2% > ---- 0.00 31147 cpu[2]+6 mac_rx 341 > 1% ---- 0.00 31344 cpu[2] arc_buf_add_ref > 303 1% ---- 0.00 31154 cpu[2]+6 > intr_thread 290 1% ---- 0.00 31074 > cpu[2]+6 av_dispatch_autovect 259 1% ---- 0.00 > 24534 cpu[1] zio_next_stage 257 1% ---- > 0.00 31007 cpu[2] arc_buf_remove_ref 244 1% > ---- 0.00 31165 cpu[2]+6 e1000g_intr 243 > 1% ---- 0.00 31507 cpu[2] buf_hash > 218 1% ---- 0.00 25154 cpu[0] > putnext 186 1% ---- 0.00 21154 > cpu[0] svc_sendreply 184 1% ---- 0.00 > 21194 cpu[0] svc_cots_ksend 175 1% ---- > 0.00 31147 cpu[2]+6 aggr_recv_cb 172 1% > ---- 0.00 31151 cpu[2]+6 i_dls_link_ether_rx 159 > 1% ---- 0.00 31155 cpu[2]+6 ip_input > 158 1% ---- 0.00 21275 cpu[0] > put 155 1% ---- 0.00 30127 > cpu[2] dbuf_find 152 1% ---- 0.00 > 21597 cpu[0] rmm_wput 147 1% ---- > 0.00 21688 cpu[0] mir_wput 141 1% > ---- 0.00 31153 cpu[2]+6 squeue_enter_chain 138 > 1% ---- 0.00 26152 cpu[3] zio_next_stage_async > 137 1% ---- 0.00 31165 cpu[2]+6 > tcp_rput_data 134 1% ---- 0.00 21688 > cpu[0] tcp_wput 128 1% ---- 0.00 > 21804 cpu[0] squeue_enter 123 1% ---- > 0.00 21659 cpu[0] tcp_output 104 0% > ---- 0.00 25306 cpu[1] zio_nowait 102 > 0% ---- 0.00 30574 cpu[2] atomic_add_ptr_nv > 91 0% ---- 0.00 22605 cpu[0] tcp_send_data > 89 0% ---- 0.00 21472 cpu[1] > nfs3_fhtovp 87 0% ---- 0.00 27427 > cpu[2] rw_enter 86 0% ---- 0.00 > 21176 cpu[1] fsop_vget 84 0% ---- > 0.00 24945 cpu[1] zio_vdev_io_start 81 0% > ---- 0.00 21110 cpu[1] zfs_vget 81 > 0% ---- 0.00 25006 cpu[3] zio_wait_for_children > 78 0% ---- 0.00 31423 cpu[2] atomic_add_ptr > 74 0% ---- 0.00 31148 cpu[2]+11 > rmm_rput 73 
0% ---- 0.00 22876 > cpu[0] dls_tx > ------------------------------------------------------------------------------- > > > > NFS Stats: > > [private:/tmp] root# nfsstat -s -v 3 > > Server rpc: > Connection oriented: > calls badcalls nullrecv badlen xdrcall dupchecks > dupreqs 154161041 0 0 0 0 > 35669015 0 Connectionless: > calls badcalls nullrecv badlen xdrcall dupchecks > dupreqs 0 0 0 0 0 > 0 0 > Server NFSv3: > calls badcalls 153760861 0 Version 3: (153482035 calls) > null getattr setattr lookup access > readlink 1 0% 68265095 44% 2334522 1% 33186002 21% 9373979 > 6% 4430 0% read write create mkdir > symlink mknod 776093 0% 6496524 4% 6353313 4% 1883254 > 1% 814 0% 2 0% remove rmdir rename > link readdir readdirplus 12084296 7% 2692 0% 208683 > 0% 5969808 3% 23751 0% 335079 0% fsstat fsinfo > pathconf commit 6 0% 1 0% 2 0% 6183688 4% > Server nfs_acl: > Version 3: (278383 calls) > null getacl setacl getxattrdir > 0 0% 32060 11% 5 0% 246318 88% > > Memory: > > [private:/tmp] root# echo ''arc::print -d'' | mdb -k > { > anon = ARC_anon > mru = ARC_mru > mru_ghost = ARC_mru_ghost > mfu = ARC_mfu > mfu_ghost = ARC_mfu_ghost > size = 0t73377331206.8GB in the ARC out of a possible 16GB (i think thats how much you have) is pretty decent.> p = 0t3673915285 > c = 0t7338496468 > c_min = 0t533448320 > c_max = 0t15996604416 > hits = 0t6864679166 > misses = 0t518711A 99% cache hit ratio - impressive!> deleted = 0t295199 > skipped = 0t1701 > hash_elements = 0t500010 > hash_elements_max = 0t511362 > hash_collisions = 0t1820702 > hash_chains = 0t149161 > hash_chain_max = 0t10 > no_grow = 0 > } > [private:/tmp] root# echo ''::kmastat'' | mdb -k | sort -n -k 5 | tail -40 > zio_buf_10240 10240 1524 1642 16814080 46367 0 > zio_buf_3072 3072 4663 5604 17215488 354215 0 > streams_dblk_1936 2048 43 8418 17240064 159700970 0 > zio_buf_77824 77824 221 226 17588224 1464 0 > zio_buf_28672 28672 660 664 19038208 9233 0 > zio_buf_32768 32768 578 595 19496960 74273 0 > zio_buf_3584 3584 4492 5952 21331968 186767 0 > zio_buf_24576 24576 876 885 21749760 15378 0 > zio_buf_14336 14336 1008 1556 22306816 2101355 0 > kmem_va_8192 8192 2361 2752 22544384 4067 0 > kmem_alloc_128 128 161894 173042 22863872 41686966 0 > zio_buf_4096 4096 6002 6016 24641536 193926 0 > kmem_magazine_15 128 207407 212939 28135424 237076 0 > zio_buf_5120 5120 5031 6144 31457280 126567 0 > zio_buf_20480 20480 1557 1567 32092160 39694 0 > kmem_alloc_8 8 4064087 4064240 32964608 15892691 0 > kmem_va_32768 32768 595 1028 33685504 1731 0 > kmem_va_24576 24576 885 1560 40894464 2278 0 > kmem_alloc_112 112 289900 366192 41664512 7743803 0 > kmem_alloc_40 40 806400 1048077 42504192 14060780 0 > kmem_alloc_256 256 187308 191970 52420608 33326285 0 > arc_buf_hdr_t 128 500219 512089 67661824 14104876 0 > kmem_slab_cache 56 1239939 1250640 71147520 1612006 0 > kmem_va_12288 12288 4609 5690 74579968 8137 0 > zfs_znode_cache 192 544134 545643 106426368 7771071 0ok, cool, the ''znode_cache'' is not a problem (we''re actually using most of the buffers allocated to the cache). 
That''s good news.> kmem_va_28672 28672 2651 3292 107872256 6270 0 > rootnex_dmahdl 2504 42668 42832 109649920 32118262 0 > kmem_io_64G_4096 4096 32768 32768 134217728 32768 0 > vn_cache 240 552163 552255 150802432 634861 0 > kmem_bufctl_cache 24 7085442 7085568 172752896 8534442 0 > kmem_va_20480 20480 9939 11568 252706816 14227 0 > zio_buf_512 512 578348 604080 309288960 29468142 0 > dmu_buf_impl_t 328 874033 908160 309985280 21869642 0Same thing for the ''dbuf_cache''.> kmem_alloc_96 96 3879869 3879918 378384384 21906309 0 > zio_buf_65536 65536 7269 7273 476643328 1368534 0 > kmem_va_4096 4096 1178724 1178752 533200896 1485372 0 > dnode_t 640 4641005 4641006 3168260096 8165695 0And the ''dnode_cache''.> zio_buf_131072 131072 25251 25256 3310354432 245308 0 > zio_buf_16384 16384 214643 214667 3517104128 747095 0 > kmem_va_16384 16384 214783 217944 3570794496 375400 0 > ZPool (I''m omit''ing zfs list output because the filesystems are named by > customer numbers):Ooops, i think the ''zpool list'' command got cut out. Mostly curious at what percentage of capacity you''re at.> > [private:/tmp] root# zpool status > pool: joyous > state: ONLINE > scrub: none requested > config: > > NAME STATE READ WRITE CKSUM > joyous ONLINE 0 0 0 > raidz2 ONLINE 0 0 0 > c5t4d0 ONLINE 0 0 0 > c4t4d0 ONLINE 0 0 0 > c7t4d0 ONLINE 0 0 0 > c6t4d0 ONLINE 0 0 0 > c1t4d0 ONLINE 0 0 0 > c0t4d0 ONLINE 0 0 0 > c4t0d0 ONLINE 0 0 0 > c7t0d0 ONLINE 0 0 0 > c6t0d0 ONLINE 0 0 0 > c1t0d0 ONLINE 0 0 0 > c0t0d0 ONLINE 0 0 0 > raidz2 ONLINE 0 0 0 > c5t5d0 ONLINE 0 0 0 > c4t5d0 ONLINE 0 0 0 > c7t5d0 ONLINE 0 0 0 > c6t5d0 ONLINE 0 0 0 > c1t5d0 ONLINE 0 0 0 > c0t5d0 ONLINE 0 0 0 > c4t1d0 ONLINE 0 0 0 > c7t1d0 ONLINE 0 0 0 > c6t1d0 ONLINE 0 0 0 > c1t1d0 ONLINE 0 0 0 > c0t1d0 ONLINE 0 0 0 > raidz2 ONLINE 0 0 0 > c5t6d0 ONLINE 0 0 0 > c4t6d0 ONLINE 0 0 0 > c7t6d0 ONLINE 0 0 0 > c6t6d0 ONLINE 0 0 0 > c1t6d0 ONLINE 0 0 0 > c0t6d0 ONLINE 0 0 0 > c4t2d0 ONLINE 0 0 0 > c7t2d0 ONLINE 0 0 0 > c6t2d0 ONLINE 0 0 0 > c1t2d0 ONLINE 0 0 0 > c0t2d0 ONLINE 0 0 0 > raidz2 ONLINE 0 0 0 > c5t7d0 ONLINE 0 0 0 > c4t7d0 ONLINE 0 0 0 > c7t7d0 ONLINE 0 0 0 > c6t7d0 ONLINE 0 0 0 > c1t7d0 ONLINE 0 0 0 > c0t7d0 ONLINE 0 0 0 > c4t3d0 ONLINE 0 0 0 > c7t3d0 ONLINE 0 0 0 > c6t3d0 ONLINE 0 0 0 > c1t3d0 ONLINE 0 0 0 > c0t3d0 ONLINE 0 0 0 > spares > c5t1d0 AVAIL c5t2d0 AVAIL c5t3d0 > AVAIL > errors: No known data errors > > > ZFS_Sync Dtrace: > > [private:/tmp] root# ./zfs_sync.d > txg time sync-write async-write sync-read > async-read > 0 5206ms. 0/s 0MB/s; 0/s 0MB/s; 10/s 0MB/s; 0/s > 0MB/s > 2234681 281ms. 156/s 0MB/s; 32249/s 29MB/s; 14/s 0MB/s; > 0/s 0MB/s > 2234682 249ms. 176/s 0MB/s; 23216/s 23MB/s; 40/s 2MB/s; > 0/s 0MB/s > 0 5840ms. 0/s 0MB/s; 0/s 0MB/s; 24/s 1MB/s; 0/s > 0MB/s > 0 6239ms. 0/s 0MB/s; 0/s 0MB/s; 67/s 4MB/s; 0/s > 0MB/s > 2234683 427ms. 103/s 0MB/s; 22346/s 24MB/s; 117/s 7MB/s; > 0/s 0MB/s > 0 6729ms. 0/s 0MB/s; 0/s 0MB/s; 6/s 0MB/s; 0/s > 0MB/s > 2234684 228ms. 192/s 0MB/s; 24052/s 23MB/s; 17/s 1MB/s; > 0/s 0MB/s > 0 4770ms. 0/s 0MB/s; 0/s 0MB/s; 7/s 0MB/s; 0/s > 0MB/s > 2234685 212ms. 207/s 0MB/s; 25382/s 20MB/s; 0/s 0MB/s; > 0/s 0MB/s > 0 4787ms. 0/s 0MB/s; 0/s 0MB/s; 31/s 1MB/s; 0/s > 0MB/s > 2234686 215ms. 200/s 0MB/s; 19544/s 17MB/s; 18/s 1MB/s; > 0/s 0MB/s > 2234687 197ms. 223/s 0MB/s; 25253/s 23MB/s; 15/s 0MB/s; > 0/s 0MB/s > 0 8010ms. 0/s 0MB/s; 0/s 0MB/s; 3/s 0MB/s; 0/s > 0MB/s > 0 5611ms. 0/s 0MB/s; 0/s 0MB/s; 4/s 0MB/s; 0/s > 0MB/s > 2234688 255ms. 172/s 0MB/s; 23250/s 21MB/s; 19/s 1MB/s; > 0/s 0MB/s > 0 5357ms. 
0/s 0MB/s; 0/s 0MB/s; 8/s 0MB/s; 0/s > 0MB/s > 2234689 249ms. 176/s 0MB/s; 21678/s 27MB/s; 100/s 6MB/s; > 0/s 0MB/s > 0 6465ms. 0/s 0MB/s; 0/s 0MB/s; 1/s 0MB/s; 0/s > 0MB/s > 2234690 240ms. 183/s 0MB/s; 20162/s 16MB/s; 25/s 1MB/s; > 0/s 0MB/sOk, the txgs look really good. Its taking them about 1/4 of a second to push out all the dirty data. We like txgs to take 5 seconds or under. In summary, at this data collection point everything looked good except dmu_object_alloc(). If you can grab the same lockstat data as before (''lockstat -kgIW -D 30 sleep 60'' and ''lockstat -P -D 30 sleep 60'') at the sametime ''vmstat 1'' shows no idle CPU, that would be great as we could verify. Perhaps the sleep times for the lockstat collection need to be bumped, or a couple successive runs grabbed, as i believe you were seeing spikey CPU usage - so we might have grabbed data in between the spikes. eric> > > And finally, 30 seconds of nicstat: > > Time Int rKB/s wKB/s rPk/s wPk/s rAvs wAvs > %Util Sat > 21:00:58 aggr0/1 130.3 136.2 657.5 629.0 203.0 221.8 > 0.22 0.00 > 21:00:58 e1000g1/0 19.00 17.87 112.6 103.4 172.7 177.0 > 0.03 0.00 > 21:00:58 aggr1 130.3 136.2 657.4 629.0 203.0 221.8 > 0.22 0.00 > 21:00:58 e1000g3/0 51.48 27.27 272.0 156.1 193.8 178.9 > 0.06 0.00 > 21:00:58 e1000g0 0.16 0.47 2.39 3.99 67.50 120.6 > 0.00 0.00 > 21:00:58 e1000g2/0 59.83 91.11 272.8 369.6 224.6 252.4 > 0.12 0.00 > 21:00:58 e1000g0/0 0.16 0.47 2.39 3.99 67.50 120.5 > 0.00 0.00 > Time Int rKB/s wKB/s rPk/s wPk/s rAvs wAvs > %Util Sat > 21:01:08 aggr0/1 1158.0 1213.2 6497.7 5838.5 182.5 212.8 > 1.94 0.00 > 21:01:08 e1000g1/0 506.0 158.4 2851.2 368.6 181.7 440.0 > 0.54 0.00 > 21:01:08 aggr1 1158.0 1213.1 6497.3 5838.0 182.5 212.8 > 1.94 0.00 > 21:01:08 e1000g3/0 567.9 31.65 3296.7 175.4 176.4 184.8 > 0.49 0.00 > 21:01:08 e1000g0 0.12 0.42 1.99 3.49 64.00 123.4 > 0.00 0.00 > 21:01:08 e1000g2/0 83.36 1023.5 345.1 5296.2 247.4 197.9 > 0.91 0.00 > 21:01:08 e1000g0/0 0.12 0.42 1.99 3.49 64.00 123.4 > 0.00 0.00 > Time Int rKB/s wKB/s rPk/s wPk/s rAvs wAvs > %Util Sat > 21:01:18 aggr0/1 762.1 876.4 4133.2 3775.1 188.8 237.7 > 1.34 0.00 > 21:01:18 e1000g1/0 492.2 171.6 2733.4 671.8 184.4 261.6 > 0.54 0.00 > 21:01:18 aggr1 762.1 876.5 4133.6 3775.6 188.8 237.7 > 1.34 0.00 > 21:01:18 e1000g3/0 191.2 25.95 1081.8 141.3 181.0 188.1 > 0.18 0.00 > 21:01:18 e1000g0 0.14 0.41 2.19 3.39 64.00 123.9 > 0.00 0.00 > 21:01:18 e1000g2/0 79.32 678.5 322.3 2960.3 252.0 234.7 > 0.62 0.00 > 21:01:18 e1000g0/0 0.14 0.41 2.19 3.39 64.00 123.9 > 0.00 0.00 > > > > We know that there are things that can be improved in our network and > we''re rushing to fix those things, but more and more I''ve become less > certain that this is a network issue at its heart. > > Again, all the help is greatly appreciated. > > benr. > >
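Catching one of the bursts inside a fixed lockstat window is hit-or-miss, as Eric notes. A complementary approach - a sketch only, not a replacement for the lockstat runs he asked for - is to leave a profile-provider script running across several waves and let it accumulate on-CPU kernel stacks; the sampling rate and run length below are arbitrary choices:

#!/usr/sbin/dtrace -s

profile-997
/arg0/
{
        /* arg0 != 0 means the sampled thread was running in the kernel */
        @stacks[stack()] = count();
}

tick-300s
{
        trunc(@stacks, 20);     /* keep only the 20 hottest stacks */
        exit(0);
}

If the waves are real, zfs_mknode()/dmu_object_alloc() and mutex_vector_enter() should dominate the surviving stacks, just as they do in the profiling-interrupt data above.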
eric kustarz wrote:
> So i'm guessing there's lots of files being created over NFS in one
> particular dataset?
>
> We should figure out how many creates/second you are doing over NFS (i
> should have put a timeout on the script). Here's a real simple one
> (from your snoop it looked like you're only doing NFSv3, so i'm not
> tracking NFSv4):
> "
> #!/usr/sbin/dtrace -s
>
> rfs3_create:entry,
> zfs_create:entry
> {
>         @creates[probefunc] = count();
> }
>
> tick-60s
> {
>         exit(0);
> }
> "

Eric, I love you.

Running this bit of DTrace revealed more than 4,000 files being created in almost any given 60 second window. And I've only got one system that would fit that sort of mass file creation: our Joyent Connector product's Courier IMAP server, which uses Maildir. As a test I simply shut down Courier and unmounted the mail NFS share for good measure, and sure enough the problem vanished and could not be reproduced. 10 minutes later I re-enabled Courier and our problem came back.

Clearly ZFS file creation is just amazingly heavy even with the ZIL disabled. If creating 4,000 files in a minute squashes 4 2.6GHz Opteron cores we're in big trouble in the longer term. In the meantime I'm going to find a new home for our IMAP mail so that the other things served from that NFS server at least aren't affected.

You asked for the zpool and zfs info, which I don't want to share because it's confidential (if you want it privately I'll do so, but not on a public list), but I will say that it's a single massive zpool in which we're using less than 2% of the capacity. But in thinking about this problem, even if we used 2 or more pools, the CPU consumption still would have choked the system, right? This leaves me really nervous about what we'll do when it's not an internal mail server that's creating all those files but a customer.

Oddly enough, this might be a very good reason to use iSCSI instead of NFS on the Thumper.

Eric, I owe you a couple cases of beer for sure. I can't tell you how much I appreciate your help. Thanks to everyone else who chimed in with ideas and suggestions, all of you guys are the best!

benr.
On Fri, Ben Rockwood wrote:> eric kustarz wrote: > >So i''m guessing there''s lots of files being created over NFS in one > >particular dataset? > > > >We should figure out how many creates/second you are doing over NFS (i > >should have put a timeout on the script). Here''s a real simple one > >(from your snoop it looked like you''re only doing NFSv3, so i''m not > >tracking NFSv4): > >" > >#!/usr/sbin/dtrace -s > > > >rfs3_create:entry, > >zfs_create:entry > >{ > > @creates[probefunc] = count(); > >} > > > >tick-60s > >{ > > exit(0); > >} > >" > > > Eric, I love you. > > Running this bit of DTrace reveled more than 4,000 files being created > in almost any given 60 second window. And I''ve only got one system that > would fit that sort of mass file creation: our Joyent Connector products > Courier IMAP server which uses Maildir. As a test I simply shutdown > Courier and unmounted the mail NFS share for good measure and sure > enough the problem vanished and could not be reproduced. 10 minutes > later I re-enabled Courier and our problem came back. > > Clearly ZFS file creation is just amazingly heavy even with ZIL > disabled. If creating 4,000 files in a minute squashes 4 2.6Ghz Opteron > cores we''re in big trouble in the longer term. In the meantime I''m > going to find a new home for our IMAP Mail so that the other things > served from that NFS server at least aren''t effected. > > You asked for the zpool and zfs info, which I don''t want to share > because its confidential (if you want it privately I''ll do so, but not > on a public list), but I will say that its a single massive Zpool in > which we''re using less than 2% of the capacity. But in thinking about > this problem, even if we used 2 or more pools, the CPU consumption still > would have choked the system, right? This leaves me really nervous > about what we''ll do when its not an internal mail server thats creating > all those files but a customer. > > Oddly enough, this might be a very good reason to use iSCSI instead of > NFS on the Thumper. > > Eric, I owe you a couple cases of beer for sure. I can''t tell you how > much I appreciate your help. Thanks to everyone else who chimed in with > ideas and suggestions, all of you guys are the best!Good to hear that you have figured out what is happening, Ben. For future reference, there are two commands that you may want to make use of in observing the behavior of the NFS server and individual filesystems. There is the trusty, nfsstat command. In this case, you would have been able to do something like: nfsstat -s -v3 60 This will provide all of the server side NFSv3 statistics on 60 second intervals. Then there is a new command fsstat that will provide vnode level activity on a per filesystem basis. Therefore, if the NFS server has multiple filesystems active and you want ot look at just one something like this can be helpful: fsstat /export/foo 60 Fsstat has a ''full'' option that will list all of the vnode operations or just certain types. It also will watch a filesystem type (e.g. zfs, nfs). Very useful. Spencer
On Fri, Dec 08, 2006 at 12:15:27AM -0800, Ben Rockwood wrote:
> Clearly ZFS file creation is just amazingly heavy even with the ZIL
> disabled. If creating 4,000 files in a minute squashes 4 2.6GHz Opteron
> cores we're in big trouble in the longer term. In the meantime I'm
> going to find a new home for our IMAP mail so that the other things
> served from that NFS server at least aren't affected.

For local tests, this is not true of ZFS. It seems that file creation only swamps us when coming over NFS. We can do thousands of files a second on a Thumper with room to spare if NFS isn't involved.

Next step is to figure out why NFS kills us.


--Bill
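One way to start on that question - a sketch only, assuming local and NFS-driven creates both funnel through zfs_create() and that the NFS server's service threads show up with execname nfsd, as they do in the prstat output earlier in the thread - is to compare create latency by caller:

#!/usr/sbin/dtrace -s

zfs_create:entry
{
        self->ts = timestamp;
}

zfs_create:return
/self->ts/
{
        /* nfsd vs. local tar/touch/etc. should separate cleanly here */
        @lat[execname] = quantize(timestamp - self->ts);
        self->ts = 0;
}

tick-60s
{
        exit(0);
}

If the nfsd histogram sits far to the right of a local mass-create run (a tar extract, say), the extra cost is in what the NFS path adds per create rather than in ZFS itself.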
Spencer Shepler wrote:
> Good to hear that you have figured out what is happening, Ben.
>
> For future reference, there are two commands that you may want to
> make use of in observing the behavior of the NFS server and individual
> filesystems.
>
> There is the trusty nfsstat command. In this case, you would have been
> able to do something like:
>       nfsstat -s -v3 60
>
> This will provide all of the server side NFSv3 statistics on 60 second
> intervals.
>
> Then there is a new command, fsstat, that will provide vnode level
> activity on a per filesystem basis. Therefore, if the NFS server
> has multiple filesystems active and you want to look at just one,
> something like this can be helpful:
>
>       fsstat /export/foo 60
>
> Fsstat has a 'full' option that will list all of the vnode operations
> or just certain types. It also will watch a filesystem type (e.g. zfs, nfs).
> Very useful.

nfsstat I've been using, but fsstat I was unaware of. Wish I'd used it rather than duplicating most of its functionality with a D script. :)

Thanks for the tip.

benr.
Bill Moore wrote:
> On Fri, Dec 08, 2006 at 12:15:27AM -0800, Ben Rockwood wrote:
>
>> Clearly ZFS file creation is just amazingly heavy even with the ZIL
>> disabled. If creating 4,000 files in a minute squashes 4 2.6GHz Opteron
>> cores we're in big trouble in the longer term. In the meantime I'm
>> going to find a new home for our IMAP mail so that the other things
>> served from that NFS server at least aren't affected.
>
> For local tests, this is not true of ZFS. It seems that file creation
> only swamps us when coming over NFS. We can do thousands of files a
> second on a Thumper with room to spare if NFS isn't involved.
>
> Next step is to figure out why NFS kills us.

Agreed. If mass file creation were a problem locally I'd think that we'd have people beating down the doors with complaints.

One thought I had as a workaround was to move all my mail on NFS to an iSCSI LUN and then put a zpool on that. I'm willing to bet that'd work fine. Hopefully I can try it.

To round out the discussion, the root cause of this whole mess was Courier IMAP locking. After isolating the problem last night and writing a little D script to find out what files were being created, it was obviously lock files; turning off locking dropped file creations to a reasonable level and our problem vanished.

If I can help at all with testing or analysis please let me know.

benr.
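Ben's script isn't included in the thread; a minimal sketch of that kind of "what is being created" tracer - assuming the second argument to zfs_create() is still the component name, as it is in the OpenSolaris source of this era - might be:

#!/usr/sbin/dtrace -s

zfs_create:entry
{
        /* arg1 is assumed to be the component name being created */
        @names[stringof((char *)arg1)] = count();
}

tick-60s
{
        trunc(@names, 30);      /* top 30 names is plenty to spot lock-file churn */
        exit(0);
}

A minute dominated by Maildir-style dot-lock names would confirm the Courier locking diagnosis.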
Ben Rockwood wrote:
> Bill Moore wrote:
>
>> On Fri, Dec 08, 2006 at 12:15:27AM -0800, Ben Rockwood wrote:
>>
>>> Clearly ZFS file creation is just amazingly heavy even with the ZIL
>>> disabled. If creating 4,000 files in a minute squashes 4 2.6GHz
>>> Opteron cores we're in big trouble in the longer term. In the
>>> meantime I'm going to find a new home for our IMAP mail so that the
>>> other things served from that NFS server at least aren't affected.
>>
>> For local tests, this is not true of ZFS. It seems that file creation
>> only swamps us when coming over NFS. We can do thousands of files a
>> second on a Thumper with room to spare if NFS isn't involved.
>>
>> Next step is to figure out why NFS kills us.
>
> Agreed. If mass file creation were a problem locally I'd think that we'd
> have people beating down the doors with complaints.
>
> One thought I had as a workaround was to move all my mail on NFS to an
> iSCSI LUN and then put a zpool on that. I'm willing to bet that'd work
> fine. Hopefully I can try it.

Could be NFS synchronous semantics on file create (followed by repeated flushing of the write cache). What kind of storage are you using (feel free to send privately if you need to) - is it a thumper?

> To round out the discussion, the root cause of this whole mess was
> Courier IMAP locking. After isolating the problem last night and
> writing a little D script to find out what files were being created, it
> was obviously lock files; turning off locking dropped file creations to
> a reasonable level and our problem vanished.

Gotta love that D.

> If I can help at all with testing or analysis please let me know.

I want to create a test case in-house first and see if it's reproducible.

eric

ps: i like lots of different types of beers
> Could be NFS synchronous semantics on file create (followed by
> repeated flushing of the write cache). What kind of storage are you
> using (feel free to send privately if you need to) - is it a thumper?

It's not clear why NFS-enforced synchronous semantics would induce different behavior than the same load to a local ZFS.

File creates are metadata intensive, right? And these operations need to be synchronous to guarantee file system consistency (yes, I am familiar with the ZFS COW model).

Anyway... I'm feeling rather naive here, but I've seen the "NFS enforced synchronous semantics" phrase kicked around many times as the explanation for suboptimal performance for metadata-intensive operations when ZFS is the underlying file system, but I never really understood what's "unsynchronous" about doing the same thing to a local ZFS.

And yes, there is certainly a network latency component to the NFS configuration, so for any synchronous operation, I would expect things to be slower when done over NFS.

Awaiting enlightenment....

:^)

/jim
Jim Mauro wrote:
>> Could be NFS synchronous semantics on file create (followed by
>> repeated flushing of the write cache). What kind of storage are you
>> using (feel free to send privately if you need to) - is it a thumper?
>
> It's not clear why NFS-enforced synchronous semantics would induce
> different behavior than the same load to a local ZFS.

Actually i forgot he had 'zil_disable' turned on, so it won't matter in this case.

The big difference is that the NFS server does a VOP_FSYNC on each file create as it needs to guarantee that the changes are on stable storage. Locally, we don't need that same guarantee, so you don't have to wait for a disk I/O to complete on each create.

eric

> File creates are metadata intensive, right? And these operations need to
> be synchronous to guarantee file system consistency (yes, I am familiar
> with the ZFS COW model).
>
> Anyway... I'm feeling rather naive here, but I've seen the "NFS
> enforced synchronous semantics" phrase kicked around many times as the
> explanation for suboptimal performance for metadata-intensive
> operations when ZFS is the underlying file system, but I never really
> understood what's "unsynchronous" about doing the same thing to a
> local ZFS.
>
> And yes, there is certainly a network latency component to the NFS
> configuration, so for any synchronous operation, I would expect things
> to be slower when done over NFS.
>
> Awaiting enlightenment....
>
> :^)
>
> /jim
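To see that per-create sync directly - and to check whether the ~29,000 zfs_fsync calls in the earlier zvop_times output really come from the NFS create path - something like the following sketch (fbt probe availability on this build assumed) aggregates the kernel stacks issuing the fsyncs:

#!/usr/sbin/dtrace -s

zfs_fsync:entry
{
        /* attribute each fsync to the kernel code path that issued it */
        @callers[stack(6)] = count();
}

tick-60s
{
        trunc(@callers, 10);
        exit(0);
}

If rfs3_create and the common NFS dispatch path dominate the stacks, that is the NFS-only cost being discussed; with zil_disable set those calls should at least be cheap, which matches the low zfs_fsync average time in the zvop_times data.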
On Dec 9, 2006, at 8:59, Jim Mauro wrote:
> Anyway... I'm feeling rather naive here, but I've seen the "NFS
> enforced synchronous semantics" phrase kicked around many times as the
> explanation for suboptimal performance for metadata-intensive
> operations when ZFS is the underlying file system, but I never really
> understood what's "unsynchronous" about doing the same thing to a
> local ZFS.

If I remember correctly, the difference is that NFS requires that the operation be committed to stable storage before the return to the client. This is definitely a heavier operation than the local case, where the return to the caller may happen as soon as the operation is cached. If there's a crash, the local system does not guarantee to the caller that the operation is on disk, but NFS does.

Both guarantee consistency, but NFS makes stronger guarantees of completeness.

--Ed
Casper.Dik at Sun.COM
2006-Dec-09 19:57 UTC
[zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
>On Dec 9, 2006, at 8:59, Jim Mauro wrote:
>> Anyway... I'm feeling rather naive here, but I've seen the "NFS
>> enforced synchronous semantics" phrase kicked around many times as the
>> explanation for suboptimal performance for metadata-intensive
>> operations when ZFS is the underlying file system, but I never really
>> understood what's "unsynchronous" about doing the same thing to a
>> local ZFS
>
>If I remember correctly, the difference is that NFS requires that the
>operation be committed to stable storage before the return to the
>client. This is definitely a heavier operation than the local case,
>where the return to the caller may happen as soon as the operation is
>cached. If there's a crash, the local system does not guarantee to
>the caller that the operation is on disk, but NFS does.
>
>Both guarantee consistency but NFS makes stronger guarantees of
>completeness.

The current way to solve this from the NFS server's perspective is to call something equivalent to "sync()". It was my understanding that disabling the ZIL would make such operations more or less "asynchronous", but it still seems that NFS is doing a sync()-like thing.

Perhaps there should be a different mechanism for making VOP_* calls: calls which do not return until the specific call was committed to stable storage. (Interested parties get notified when such events happen.)

Unfortunately, that would cause the NFS client system calls to sleep until the calls are completed (as they do now), but there doesn't seem to be a need for that; writes currently need a specific commit and it would perhaps be worthwhile to investigate more asynchronous behaviour. As long as the client remembers the non-committed calls, there is no "data loss" issue; when the client crashes, you expect data loss; when the server crashes, the client resends.

Casper
Robert Milkowski
2006-Dec-11 00:33 UTC
[nfs-discuss] Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Hello eric,

Saturday, December 9, 2006, 7:07:49 PM, you wrote:

ek> Jim Mauro wrote:
>>> Could be NFS synchronous semantics on file create (followed by
>>> repeated flushing of the write cache). What kind of storage are you
>>> using (feel free to send privately if you need to) - is it a thumper?
>>
>> It's not clear why NFS-enforced synchronous semantics would induce
>> different behavior than the same load to a local ZFS.

ek> Actually i forgot he had 'zil_disable' turned on, so it won't matter in
ek> this case.

Ben, are you sure zil_disable was set to 1 BEFORE the pool was imported?

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
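A quick way to check what the running kernel actually has - a sketch, assuming the tunable is the zfs module's global zil_disable variable that "set zfs:zil_disable = 1" in /etc/system sets - is to read the variable directly:

#!/usr/sbin/dtrace -s

BEGIN
{
        /* zfs`zil_disable is assumed to be the module global the tunable sets */
        printf("zil_disable is currently %d\n", zfs`zil_disable);
        exit(0);
}

A value of 1 only confirms the current setting; whether it was in place before the pool was imported, which is what Robert is asking, still has to come from the /etc/system and reboot history.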
Ben Rockwood
2006-Dec-11 20:34 UTC
[nfs-discuss] Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Robert Milkowski wrote:
> Hello eric,
>
> Saturday, December 9, 2006, 7:07:49 PM, you wrote:
>
> ek> Jim Mauro wrote:
>
>>>> Could be NFS synchronous semantics on file create (followed by
>>>> repeated flushing of the write cache). What kind of storage are you
>>>> using (feel free to send privately if you need to) - is it a thumper?
>>>
>>> It's not clear why NFS-enforced synchronous semantics would induce
>>> different behavior than the same load to a local ZFS.
>
> ek> Actually i forgot he had 'zil_disable' turned on, so it won't matter in
> ek> this case.
>
> Ben, are you sure zil_disable was set to 1 BEFORE the pool was imported?

Yes, absolutely. Set the var in /etc/system, reboot, system comes up. That happened almost 2 months ago, long before this lock insanity problem popped up.

To be clear, the ZIL issue was a problem for creation of a handful of files of any size. Untar'ing a file was a massive performance drain. This issue, on the other hand, deals with thousands of little files being created all the time (IMAP locks). These are separate issues from my point of view. With ZIL slowness NFS performance was just slow, but we didn't see massive CPU usage; with this issue, on the other hand, we were seeing waves in 10-second-ish cycles where the run queue would go sky high with 0 idle. Please see the earlier mails for examples of the symptoms.

benr.
Robert Milkowski
2006-Dec-12 01:56 UTC
[nfs-discuss] Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Hello Ben, Monday, December 11, 2006, 9:34:18 PM, you wrote: BR> Robert Milkowski wrote:>> Hello eric, >> >> Saturday, December 9, 2006, 7:07:49 PM, you wrote: >> >> ek> Jim Mauro wrote: >> >>>>> Could be NFS synchronous semantics on file create (followed by >>>>> repeated flushing of the write cache). What kind of storage are you >>>>> using (feel free to send privately if you need to) - is it a thumper? >>>>> >>>> It''s not clear why NFS-enforced synchronous semantics would induce >>>> different behavior than the same >>>> load to a local ZFS. >>>> >> >> ek> Actually i forgot he had ''zil_disable'' turned on, so it won''t matter in >> ek> this case. >> >> >> Ben, are you sure zil_disable was set to 1 BEFORE pool was imported? >>BR> Yes, absolutely. Set var in /etc/system, reboot, system come up. That BR> happened almost 2 months ago, long before this lock insanity problem BR> popped up. BR> To be clear, the ZIL issue was a problem for creation of a handful of BR> files of any size. Untar''ing a file was a massive performance drain. BR> This issue, other the other hand, deals with thousands of little files BR> being created all the time (IMAP Locks). These are separate issues from BR> my point of view. With ZIL slowness NFS performance was just slow but BR> we didn''t see massive CPU usage, with this issue on the other hand we BR> were seeing waves in 10 second-ish cycles where the run queue would go BR> sky high with 0 idle. Please see the earlier mails for examples of the BR> symptoms. Ok. And just another question - is nfs file system mounted with noac options on imap server and application is doing chdir() to nfs directories? I''m not sure if it''s fixed - if not your nfs client on every chdir() will generate lot of small traffic to nfs server starving it of cpu. -- Best regards, Robert mailto:rmilkowski at task.gda.pl http://milek.blogspot.com
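If noac-driven attribute chatter were the culprit, it should be visible server-side as a flood of getattr/access calls tracking the CPU waves. A small sketch (assuming the NFSv3 server entry points keep the rfs3_* names already used in this thread) that prints per-interval call counts:

#!/usr/sbin/dtrace -s

rfs3_getattr:entry,
rfs3_access:entry,
rfs3_lookup:entry,
rfs3_create:entry
{
        @calls[probefunc] = count();
}

tick-10s
{
        printa(@calls);
        trunc(@calls);          /* reset for the next interval */
}

nfsstat -s already exposes the same counters (getattr was 44% of calls in Ben's output); this just makes it easy to line the rate up against the 10-second waves.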
Richard Elling
2006-Dec-12 05:15 UTC
[nfs-discuss] Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
BR> Yes, absolutely. Set the var in /etc/system, reboot, system comes up. That
BR> happened almost 2 months ago, long before this lock insanity problem
BR> popped up.

For the archives, a high level of lock activity can always be a problem. The worst cases I've experienced were with record locking over NFS. The worst offenders were programs written with a local file system in mind, especially PC-based applications. This has become one of the first things I check in such instances, and the network traffic will be consistently full of lock activity. Just one of those things you need to watch out for in a networked world :-)

 -- richard
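In this thread the "locks" turned out to be dot-lock file creates rather than NLM record locks, but both kinds are easy to spot from the server side. A hedged sketch, using only VOP-level function names that already appear in the zvop_times output earlier in the thread:

#!/usr/sbin/dtrace -s

/*
 * zfs_frlock            - record/advisory locking reaching ZFS
 * zfs_create/zfs_remove - dot-lock style lock files show up here
 */
zfs_frlock:entry,
zfs_create:entry,
zfs_remove:entry
{
        @ops[probefunc, execname] = count();
}

tick-60s
{
        exit(0);
}

A window where zfs_create/zfs_remove counts for nfsd dwarf everything else looks like dot-lock churn; a pile of zfs_frlock calls points at record locking instead.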