Aubrey Li
2012-Mar-25 05:29 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
Hi,

I'm migrating a webserver (apache+php) from RHEL to Solaris. During the
stress testing comparison, I found that under the same number of client
sessions, CPU% is ~70% on RHEL while CPU% is full on Solaris.

After some investigation, the zfs root lock emerges as a major suspect.

Firstly, there are a bunch of mutex events.
===========================================
# lockstat -D 10 sleep 2

Adaptive mutex spin: 1862701 events in 3.678 seconds (506499 events/sec)

Count indv cuml rcnt     nsec Lock                       Caller
-------------------------------------------------------------------------------
829064  45%  45% 0.00   33280 0xffffff117624a5d0         rrw_enter_read+0x1b
705001  38%  82% 0.00   30983 0xffffff117624a5d0         rrw_exit+0x1d
140678   8%  90% 0.00   10546 0xffffff117624a6e0         zfs_zget+0x46
 37208   2%  92% 0.00    5403 0xffffff114b136840         vn_rele+0x1e
 33926   2%  94% 0.00    5417 0xffffff114b136840         lookuppnatcred+0xc5
 27188   1%  95% 0.00    1155 vn_vfslocks_buckets+0xd980 vn_vfslocks_getlock+0x3b
 11073   1%  96% 0.00    1639 vn_vfslocks_buckets+0x4600 vn_vfslocks_getlock+0x3b
  9321   1%  96% 0.00    1961 0xffffff114b82a680         dnlc_lookup+0x83
  6929   0%  97% 0.00    1590 0xffffff11573b8f28         zfs_fastaccesschk_execute+0x6a
  5746   0%  97% 0.00    5935 0xffffff114b136840         lookuppnvp+0x566
-------------------------------------------------------------------------------

If we look at the lock caller's stacks, the following two are the hotspots.
=============================================================
# lockstat -D 10 -l 0xffffff117624a5d0 -s 15 sleep 2

Adaptive mutex spin: 865609 events in 3.237 seconds (267444 events/sec)
-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Lock                       Caller
51215   6%   6% 0.00   100023 0xffffff117624a5d0         rrw_exit+0x1d

  nsec ------ Time Distribution ------ count     Stack
   256 |                               61        zfs_root+0x78
   512 |@                              1746      fsop_root+0x2d
  1024 |@@@                            5210      traverse+0x65
  2048 |@@@@@@@                        13464     kkkk+0x446
  4096 |@@@@@@@                        13553     lookuppnatcred+0x119
  8192 |@@@                            5355      lookupnameatcred+0x97
 16384 |@                              2403      lookupnameat+0x6b
 32768 |                               1331      cstatat_getvp+0x11e
 65536 |                               673       cstatat64_32+0x5d
131072 |                               961       fstatat64_32+0x4c
262144 |                               1259      _sys_sysenter_post_swapgs+0x149
-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Lock                       Caller
48494   6%  17% 0.00   145263 0xffffff117624a5d0         rrw_enter_read+0x1b

  nsec ------ Time Distribution ------ count     Stack
   256 |                               17        rrw_enter+0x2c
   512 |                               1120      zfs_root+0x3b
  1024 |@                              1718      fsop_root+0x2d
  2048 |@@                             4834      traverse+0x65
  4096 |@@@@@@@@@@@                    18569     lookuppnvp+0x446
  8192 |@@@@                           6620      lookuppnatcred+0x119
 16384 |@                              2929      lookupnameatcred+0x97
 32768 |@                              1635      lookupnameat+0x6b
 65536 |                               894       cstatat_getvp+0x11e
131072 |                               1249      cstatat64_32+0x5d
262144 |@                              1620      fstatat64_32+0x4c
524288 |@                              2474      _sys_sysenter_post_swapgs+0x149
===========================================================

The apache document root is apparently in a zfs dataset. It looks like each
file system operation runs into zfs root mutex contention.

Is this expected behavior? If so, is there any zfs tunable to
address this issue?

Thanks,
-Aubrey
Richard Elling
2012-Mar-25 07:55 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mar 24, 2012, at 10:29 PM, Aubrey Li wrote:

> Hi,
>
> I'm migrating a webserver (apache+php) from RHEL to Solaris. During the
> stress testing comparison, I found that under the same number of client
> sessions, CPU% is ~70% on RHEL while CPU% is full on Solaris.
>
> After some investigation, the zfs root lock emerges as a major suspect.

This data does not seem major, less than 1% of total walltime.
What is the ratio of user/sys?
 -- richard

> Firstly, there are a bunch of mutex events.
> [...]
>
> The apache document root is apparently in a zfs dataset. It looks like each
> file system operation runs into zfs root mutex contention.
>
> Is this expected behavior? If so, is there any zfs tunable to
> address this issue?
>
> Thanks,
> -Aubrey

-- 
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Aubrey Li
2012-Mar-25 13:24 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Sun, Mar 25, 2012 at 3:55 PM, Richard Elling
<richard.elling at richardelling.com> wrote:
> On Mar 24, 2012, at 10:29 PM, Aubrey Li wrote:
>
> Hi,
>
> I'm migrating a webserver (apache+php) from RHEL to Solaris. During the
> stress testing comparison, I found that under the same number of client
> sessions, CPU% is ~70% on RHEL while CPU% is full on Solaris.
>
> After some investigation, the zfs root lock emerges as a major suspect.
>
> This data does not seem major, less than 1% of total walltime.
> What is the ratio of user/sys?
>  -- richard

Here is the vmstat output.
==========================================================
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s1 s1 s2 s2   in   sy   cs us sy id
 37 0 0 25257648 34744756 26671 40446 0 0 0 0 0 0 0 258 0 54473 833569 100171 67 31 2
 16 0 0 25270872 34745544 26416 40427 0 0 0 0 0 0 0 161 0 53804 850816 100016 68 29 3
 80 0 0 25266716 34731880 27082 41500 0 0 0 0 0 0 0 145 0 56908 901578 98194 70 28 1
 81 0 0 25318348 34763484 26436 40384 0 0 0 0 0 0 0 173 0 59977 1014699 99273 66 33 0
 104 0 0 25359888 34787868 25539 39726 0 0 0 0 0 0 0 122 0 58175 989927 93106 68 32 0
 113 0 0 25359208 34779872 25746 39992 0 0 0 0 0 0 0 161 0 57212 916427 94401 67 33 0
 123 0 0 25409656 34809704 26228 41330 0 0 0 0 0 0 0 176 0 59332 890511 97127 69 31 0
 97 0 0 25419628 34807796 26508 42013 0 0 0 0 0 0 0 148 0 59736 864022 98918 71 29 0
 111 0 0 25468728 34834968 26477 40232 0 0 0 0 0 0 0 159 0 59270 867006 98308 67 33 0
 105 0 0 25501972 34860952 26776 42204 0 0 0 0 0 0 0 163 0 70868 867899 97981 71 29 0
 78 0 0 25526492 34865416 26489 43504 0 0 0 0 0 0 0 157 0 61110 952207 99806 67 33 0
=================================================================

and "mpstat -a" matches the above measure. ~200,000 spins on mutex.
=================================================================
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 35140   0 2380 59742 19476 93056 30906 32919 256336 1104 967806  65  35   0   0  32
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 36044   0 3268 61910 19482 96287 32121 32844 245293 1122 954150  64  36   0   0  32
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 36516   0 3900 61459 19478 95502 31256 35345 221473 1106 932210  66  34   0   0  32
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 35791   0 3544 60751 19371 97982 31158 33920 209179 3115 859644  64  36   0   0  32
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 38317   0 4892 63046 19566 97316 31688 38100 220262 2058 915555  65  35   0   0  32
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 36408   0 3154 60766 19393 94343 31244 35582 248314  811 980967  65  35   0   0  32
==================================================================

Thanks,
-Aubrey

> Firstly, there are a bunch of mutex events.
> [...]
Richard Elling
2012-Mar-25 16:48 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
This is the wrong forum for general purpose performance tuning. So I won't
continue this much farther. Notice the huge number of icsw, that is a bigger
symptom than locks.
 -- richard

On Mar 25, 2012, at 6:24 AM, Aubrey Li wrote:

> SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
>   0 35140   0 2380 59742 19476 93056 30906 32919 256336 1104 967806  65  35   0   0  32
> [...]

-- 
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Aubrey Li
2012-Mar-25 17:25 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 12:48 AM, Richard Elling
<richard.elling at richardelling.com> wrote:
> This is the wrong forum for general purpose performance tuning. So I won't
> continue this much farther. Notice the huge number of icsw, that is a bigger
> symptom than locks.
>  -- richard

Thanks anyway, but the lock must be a problem. The scenario here is: apache
issues a bunch of stat() syscalls, which leads to a bunch of zfs vnode
accesses, beyond the normal read/write operations.

The problem is that every zfs vnode access needs the **same zfs root**
lock. When the number of httpd processes and the corresponding kernel
threads becomes large, this root lock contention becomes horrible. This
situation does not occur on linux.

Let me see if any others have a clue for me. Any suggestions will be
highly appreciated!

Regards,
-Aubrey

> On Mar 25, 2012, at 6:24 AM, Aubrey Li wrote:
>
> [...]
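For what it's worth, a quick way to confirm how much of the load really is
stat()-family traffic, and which processes issue it, is a short syscall
count along these lines (a rough sketch; the wildcard just matches the
stat-family syscall names on this release):

# dtrace -n 'syscall::*stat*:entry { @[probefunc, execname] = count(); }
  tick-10sec { trunc(@, 20); exit(0); }'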
zfs user
2012-Mar-25 18:10 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On 3/25/12 10:25 AM, Aubrey Li wrote:> On Mon, Mar 26, 2012 at 12:48 AM, Richard Elling > <richard.elling at richardelling.com> wrote: >> This is the wrong forum for general purpose performance tuning. So I won''t >> continue this much farther. Notice the huge number of icsw, that is a >> bigger >> symptom than locks. >> -- richard > > thanks anyway, lock must be a problem. the scenario here is, apache > causes a bunch of stat() > syscall, which leads to a bunch of zfs vnode access, byond the normal > read/write operation. > > The problem is, every zfs vnode access need the **same zfs root** > lock. When the number of > httpd processes and the corresponding kernel threads becomes large, > this root lock contention > becomes horrible. This situation does not occurs on linux. > > Let me see if any others have a clue for me. Any suggestions will be > highly appreciated! > > Regards, > -AubreyIs noatime turned on for all your zfs filesystems? Have you tried the latest version of httpd?
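If it helps: atime can be checked and toggled per dataset, e.g. (the dataset
name below is just a placeholder):

# zfs get atime export1/webserver
# zfs set atime=off export1/webserver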
Richard Elling
2012-Mar-25 18:58 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mar 25, 2012, at 10:25 AM, Aubrey Li wrote:

> Thanks anyway, but the lock must be a problem. The scenario here is: apache
> issues a bunch of stat() syscalls, which leads to a bunch of zfs vnode
> accesses, beyond the normal read/write operations.

That does not explain the high icsw. In a properly sized system icsw will be
on the order of 0 to 100, closer to 0.

If there are millions of files, then you should check the DNLC hit rate.

> The problem is that every zfs vnode access needs the **same zfs root**
> lock. When the number of httpd processes and the corresponding kernel
> threads becomes large, this root lock contention becomes horrible. This
> situation does not occur on linux.

I disagree with your conclusion, and I've seen ZFS systems do millions of
stats() per second without issue. What does prstat -Lm show?
 -- richard

> Let me see if any others have a clue for me. Any suggestions will be
> highly appreciated!
>
> Regards,
> -Aubrey
>
> [...]

-- 
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
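For reference, the DNLC hit rate can be read from the dnlcstats kstat or
from vmstat -s, e.g. (standard Solaris commands; exact field names may vary
a little by release):

# kstat -n dnlcstats | egrep 'hits|misses'
# vmstat -s | grep 'name lookups'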
Aubrey Li
2012-Mar-25 19:03 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 2:10 AM, zfs user <zfsml at itsbeen.sent.com> wrote:
> On 3/25/12 10:25 AM, Aubrey Li wrote:
>>
>> The problem is that every zfs vnode access needs the **same zfs root**
>> lock. When the number of httpd processes and the corresponding kernel
>> threads becomes large, this root lock contention becomes horrible. This
>> situation does not occur on linux.
>>
>> Let me see if any others have a clue for me. Any suggestions will be
>> highly appreciated!
>>
>> Regards,
>> -Aubrey
>
> Is noatime turned on for all your zfs filesystems?
> Have you tried the latest version of httpd?

Yeah, atime is turned off. It doesn't help reduce the zfs vnode accesses.

I didn't try the latest version of httpd. I want to get the same performance
with the same userland software stack as on Linux.

Thanks,
-Aubrey
Aubrey Li
2012-Mar-25 19:13 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 2:58 AM, Richard Elling
<richard.elling at richardelling.com> wrote:
> That does not explain the high icsw. In a properly sized system icsw will be
> on the order of 0 to 100, closer to 0.
>
> If there are millions of files, then you should check the DNLC hit rate.

Involuntary context switches might be another story I need to take care of.
But I'm not aware that it's controllable.

> I disagree with your conclusion, and I've seen ZFS systems do millions of
> stats() per second without issue. What does prstat -Lm show?
>  -- richard

I had never seen any issues until I did a comparison with Linux.

Thanks,
-Aubrey
Fajar A. Nugraha
2012-Mar-25 19:22 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 2:13 AM, Aubrey Li <aubreylee at gmail.com> wrote:
>> The problem is that every zfs vnode access needs the **same zfs root**
>> lock. When the number of httpd processes and the corresponding kernel
>> threads becomes large, this root lock contention becomes horrible. This
>> situation does not occur on linux.
>>
>> I disagree with your conclusion, and I've seen ZFS systems do millions of
>> stats() per second without issue. What does prstat -Lm show?
>>  -- richard
>
> I had never seen any issues until I did a comparison with Linux.

So basically you're comparing linux + ext3/4 performance with solaris + zfs,
on the same hardware? That's not really fair, is it?

If your load is I/O-intensive enough, ext3/4 will easily win since it doesn't
have to do things like calculating checksums.

Now if you compare it to... say... btrfs, or zfsonlinux, it'd be more in the
same ballpark.

-- 
Fajar
Jim Mauro
2012-Mar-25 20:18 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
If you're chasing CPU utilization, specifically %sys (time in the kernel),
I would start with a time-based kernel profile:

# dtrace -n 'profile-997hz /arg0/ { @[stack()] = count(); } tick-60sec { trunc(@, 20); printa(@); }'

I would be curious to see where the CPU cycles are being consumed first,
before going down the lock path.

This assumes that most or all of the CPU utilization is %sys. If it's %usr,
we take a different approach.

Thanks
/jim

On Mar 25, 2012, at 1:29 AM, Aubrey Li wrote:

> Hi,
>
> I'm migrating a webserver (apache+php) from RHEL to Solaris. During the
> stress testing comparison, I found that under the same number of client
> sessions, CPU% is ~70% on RHEL while CPU% is full on Solaris.
>
> After some investigation, the zfs root lock emerges as a major suspect.
>
> Firstly, there are a bunch of mutex events.
> [...]
>
> The apache document root is apparently in a zfs dataset. It looks like each
> file system operation runs into zfs root mutex contention.
>
> Is this expected behavior? If so, is there any zfs tunable to
> address this issue?
>
> Thanks,
> -Aubrey
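A slightly expanded sketch of that profiling one-liner, with an explicit
exit so it stops after one interval (same idea, just quieter output; adjust
the tick interval to taste):

# dtrace -qn '
    profile-997hz /arg0/ { @[stack()] = count(); }    /* sample kernel stacks only */
    tick-60sec { trunc(@, 20); printa(@); exit(0); }  /* print the top 20, then quit */
'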
Aubrey Li
2012-Mar-26 01:42 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 3:22 AM, Fajar A. Nugraha <work at fajar.net> wrote:
>>
>> I had never seen any issues until I did a comparison with Linux.
>
> So basically you're comparing linux + ext3/4 performance with solaris + zfs,
> on the same hardware? That's not really fair, is it?
> If your load is I/O-intensive enough, ext3/4 will easily win since it doesn't
> have to do things like calculating checksums.
>
> Now if you compare it to... say... btrfs, or zfsonlinux, it'd be more in the
> same ballpark.

I'd be more comfortable if I didn't see the lock contention. Are you
indicating that this (every file vnode access acquiring the same root mutex)
is normal? I'd be fine if the CPU were going to checksum calculation.

Thanks,
-Aubrey
Aubrey Li
2012-Mar-26 01:51 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 4:18 AM, Jim Mauro <james.mauro at oracle.com> wrote:
> If you're chasing CPU utilization, specifically %sys (time in the kernel),
> I would start with a time-based kernel profile:
>
> # dtrace -n 'profile-997hz /arg0/ { @[stack()] = count(); } tick-60sec { trunc(@, 20); printa(@); }'
>
> I would be curious to see where the CPU cycles are being consumed first,
> before going down the lock path.
>
> This assumes that most or all of the CPU utilization is %sys. If it's %usr,
> we take a different approach.

Here is the output. I changed to "tick-5sec" and "trunc(@, 5)".
No.2 and No.3 are what I care about.

Thanks,
-Aubrey

 21  80536                       :tick-5sec
========== 1 ====================
    genunix`avl_walk+0x6a
    genunix`as_gap_aligned+0x2b7
    unix`map_addr_proc+0x179
    unix`map_addr+0x8e
    genunix`choose_addr+0x9e
    zfs`zfs_map+0x161
    genunix`fop_map+0xc5
    genunix`smmap_common+0x268
    genunix`smmaplf32+0xa2
    genunix`syscall_ap+0x92
    unix`_sys_sysenter_post_swapgs+0x149
   1427

========= 2 ================
    unix`mutex_delay_default+0x7
    unix`mutex_vector_enter+0x2ae
    zfs`zfs_zget+0x46
    zfs`zfs_root+0x55
    genunix`fsop_root+0x2d
    genunix`traverse+0x65
    genunix`lookuppnvp+0x446
    genunix`lookuppnatcred+0x119
    genunix`lookupnameatcred+0x97
    genunix`lookupnameat+0x6b
    genunix`vn_openat+0x147
    genunix`copen+0x493
    genunix`openat64+0x2d
    unix`_sys_sysenter_post_swapgs+0x149
   2645

========== 3 ====================
    unix`mutex_delay_default+0x7
    unix`mutex_vector_enter+0x2ae
    zfs`zfs_zget+0x46
    zfs`zfs_root+0x55
    genunix`fsop_root+0x2d
    genunix`traverse+0x65
    genunix`lookuppnvp+0x446
    genunix`lookuppnatcred+0x119
    genunix`lookupnameatcred+0x97
    genunix`lookupnameat+0x6b
    genunix`cstatat_getvp+0x11e
    genunix`cstatat64_32+0x5d
    genunix`fstatat64_32+0x4c
    unix`_sys_sysenter_post_swapgs+0x149
   3201

======== 4 ==========
    unix`i86_mwait+0xd
    unix`cpu_idle_mwait+0x154
    unix`idle+0x116
    unix`thread_start+0x8
   3559

========= 5 =============
    tmpfs`tmp_readdir+0x138
    genunix`fop_readdir+0xe8
    genunix`getdents64+0xd5
    unix`_sys_sysenter_post_swapgs+0x149
   4589

========= 6 ===============
    unix`strlen+0x3
    genunix`fop_readdir+0xe8
    genunix`getdents64+0xd5
    unix`_sys_sysenter_post_swapgs+0x149
   5005

=========== 7 ================
    tmpfs`tmp_readdir+0xc7
    genunix`fop_readdir+0xe8
    genunix`getdents64+0xd5
    unix`_sys_sysenter_post_swapgs+0x149
   9548

========= 8 ======================
    unix`strlen+0x8
    genunix`fop_readdir+0xe8
    genunix`getdents64+0xd5
    unix`_sys_sysenter_post_swapgs+0x149
  11166

============= 9 ==================
    unix`strlen+0xe
    genunix`fop_readdir+0xe8
    genunix`getdents64+0xd5
    unix`_sys_sysenter_post_swapgs+0x149
  14491

============= 10 =================
    tmpfs`tmp_readdir+0xbe
    genunix`fop_readdir+0xe8
    genunix`getdents64+0xd5
    unix`_sys_sysenter_post_swapgs+0x149
  39540
Richard Elling
2012-Mar-26 03:34 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mar 25, 2012, at 6:51 PM, Aubrey Li wrote:

> Here is the output. I changed to "tick-5sec" and "trunc(@, 5)".
> No.2 and No.3 are what I care about.
>
> [...]
>
> ============= 10 =================
>     tmpfs`tmp_readdir+0xbe
>     genunix`fop_readdir+0xe8
>     genunix`getdents64+0xd5
>     unix`_sys_sysenter_post_swapgs+0x149
>   39540

Well, there you go. tmpfs is not zfs.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
SCALE 10x, Los Angeles, Jan 20-22, 2012
Aubrey Li
2012-Mar-26 05:11 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 11:34 AM, Richard Elling
<richard.elling at gmail.com> wrote:
> On Mar 25, 2012, at 6:51 PM, Aubrey Li wrote:
>> Here is the output. I changed to "tick-5sec" and "trunc(@, 5)".
>> No.2 and No.3 are what I care about.
>>
>> [...]
>
> Well, there you go. tmpfs is not zfs.
>  -- richard

Let me explain below:

The apache document root is in zfs; you see it at No.3 in the above
dtrace report.

tmpfs (/tmp) is the place where PHP places its temporary folders and files.

If you go back to my first post, you'll see the zfs root lock contention
as well.

Hope we are on the same page now.

Thanks,
-Aubrey
Richard Elling
2012-Mar-26 05:19 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
Apologies to the ZFSers, this thread really belongs elsewhere.

On Mar 25, 2012, at 10:11 PM, Aubrey Li wrote:

> [...]
>
> Let me explain below:
>
> The apache document root is in zfs; you see it at No.3 in the above
> dtrace report.

The sort is in reverse order. The large number you see below the stack
trace is the number of times that stack was seen. By far the most
frequently seen is tmpfs`tmp_readdir.

> tmpfs (/tmp) is the place where PHP places its temporary folders and files.

bingo

> If you go back to my first post, you'll see the zfs root lock contention
> as well.
>
> Hope we are on the same page now.
>
> Thanks,
> -Aubrey

 -- richard

-- 
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Fajar A. Nugraha
2012-Mar-26 05:47 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 12:19 PM, Richard Elling
<richard.elling at richardelling.com> wrote:
> Apologies to the ZFSers, this thread really belongs elsewhere.

Some of the info in it is informative for other zfs users as well though :)

> Here is the output. I changed to "tick-5sec" and "trunc(@, 5)".
> No.2 and No.3 are what I care about.

> The sort is in reverse order. The large number you see below the stack
> trace is the number of times that stack was seen. By far the most
> frequently seen is tmpfs`tmp_readdir.

So to summarize, even though there are locks in zfs, they are normal, and
insignificant performance-wise (at least in this case). And if performance
is a problem, the cause is elsewhere. Is that correct?

-- 
Fajar
Aubrey Li
2012-Mar-26 05:51 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 1:19 PM, Richard Elling
<richard.elling at richardelling.com> wrote:
> Apologies to the ZFSers, this thread really belongs elsewhere.
>
>> Let me explain below:
>>
>> The apache document root is in zfs; you see it at No.3 in the above
>> dtrace report.
>
> The sort is in reverse order. The large number you see below the stack
> trace is the number of times that stack was seen. By far the most
> frequently seen is tmpfs`tmp_readdir.

It's true, but I didn't see any potential issues there.

>> tmpfs (/tmp) is the place where PHP places its temporary folders and files.
>
> bingo

I have to paste the lock investigation here again.

Firstly, you can see which lock is spinning:
======================================================
# lockstat -D 10 sleep 2

Adaptive mutex spin: 1862701 events in 3.678 seconds (506499 events/sec)

Count indv cuml rcnt     nsec Lock                       Caller
-------------------------------------------------------------------------------
829064  45%  45% 0.00   33280 0xffffff117624a5d0         rrw_enter_read+0x1b
705001  38%  82% 0.00   30983 0xffffff117624a5d0         rrw_exit+0x1d
140678   8%  90% 0.00   10546 0xffffff117624a6e0         zfs_zget+0x46
 37208   2%  92% 0.00    5403 0xffffff114b136840         vn_rele+0x1e
 33926   2%  94% 0.00    5417 0xffffff114b136840         lookuppnatcred+0xc5
 27188   1%  95% 0.00    1155 vn_vfslocks_buckets+0xd980 vn_vfslocks_getlock+0x3b
 11073   1%  96% 0.00    1639 vn_vfslocks_buckets+0x4600 vn_vfslocks_getlock+0x3b
  9321   1%  96% 0.00    1961 0xffffff114b82a680         dnlc_lookup+0x83
  6929   0%  97% 0.00    1590 0xffffff11573b8f28         zfs_fastaccesschk_execute+0x6a
  5746   0%  97% 0.00    5935 0xffffff114b136840         lookuppnvp+0x566
-------------------------------------------------------------------------------

Then if you look at the caller of lock 0xffffff117624a5d0, you'll see it's
ZFS, not tmpfs:
===========================================================
Count indv cuml rcnt     nsec Lock                       Caller
48494   6%  17% 0.00   145263 0xffffff117624a5d0         rrw_enter_read+0x1b

  nsec ------ Time Distribution ------ count     Stack
   256 |                               17        rrw_enter+0x2c
   512 |                               1120      zfs_root+0x3b
  1024 |@                              1718      fsop_root+0x2d
  2048 |@@                             4834      traverse+0x65
  4096 |@@@@@@@@@@@                    18569     lookuppnvp+0x446
  8192 |@@@@                           6620      lookuppnatcred+0x119
 16384 |@                              2929      lookupnameatcred+0x97
 32768 |@                              1635      lookupnameat+0x6b
 65536 |                               894       cstatat_getvp+0x11e
131072 |                               1249      cstatat64_32+0x5d
262144 |@                              1620      fstatat64_32+0x4c
524288 |@                              2474      _sys_sysenter_post_swapgs+0x149

That's why I posted this subject here. Hope it's clear this time.

Thanks,
-Aubrey
Aubrey Li
2012-Mar-26 10:27 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 4:33 PM, <Casper.Dik at oracle.com> wrote:
>
>> I'm migrating a webserver (apache+php) from RHEL to Solaris. During the
>> stress testing comparison, I found that under the same number of client
>> sessions, CPU% is ~70% on RHEL while CPU% is full on Solaris.
>
> Which version of Solaris is this?

This is Solaris 11.

>> The apache document root is apparently in a zfs dataset. It looks like each
>> file system operation runs into zfs root mutex contention.
>>
>> Is this expected behavior? If so, is there any zfs tunable to
>> address this issue?
>
> The zfs_root() function is the function which is used when a
> mountpoint is traversed. Where is the apache document root
> directory and under which mountpoints?

I have an external storage server and I created a zfs pool on it.

# zpool list
NAME      SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
export1   800G   277G   523G  34%  1.00x  ONLINE  -
rpool     117G   109G  8.44G  92%  1.00x  ONLINE  -

This pool is mounted under /export, and all of the apache documents are in
this pool at /export/webserver/apache2/htdocs.

The php temporary folder is set to /tmp, which is tmpfs.

Thanks,
-Aubrey

> Casper
Jim Klimov
2012-Mar-26 11:28 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
2012-03-26 14:27, Aubrey Li wrote:
> The php temporary folder is set to /tmp, which is tmpfs.

By the way, how much RAM does the box have available?

"tmpfs" in Solaris is backed by "virtual memory". It is like a RAM disk,
although maybe slower than the ramdisk FS as seen in the livecd. As long as
there is enough free RAM it stays in memory, but tmpfs can get swapped out
to disk.

If your installation followed the default wizard, your swap (or part of it)
could be in rpool/swap and backed by ZFS - leading to both many tmpfs
accesses and many ZFS accesses. (No idea about the mutex spinning part
though - if it is normal or not.)

Also I'm not sure if tmpfs gets the benefits of caching, and the ZFS ARC
cache can consume lots of RAM and thus push tmpfs out to swap.

As a random guess, try pointing the PHP tmp directory to /var/tmp (backed
by zfs) and see if any behaviors change?

Good luck,
//Jim
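A few quick checks along those lines - whether swap is on a ZFS zvol, how
full /tmp is, and how big the ARC currently is (standard Solaris commands,
a sketch; exact output varies by release):

# swap -l                        # swap devices; /dev/zvol/dsk/rpool/swap means swap lives on ZFS
# swap -s                        # virtual swap usage summary
# df -h /tmp                     # how much tmpfs is holding
# kstat -p zfs:0:arcstats:size   # current ARC size in bytes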
Jim Mauro
2012-Mar-26 12:24 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
You care about #2 and #3 because you are fixated on a ZFS root lock
contention problem, and not open to a broader discussion about what your
real problem actually is. I am not saying there is not lock contention, and
I am not saying there is - I'll look at the data carefully later when I have
more time.

Your problem statement, which took 20 emails to glean, is that the Solaris
system consumes more CPU than Linux on the same hardware, doing roughly the
same amount of work, and delivering roughly the same level of performance -
is that correct?

Please consider that, in Linux, you have no observability into kernel lock
statistics (at least, none that I know of) - Linux uses kernel locks also,
and for this workload, it seems likely to me that if you could observe those
statistics, you would see numbers that would lead you to conclude you have
lock contention in Linux.

Let's talk about THE PROBLEM - Linux is 15% sys, 55% usr, Solaris is 30%
sys, 70% usr, running the same workload, doing the same amount of work,
delivering the same level of performance. Please validate that problem
statement.

On Mar 25, 2012, at 9:51 PM, Aubrey Li wrote:

> Here is the output. I changed to "tick-5sec" and "trunc(@, 5)".
> No.2 and No.3 are what I care about.
>
> Thanks,
> -Aubrey
>
> [...]
Aubrey Li
2012-Mar-26 13:53 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 7:28 PM, Jim Klimov <jimklimov at cos.ru> wrote:
> 2012-03-26 14:27, Aubrey Li wrote:
>>
>> The php temporary folder is set to /tmp, which is tmpfs.
>
> By the way, how much RAM does the box have available?
> "tmpfs" in Solaris is backed by "virtual memory". It is like a RAM disk,
> although maybe slower than the ramdisk FS as seen in the livecd. As long
> as there is enough free RAM it stays in memory, but tmpfs can get swapped
> out to disk.
>
> If your installation followed the default wizard, your swap (or part of it)
> could be in rpool/swap and backed by ZFS - leading to both many tmpfs
> accesses and many ZFS accesses. (No idea about the mutex spinning part
> though - if it is normal or not.)
>
> Also I'm not sure if tmpfs gets the benefits of caching, and the ZFS ARC
> cache can consume lots of RAM and thus push tmpfs out to swap.
>
> As a random guess, try pointing the PHP tmp directory to /var/tmp (backed
> by zfs) and see if any behaviors change?
>
> Good luck,
> //Jim

Thanks for your suggestions. Actually the default PHP tmp directory was
/var/tmp, and I changed "/var/tmp" to "/tmp". This reduced zfs root lock
contention significantly. However, I still see a bunch of lock contention.
So I'm here asking for help.

Thanks,
-Aubrey
Aubrey Li
2012-Mar-26 14:27 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 8:24 PM, Jim Mauro <james.mauro at oracle.com> wrote:
>
> You care about #2 and #3 because you are fixated on a ZFS root
> lock contention problem, and not open to a broader discussion
> about what your real problem actually is. I am not saying there is
> not lock contention, and I am not saying there is - I'll look at the
> data carefully later when I have more time.
>
> Your problem statement, which took 20 emails to glean, is that the
> Solaris system consumes more CPU than Linux on the same
> hardware, doing roughly the same amount of work, and delivering
> roughly the same level of performance - is that correct?
>
> Please consider that, in Linux, you have no observability into
> kernel lock statistics (at least, none that I know of). Linux uses kernel
> locks also, and for this workload, it seems likely to me that, could
> you observe those statistics, you would see numbers that would
> lead you to conclude you have lock contention in Linux.
>
> Let's talk about THE PROBLEM - Linux is 15% sys, 55% usr,
> Solaris is 30% sys, 70% usr, running the same workload,
> doing the same amount of work, delivering the same level
> of performance. Please validate that problem statement.
>

You're definitely right. I'm running the same workload, doing the same
amount of work, delivering the same level of performance on the same
hardware platform; even the root disk types are exactly the same. So the
userland software stack is exactly the same. The difference is:

  linux is 15% sys, 55% usr.
  solaris is 30% sys, 70% usr.

Basically I agree with Fajar: this is probably not a fair comparison.
A robust system does not only deliver the highest performance; we should
also consider reliability, availability, serviceability, energy efficiency
and other server-related features. No doubt ZFS is the most excellent file
system on the planet.

As Richard pointed out, if we look at the mpstat output:
=================================================================
SET minf mjf xcal  intr  ithr   csw  icsw  migr   smtx  srw  syscl usr sys wt idl sze
  0 35140   0 2380 59742 19476 93056 30906 32919 256336 1104 967806  65  35  0   0  32

Besides smtx 256336 (spins on mutex), I need to look at:
  icsw  30906: involuntary context switches
  migr  32919: thread migrations
  syscl 967806: system calls

And given that Solaris spends 70% of its CPU in userland, I probably need
to break down how much time it spends in apache, libphp, libc, etc.
(What's your approach for the usr% breakdown you mentioned above, BTW?)

I am not open to a broader discussion because this is the zfs mailing list.
The zfs root lock contention is what I have observed so far, and it is what
I can post on this forum. I can take care of the other aspects and may ask
for help somewhere else.

I admit I didn't dig into linux, and I agree there could be lock contention
there as well. But since solaris consumes more CPU% and hence more system
power than linux, I think I have to look at solaris first, to see if any
tuning work needs to be done. Do you agree this is the right way to go
ahead?

Thanks,
-Aubrey
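
On the question of breaking the usr% down by module: a minimal sketch using the DTrace profile provider follows. The sample rate, run length and truncation count are arbitrary choices, not anything prescribed in the thread.
=========================================================
# Sketch only -- attribute sampled user CPU time to the shared object
# (apache, libphp, libc, ...) that owns the sampled PC:
dtrace -n '
profile-997
/arg1/                          /* non-zero arg1 => sampled in user mode */
{
        @[execname, umod(arg1)] = count();
}
tick-30sec
{
        trunc(@, 20);           /* top 20 process/module pairs */
        exit(0);
}'
=========================================================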
Richard Elling
2012-Mar-26 15:41 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
I see nothing unusual in the lockstat data. I think you're barking up the
wrong tree.
 -- richard

On Mar 25, 2012, at 10:51 PM, Aubrey Li wrote:

> On Mon, Mar 26, 2012 at 1:19 PM, Richard Elling
> <richard.elling at richardelling.com> wrote:
>> Apologies to the ZFSers, this thread really belongs elsewhere.
>>
>> Let me explain below:
>>
>> Root documentation path of apache is in zfs, you see
>> it at No.3 in the above dtrace report.
>>
>>
>> The sort is in reverse order. The large number you see below the
>> stack trace is the number of times that stack was seen. By far the
>> most frequently seen is tmpfs`tmp_readdir
>
> It's true, but I didn't see any potential issues there.
>
>>
>>
>> tmpfs(/tmp) is the place where PHP places the temporary
>> folders and files.
>>
>>
>> bingo
>>
>
> I have to paste the lock investigation here again.
>
> Firstly, you can see which lock is spinning:
> ======================================================
> # lockstat -D 10 sleep 2
>
> Adaptive mutex spin: 1862701 events in 3.678 seconds (506499 events/sec)
>
> Count indv cuml rcnt     nsec Lock                       Caller
> -------------------------------------------------------------------------------
> 829064  45%  45% 0.00    33280 0xffffff117624a5d0         rrw_enter_read+0x1b
> 705001  38%  82% 0.00    30983 0xffffff117624a5d0         rrw_exit+0x1d
> 140678   8%  90% 0.00    10546 0xffffff117624a6e0         zfs_zget+0x46
>  37208   2%  92% 0.00     5403 0xffffff114b136840         vn_rele+0x1e
>  33926   2%  94% 0.00     5417 0xffffff114b136840         lookuppnatcred+0xc5
>  27188   1%  95% 0.00     1155 vn_vfslocks_buckets+0xd980 vn_vfslocks_getlock+0x3b
>  11073   1%  96% 0.00     1639 vn_vfslocks_buckets+0x4600 vn_vfslocks_getlock+0x3b
>   9321   1%  96% 0.00     1961 0xffffff114b82a680         dnlc_lookup+0x83
>   6929   0%  97% 0.00     1590 0xffffff11573b8f28         zfs_fastaccesschk_execute+0x6a
>   5746   0%  97% 0.00     5935 0xffffff114b136840         lookuppnvp+0x566
> -------------------------------------------------------------------------------
>
> Then if you look at the caller of lock 0xffffff117624a5d0, you'll see
> it's ZFS, not tmpfs.
> ===========================================================
> Count indv cuml rcnt     nsec Lock                   Caller
> 48494   6%  17% 0.00   145263 0xffffff117624a5d0     rrw_enter_read+0x1b
>
>      nsec ------ Time Distribution ------ count     Stack
>       256 |                               17        rrw_enter+0x2c
>       512 |                               1120      zfs_root+0x3b
>      1024 |@                              1718      fsop_root+0x2d
>      2048 |@@                             4834      traverse+0x65
>      4096 |@@@@@@@@@@@                    18569     lookuppnvp+0x446
>      8192 |@@@@                           6620      lookuppnatcred+0x119
>     16384 |@                              2929      lookupnameatcred+0x97
>     32768 |@                              1635      lookupnameat+0x6b
>     65536 |                               894       cstatat_getvp+0x11e
>    131072 |                               1249      cstatat64_32+0x5d
>    262144 |@                              1620      fstatat64_32+0x4c
>    524288 |@                              2474      _sys_sysenter_post_swapgs+0x149
>
>
> That's why I posted this subject here. Hope it's clear this time.
>
> Thanks,
> -Aubrey

--
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
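
One way to double-check what the hot lock address actually belongs to is to ask mdb. This is only a sketch, using the address reported by lockstat above; the output is of course specific to that particular boot.
=========================================================
# Sketch only -- identify the kernel object containing the hot lock,
# then list mounted filesystems to see which vfs/dataset it maps to:
echo "0xffffff117624a5d0::whatis" | mdb -k
echo "::fsinfo" | mdb -k | grep zfs
=========================================================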
Jim Klimov
2012-Mar-26 17:15 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
> > As a random guess, try pointing the PHP tmp directory to
> > /var/tmp (backed by zfs) and see if any behaviors change?
> >
> > Good luck,
> > //Jim
> >
>
> Thanks for your suggestions. Actually, the default PHP tmp directory
> was /var/tmp, and I changed it from "/var/tmp" to "/tmp". This reduced
> the zfs root lock contention significantly. However, I still see a
> bunch of lock contention, so I'm asking for help here.

Well, as a further attempt down this road, is it possible for you to rule out
ZFS from swapping - i.e. if RAM amounts permit, disable the swap altogether
(swap -d /dev/zvol/dsk/rpool/swap) or relocate it to dedicated slices of the
same, or better yet separate, disks?

If you do have lots of swapping activity (that can be seen in "vmstat 1" as
the si/so columns) going on in a zvol, you're likely to get much fragmentation
in the pool, and searching for contiguous stretches of space can become
tricky (and time-consuming), or larger writes can get broken down into
many smaller random writes and/or "gang blocks", which is also slower.
At least such waiting on disks can explain the overall large kernel times.

You can also see the disk wait time ratio in the "iostat -xzn 1" column "%w"
and the disk busy time ratio in "%b" (second and third from the right).
I don't remember you posting that.

If these are running in the tens, or even close or equal to 100%, then
your disks are the actual bottleneck. Speeding up that subsystem,
including the addition of cache (ARC RAM, L2ARC SSD, maybe ZIL
SSD/DDRDrive) and combatting fragmentation by moving swap and
other scratch spaces to dedicated pools or raw slices, might help.

HTH,
//Jim
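
As a sketch, the checks suggested above gathered in one place; the zvol path is the default rpool layout assumed earlier in the thread, so adjust it to the actual swap device.
=========================================================
# Sketch only -- quick checks for zvol-backed swap activity and disk
# saturation, using standard Solaris tools:
swap -l                                  # list swap devices and usage
vmstat 1 5                               # watch the si/so (swap-in/out) columns
iostat -xzn 1 5                          # watch %w and %b per device
swap -d /dev/zvol/dsk/rpool/swap         # (optional) drop the zvol-backed swap
=========================================================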
Aubrey Li
2012-Mar-27 03:13 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Tue, Mar 27, 2012 at 1:15 AM, Jim Klimov <jim at cos.ru> wrote:
> Well, as a further attempt down this road, is it possible for you to rule
> out ZFS from swapping - i.e. if RAM amounts permit, disable the swap
> altogether (swap -d /dev/zvol/dsk/rpool/swap) or relocate it to dedicated
> slices of the same, or better yet separate, disks?
>

Thanks Jim for your suggestion!

> If you do have lots of swapping activity (that can be seen in "vmstat 1" as
> the si/so columns) going on in a zvol, you're likely to get much
> fragmentation in the pool, and searching for contiguous stretches of space
> can become tricky (and time-consuming), or larger writes can get broken
> down into many smaller random writes and/or "gang blocks", which is also
> slower. At least such waiting on disks can explain the overall large
> kernel times.

I took swapping activity into account; even when the CPU% is 100%, "si"
(swap-ins) and "so" (swap-outs) are always ZEROs.

>
> You can also see the disk wait time ratio in the "iostat -xzn 1" column "%w"
> and the disk busy time ratio in "%b" (second and third from the right).
> I don't remember you posting that.
>
> If these are running in the tens, or even close or equal to 100%, then
> your disks are the actual bottleneck. Speeding up that subsystem,
> including the addition of cache (ARC RAM, L2ARC SSD, maybe ZIL
> SSD/DDRDrive) and combatting fragmentation by moving swap and
> other scratch spaces to dedicated pools or raw slices, might help.

My storage system is not very busy, and there are only read operations.
====================================
# iostat -xnz 3
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  112.4    0.0 1691.5    0.0  0.0  0.5    0.0    4.8   0  41 c11t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  118.7    0.0 1867.0    0.0  0.0  0.5    0.0    4.5   0  42 c11t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  127.7    0.0 2121.6    0.0  0.0  0.6    0.0    4.7   0  44 c11t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  141.3    0.0 2158.5    0.0  0.0  0.7    0.0    4.6   0  48 c11t0d0
=============================================

Thanks,
-Aubrey
Phil Harman
2012-Mar-27 07:24 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
One of the glories of Solaris is that it is so very observable. Then there are
the many excellent blog posts, wiki entries, and books - some of which are
authored by contributors to this very thread - explaining how Solaris works.

But these virtues are also a snare to some, and it is not uncommon for even
experienced practitioners to become fixated by this or that.

Aubrey, a 70:30 user to system ratio is pretty respectable. Running at 100%
is not so pretty (e.g. I would be surprised if you DIDN'T see many involuntary
context switches (icsw) in such a scenario). Esteemed experts have assured
you that ZFS lock contention is not your main concern. I would run with that.

You said it was a stress test. That raises many questions for me. I am much
more interested in how systems perform in the real world. In my experience,
many of the issues we find in production are not the ones we found in the
lab. Indeed, I would argue that too many systems are tuned to run simplistic
benchmarks instead of real workloads.

However, I don't think it's helpful to continue this discussion here. There are
some esteemed experienced practitioners "out there" who would be only too
happy to provide holistic systems performance tuning and health check
services to you on a commercial basis (I hear that some may even accept
PayPal).

Phil (phil at harmanholistix.com)
"Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D."
2012-Mar-27 13:42 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
Hi,

You did not answer the questions:
- What is the RAM of the server? How many sockets and cores, etc.?
- What is the block size of zfs?
- What is the cache RAM of your SAN array?
- What is the block size/stripe size of your RAID in the SAN array? RAID 5 or what?
- What is your test program, and how is it driven (from what kind of client)?

Regards

On 3/26/2012 11:13 PM, Aubrey Li wrote:
> On Tue, Mar 27, 2012 at 1:15 AM, Jim Klimov <jim at cos.ru> wrote:
>> Well, as a further attempt down this road, is it possible for you to rule
>> out ZFS from swapping - i.e. if RAM amounts permit, disable the swap
>> altogether (swap -d /dev/zvol/dsk/rpool/swap) or relocate it to dedicated
>> slices of the same, or better yet separate, disks?
>>
> Thanks Jim for your suggestion!
>
>
>> If you do have lots of swapping activity (that can be seen in "vmstat 1" as
>> the si/so columns) going on in a zvol, you're likely to get much
>> fragmentation in the pool, and searching for contiguous stretches of space
>> can become tricky (and time-consuming), or larger writes can get broken
>> down into many smaller random writes and/or "gang blocks", which is also
>> slower. At least such waiting on disks can explain the overall large
>> kernel times.
> I took swapping activity into account; even when the CPU% is 100%, "si"
> (swap-ins) and "so" (swap-outs) are always ZEROs.
>
>> You can also see the disk wait time ratio in the "iostat -xzn 1" column "%w"
>> and the disk busy time ratio in "%b" (second and third from the right).
>> I don't remember you posting that.
>>
>> If these are running in the tens, or even close or equal to 100%, then
>> your disks are the actual bottleneck. Speeding up that subsystem,
>> including the addition of cache (ARC RAM, L2ARC SSD, maybe ZIL
>> SSD/DDRDrive) and combatting fragmentation by moving swap and
>> other scratch spaces to dedicated pools or raw slices, might help.
> My storage system is not very busy, and there are only read operations.
> ====================================
> # iostat -xnz 3
>                     extended device statistics
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>   112.4    0.0 1691.5    0.0  0.0  0.5    0.0    4.8   0  41 c11t0d0
>                     extended device statistics
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>   118.7    0.0 1867.0    0.0  0.0  0.5    0.0    4.5   0  42 c11t0d0
>                     extended device statistics
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>   127.7    0.0 2121.6    0.0  0.0  0.6    0.0    4.7   0  44 c11t0d0
>                     extended device statistics
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>   141.3    0.0 2158.5    0.0  0.0  0.7    0.0    4.6   0  48 c11t0d0
> =============================================
>
> Thanks,
> -Aubrey

--
Hung-Sheng Tsao Ph.D.
Founder & Principal
HopBit GridComputing LLC
cell: 9734950840

http://laotsao.blogspot.com/
http://laotsao.wordpress.com/
http://blogs.oracle.com/hstsao/
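
For what it's worth, most of those questions can be answered with standard tools; this is only a sketch, and "rpool/export/web" is a placeholder name for whichever dataset holds the apache docroot.
=========================================================
# Sketch only -- gather the system and pool facts being asked for:
prtconf | grep -i memory              # physical RAM
psrinfo -vp                           # sockets, cores, threads
zfs list -o name,mountpoint           # find the docroot dataset
zfs get recordsize rpool/export/web   # ZFS record (block) size for it
zpool status                          # pool layout: mirror, raidz, single disk
=========================================================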
Aubrey Li
2012-Mar-28 09:12 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
It looks like I was posting on the wrong mailing list; I thought this mailing
list included developers. The experiment I did is not for a commercial
purpose. The purpose of the comparison is to find optimization opportunities
in the entire software stack on both linux and solaris.

As for this zfs root lock: currently a special re-entrant rwlock is
implemented here for ZFS only. The interesting thing is, all the http and
https requests of my benchmark perform read ops on the apache server, yet we
still saw lots of mutex spin events on this rwlock in the lockstat report.
I'll continue to investigate whether this is a conflict of design philosophy.
At the least, this lock does not behave well for this kind of read-mostly,
write-rarely case. If anyone is interested in this topic, I can send the
update offline.

Thanks,
-Aubrey

On Tue, Mar 27, 2012 at 9:42 PM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D."
<laotsao at gmail.com> wrote:
> Hi,
> You did not answer the questions:
> - What is the RAM of the server? How many sockets and cores, etc.?
> - What is the block size of zfs?
> - What is the cache RAM of your SAN array?
> - What is the block size/stripe size of your RAID in the SAN array? RAID 5
> or what?
> - What is your test program, and how is it driven (from what kind of client)?
> Regards
>
>
> On 3/26/2012 11:13 PM, Aubrey Li wrote:
>>
>> On Tue, Mar 27, 2012 at 1:15 AM, Jim Klimov <jim at cos.ru> wrote:
>>>
>>> Well, as a further attempt down this road, is it possible for you to rule
>>> out ZFS from swapping - i.e. if RAM amounts permit, disable the swap
>>> altogether (swap -d /dev/zvol/dsk/rpool/swap) or relocate it to dedicated
>>> slices of the same, or better yet separate, disks?
>>>
>> Thanks Jim for your suggestion!
>>
>>
>>> If you do have lots of swapping activity (that can be seen in "vmstat 1"
>>> as the si/so columns) going on in a zvol, you're likely to get much
>>> fragmentation in the pool, and searching for contiguous stretches of space
>>> can become tricky (and time-consuming), or larger writes can get broken
>>> down into many smaller random writes and/or "gang blocks", which is also
>>> slower. At least such waiting on disks can explain the overall large
>>> kernel times.
>> I took swapping activity into account; even when the CPU% is 100%, "si"
>> (swap-ins) and "so" (swap-outs) are always ZEROs.
>>
>>> You can also see the disk wait time ratio in the "iostat -xzn 1" column
>>> "%w" and the disk busy time ratio in "%b" (second and third from the right).
>>> I don't remember you posting that.
>>>
>>> If these are running in the tens, or even close or equal to 100%, then
>>> your disks are the actual bottleneck. Speeding up that subsystem,
>>> including the addition of cache (ARC RAM, L2ARC SSD, maybe ZIL
>>> SSD/DDRDrive) and combatting fragmentation by moving swap and
>>> other scratch spaces to dedicated pools or raw slices, might help.
>> My storage system is not very busy, and there are only read operations.
>> ====================================
>> # iostat -xnz 3
>>                     extended device statistics
>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>>   112.4    0.0 1691.5    0.0  0.0  0.5    0.0    4.8   0  41 c11t0d0
>>                     extended device statistics
>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>>   118.7    0.0 1867.0    0.0  0.0  0.5    0.0    4.5   0  42 c11t0d0
>>                     extended device statistics
>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>>   127.7    0.0 2121.6    0.0  0.0  0.6    0.0    4.7   0  44 c11t0d0
>>                     extended device statistics
>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>>   141.3    0.0 2158.5    0.0  0.0  0.7    0.0    4.6   0  48 c11t0d0
>> =============================================
>>
>> Thanks,
>> -Aubrey
>
> --
> Hung-Sheng Tsao Ph.D.
> Founder & Principal
> HopBit GridComputing LLC
> cell: 9734950840
>
> http://laotsao.blogspot.com/
> http://laotsao.wordpress.com/
> http://blogs.oracle.com/hstsao/
>
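
For anyone continuing the investigation Aubrey describes: the DTrace lockstat provider can keep attributing mutex spin events to kernel stacks without pinning a specific lock address. A minimal sketch, with an arbitrary run length and stack depth:
=========================================================
# Sketch only -- count which kernel code paths spin on adaptive
# mutexes during a 30-second window of the benchmark:
dtrace -n '
lockstat:::adaptive-spin
{
        @[stack(8)] = count();
}
tick-30sec
{
        trunc(@, 10);           /* ten most frequent spin stacks */
        exit(0);
}'
=========================================================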