Aubrey Li
2012-Mar-25 05:29 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
Hi,

I'm migrating a webserver (apache+php) from RHEL to Solaris. During the
stress testing comparison, I found that under the same number of client
sessions, CPU% is ~70% on RHEL while CPU% is full on Solaris.

After some investigation, the zfs root lock emerges as a major suspect.

Firstly, there are a bunch of mutex events.
===========================================
# lockstat -D 10 sleep 2

Adaptive mutex spin: 1862701 events in 3.678 seconds (506499 events/sec)

Count indv cuml rcnt     nsec Lock                       Caller
-------------------------------------------------------------------------------
829064  45%  45% 0.00   33280 0xffffff117624a5d0         rrw_enter_read+0x1b
705001  38%  82% 0.00   30983 0xffffff117624a5d0         rrw_exit+0x1d
140678   8%  90% 0.00   10546 0xffffff117624a6e0         zfs_zget+0x46
 37208   2%  92% 0.00    5403 0xffffff114b136840         vn_rele+0x1e
 33926   2%  94% 0.00    5417 0xffffff114b136840         lookuppnatcred+0xc5
 27188   1%  95% 0.00    1155 vn_vfslocks_buckets+0xd980 vn_vfslocks_getlock+0x3b
 11073   1%  96% 0.00    1639 vn_vfslocks_buckets+0x4600 vn_vfslocks_getlock+0x3b
  9321   1%  96% 0.00    1961 0xffffff114b82a680         dnlc_lookup+0x83
  6929   0%  97% 0.00    1590 0xffffff11573b8f28         zfs_fastaccesschk_execute+0x6a
  5746   0%  97% 0.00    5935 0xffffff114b136840         lookuppnvp+0x566
-------------------------------------------------------------------------------

If we look at the lock caller's stacks, the following two are the hotspots.
=============================================================
# lockstat -D 10 -l 0xffffff117624a5d0 -s 15 sleep 2

Adaptive mutex spin: 865609 events in 3.237 seconds (267444 events/sec)
-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Lock                       Caller
51215   6%   6% 0.00   100023 0xffffff117624a5d0         rrw_exit+0x1d

  nsec ------ Time Distribution ------ count     Stack
   256 |                               61        zfs_root+0x78
   512 |@                              1746      fsop_root+0x2d
  1024 |@@@                            5210      traverse+0x65
  2048 |@@@@@@@                        13464     kkkk+0x446
  4096 |@@@@@@@                        13553     lookuppnatcred+0x119
  8192 |@@@                            5355      lookupnameatcred+0x97
 16384 |@                              2403      lookupnameat+0x6b
 32768 |                               1331      cstatat_getvp+0x11e
 65536 |                               673       cstatat64_32+0x5d
131072 |                               961       fstatat64_32+0x4c
262144 |                               1259      _sys_sysenter_post_swapgs+0x149
-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Lock                       Caller
48494   6%  17% 0.00   145263 0xffffff117624a5d0         rrw_enter_read+0x1b

  nsec ------ Time Distribution ------ count     Stack
   256 |                               17        rrw_enter+0x2c
   512 |                               1120      zfs_root+0x3b
  1024 |@                              1718      fsop_root+0x2d
  2048 |@@                             4834      traverse+0x65
  4096 |@@@@@@@@@@@                    18569     lookuppnvp+0x446
  8192 |@@@@                           6620      lookuppnatcred+0x119
 16384 |@                              2929      lookupnameatcred+0x97
 32768 |@                              1635      lookupnameat+0x6b
 65536 |                               894       cstatat_getvp+0x11e
131072 |                               1249      cstatat64_32+0x5d
262144 |@                              1620      fstatat64_32+0x4c
524288 |@                              2474      _sys_sysenter_post_swapgs+0x149
===========================================================

The apache document root is apparently in a zfs dataset. It looks like each
file system operation runs into zfs root mutex contention.

Is this expected behavior? If so, is there any zfs tunable to
address this issue?

Thanks,
-Aubrey
Richard Elling
2012-Mar-25 07:55 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mar 24, 2012, at 10:29 PM, Aubrey Li wrote:

> Hi,
>
> I'm migrating a webserver (apache+php) from RHEL to Solaris. During the
> stress testing comparison, I found that under the same number of client
> sessions, CPU% is ~70% on RHEL while CPU% is full on Solaris.
>
> After some investigation, the zfs root lock emerges as a major suspect.

This data does not seem major, less than 1% of total walltime.
What is the ratio of user/sys?
 -- richard

> Firstly, there are a bunch of mutex events.
> [...]
>
> The apache document root is apparently in a zfs dataset. It looks like each
> file system operation runs into zfs root mutex contention.
>
> Is this expected behavior? If so, is there any zfs tunable to
> address this issue?
>
> Thanks,
> -Aubrey

-- 
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Aubrey Li
2012-Mar-25 13:24 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Sun, Mar 25, 2012 at 3:55 PM, Richard Elling
<richard.elling at richardelling.com> wrote:
> On Mar 24, 2012, at 10:29 PM, Aubrey Li wrote:
>
> Hi,
>
> I'm migrating a webserver (apache+php) from RHEL to Solaris. During the
> stress testing comparison, I found that under the same number of client
> sessions, CPU% is ~70% on RHEL while CPU% is full on Solaris.
>
> After some investigation, the zfs root lock emerges as a major suspect.
>
> This data does not seem major, less than 1% of total walltime.
> What is the ratio of user/sys?
>  -- richard

Here is the vmstat output.
==========================================================
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s1 s1 s2 s2   in   sy   cs us sy id
 37 0 0 25257648 34744756 26671 40446 0 0 0 0 0 0 0 258 0 54473 833569 100171 67 31 2
 16 0 0 25270872 34745544 26416 40427 0 0 0 0 0 0 0 161 0 53804 850816 100016 68 29 3
 80 0 0 25266716 34731880 27082 41500 0 0 0 0 0 0 0 145 0 56908 901578 98194 70 28 1
 81 0 0 25318348 34763484 26436 40384 0 0 0 0 0 0 0 173 0 59977 1014699 99273 66 33 0
 104 0 0 25359888 34787868 25539 39726 0 0 0 0 0 0 0 122 0 58175 989927 93106 68 32 0
 113 0 0 25359208 34779872 25746 39992 0 0 0 0 0 0 0 161 0 57212 916427 94401 67 33 0
 123 0 0 25409656 34809704 26228 41330 0 0 0 0 0 0 0 176 0 59332 890511 97127 69 31 0
 97 0 0 25419628 34807796 26508 42013 0 0 0 0 0 0 0 148 0 59736 864022 98918 71 29 0
 111 0 0 25468728 34834968 26477 40232 0 0 0 0 0 0 0 159 0 59270 867006 98308 67 33 0
 105 0 0 25501972 34860952 26776 42204 0 0 0 0 0 0 0 163 0 70868 867899 97981 71 29 0
 78 0 0 25526492 34865416 26489 43504 0 0 0 0 0 0 0 157 0 61110 952207 99806 67 33 0
=================================================================

and "mpstat -a" matches the above measure. ~200,000 spins on mutex.
=================================================================
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 35140   0 2380 59742 19476 93056 30906 32919 256336 1104 967806  65  35   0   0  32
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 36044   0 3268 61910 19482 96287 32121 32844 245293 1122 954150  64  36   0   0  32
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 36516   0 3900 61459 19478 95502 31256 35345 221473 1106 932210  66  34   0   0  32
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 35791   0 3544 60751 19371 97982 31158 33920 209179 3115 859644  64  36   0   0  32
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 38317   0 4892 63046 19566 97316 31688 38100 220262 2058 915555  65  35   0   0  32
SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
  0 36408   0 3154 60766 19393 94343 31244 35582 248314  811 980967  65  35   0   0  32
==================================================================

Thanks,
-Aubrey

> Firstly, there are a bunch of mutex events.
> [...]
Richard Elling
2012-Mar-25 16:48 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
This is the wrong forum for general purpose performance tuning. So I won't
continue this much farther. Notice the huge number of icsw, that is a bigger
symptom than locks.
 -- richard

On Mar 25, 2012, at 6:24 AM, Aubrey Li wrote:

> SET minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl sze
>   0 35140   0 2380 59742 19476 93056 30906 32919 256336 1104 967806  65  35   0   0  32
> [...]

-- 
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Aubrey Li
2012-Mar-25 17:25 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 12:48 AM, Richard Elling
<richard.elling at richardelling.com> wrote:
> This is the wrong forum for general purpose performance tuning. So I won't
> continue this much farther. Notice the huge number of icsw, that is a bigger
> symptom than locks.
>  -- richard

Thanks anyway, but the lock must be a problem. The scenario here is: apache
issues a bunch of stat() syscalls, which leads to a bunch of zfs vnode
accesses, beyond the normal read/write operations.

The problem is that every zfs vnode access needs the **same zfs root**
lock. When the number of httpd processes and the corresponding kernel
threads becomes large, this root lock contention becomes horrible. This
situation does not occur on linux.

Let me see if any others have a clue for me. Any suggestions will be
highly appreciated!

Regards,
-Aubrey

> On Mar 25, 2012, at 6:24 AM, Aubrey Li wrote:
>
> [...]
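For what it's worth, a quick way to confirm how much of the load really is
stat()-family traffic, and which processes issue it, is a short syscall
count along these lines (a rough sketch; the wildcard just matches the
stat-family syscall names on this release):

# dtrace -n 'syscall::*stat*:entry { @[probefunc, execname] = count(); }
  tick-10sec { trunc(@, 20); exit(0); }'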
zfs user
2012-Mar-25 18:10 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On 3/25/12 10:25 AM, Aubrey Li wrote:> On Mon, Mar 26, 2012 at 12:48 AM, Richard Elling > <richard.elling at richardelling.com> wrote: >> This is the wrong forum for general purpose performance tuning. So I won''t >> continue this much farther. Notice the huge number of icsw, that is a >> bigger >> symptom than locks. >> -- richard > > thanks anyway, lock must be a problem. the scenario here is, apache > causes a bunch of stat() > syscall, which leads to a bunch of zfs vnode access, byond the normal > read/write operation. > > The problem is, every zfs vnode access need the **same zfs root** > lock. When the number of > httpd processes and the corresponding kernel threads becomes large, > this root lock contention > becomes horrible. This situation does not occurs on linux. > > Let me see if any others have a clue for me. Any suggestions will be > highly appreciated! > > Regards, > -AubreyIs noatime turned on for all your zfs filesystems? Have you tried the latest version of httpd?
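If it helps: atime can be checked and toggled per dataset, e.g. (the dataset
name below is just a placeholder):

# zfs get atime export1/webserver
# zfs set atime=off export1/webserver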
Richard Elling
2012-Mar-25 18:58 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mar 25, 2012, at 10:25 AM, Aubrey Li wrote:

> Thanks anyway, but the lock must be a problem. The scenario here is: apache
> issues a bunch of stat() syscalls, which leads to a bunch of zfs vnode
> accesses, beyond the normal read/write operations.

That does not explain the high icsw. In a properly sized system icsw will be
on the order of 0 to 100, closer to 0.

If there are millions of files, then you should check the DNLC hit rate.

> The problem is that every zfs vnode access needs the **same zfs root**
> lock. When the number of httpd processes and the corresponding kernel
> threads becomes large, this root lock contention becomes horrible. This
> situation does not occur on linux.

I disagree with your conclusion, and I've seen ZFS systems do millions of
stats() per second without issue. What does prstat -Lm show?
 -- richard

> Let me see if any others have a clue for me. Any suggestions will be
> highly appreciated!
>
> Regards,
> -Aubrey
>
> [...]

-- 
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
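For reference, the DNLC hit rate can be read from the dnlcstats kstat or
from vmstat -s, e.g. (standard Solaris commands; exact field names may vary
a little by release):

# kstat -n dnlcstats | egrep 'hits|misses'
# vmstat -s | grep 'name lookups'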
Aubrey Li
2012-Mar-25 19:03 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 2:10 AM, zfs user <zfsml at itsbeen.sent.com> wrote:
> On 3/25/12 10:25 AM, Aubrey Li wrote:
>>
>> The problem is that every zfs vnode access needs the **same zfs root**
>> lock. When the number of httpd processes and the corresponding kernel
>> threads becomes large, this root lock contention becomes horrible. This
>> situation does not occur on linux.
>>
>> Let me see if any others have a clue for me. Any suggestions will be
>> highly appreciated!
>>
>> Regards,
>> -Aubrey
>
> Is noatime turned on for all your zfs filesystems?
> Have you tried the latest version of httpd?

Yeah, atime is turned off. It doesn't help reduce the zfs vnode accesses.

I didn't try the latest version of httpd. I want to get the same performance
with the same userland software stack as on Linux.

Thanks,
-Aubrey
Aubrey Li
2012-Mar-25 19:13 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 2:58 AM, Richard Elling
<richard.elling at richardelling.com> wrote:
> That does not explain the high icsw. In a properly sized system icsw will be
> on the order of 0 to 100, closer to 0.
>
> If there are millions of files, then you should check the DNLC hit rate.

Involuntary context switches might be another story I need to take care of.
But I'm not aware that it's controllable.

> I disagree with your conclusion, and I've seen ZFS systems do millions of
> stats() per second without issue. What does prstat -Lm show?
>  -- richard

I had never seen any issues until I did a comparison with Linux.

Thanks,
-Aubrey
Fajar A. Nugraha
2012-Mar-25 19:22 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 2:13 AM, Aubrey Li <aubreylee at gmail.com> wrote:
>> The problem is that every zfs vnode access needs the **same zfs root**
>> lock. When the number of httpd processes and the corresponding kernel
>> threads becomes large, this root lock contention becomes horrible. This
>> situation does not occur on linux.
>>
>> I disagree with your conclusion, and I've seen ZFS systems do millions of
>> stats() per second without issue. What does prstat -Lm show?
>>  -- richard
>
> I had never seen any issues until I did a comparison with Linux.

So basically you're comparing linux + ext3/4 performance with solaris + zfs,
on the same hardware? That's not really fair, is it?

If your load is I/O-intensive enough, ext3/4 will easily win since it doesn't
have to do things like calculating checksums.

Now if you compare it to... say... btrfs, or zfsonlinux, it'd be more in the
same ballpark.

-- 
Fajar
Jim Mauro
2012-Mar-25 20:18 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
If you're chasing CPU utilization, specifically %sys (time in the kernel),
I would start with a time-based kernel profile:

# dtrace -n 'profile-997hz /arg0/ { @[stack()] = count(); } tick-60sec { trunc(@, 20); printa(@); }'

I would be curious to see where the CPU cycles are being consumed first,
before going down the lock path.

This assumes that most or all of the CPU utilization is %sys. If it's %usr,
we take a different approach.

Thanks
/jim

On Mar 25, 2012, at 1:29 AM, Aubrey Li wrote:

> Hi,
>
> I'm migrating a webserver (apache+php) from RHEL to Solaris. During the
> stress testing comparison, I found that under the same number of client
> sessions, CPU% is ~70% on RHEL while CPU% is full on Solaris.
>
> After some investigation, the zfs root lock emerges as a major suspect.
>
> Firstly, there are a bunch of mutex events.
> [...]
>
> The apache document root is apparently in a zfs dataset. It looks like each
> file system operation runs into zfs root mutex contention.
>
> Is this expected behavior? If so, is there any zfs tunable to
> address this issue?
>
> Thanks,
> -Aubrey
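A slightly expanded sketch of that profiling one-liner, with an explicit
exit so it stops after one interval (same idea, just quieter output; adjust
the tick interval to taste):

# dtrace -qn '
    profile-997hz /arg0/ { @[stack()] = count(); }    /* sample kernel stacks only */
    tick-60sec { trunc(@, 20); printa(@); exit(0); }  /* print the top 20, then quit */
'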
Aubrey Li
2012-Mar-26 01:42 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 3:22 AM, Fajar A. Nugraha <work at fajar.net> wrote:
>>
>> I had never seen any issues until I did a comparison with Linux.
>
> So basically you're comparing linux + ext3/4 performance with solaris + zfs,
> on the same hardware? That's not really fair, is it?
> If your load is I/O-intensive enough, ext3/4 will easily win since it doesn't
> have to do things like calculating checksums.
>
> Now if you compare it to... say... btrfs, or zfsonlinux, it'd be more in the
> same ballpark.

I'd be more comfortable if I didn't see the lock contention. Are you
indicating that this (every file vnode access acquiring the same root mutex)
is normal? I'd be fine if the CPU were going to checksum calculation.

Thanks,
-Aubrey
Aubrey Li
2012-Mar-26 01:51 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 4:18 AM, Jim Mauro <james.mauro at oracle.com> wrote:
> If you're chasing CPU utilization, specifically %sys (time in the kernel),
> I would start with a time-based kernel profile:
>
> # dtrace -n 'profile-997hz /arg0/ { @[stack()] = count(); } tick-60sec { trunc(@, 20); printa(@); }'
>
> I would be curious to see where the CPU cycles are being consumed first,
> before going down the lock path.
>
> This assumes that most or all of the CPU utilization is %sys. If it's %usr,
> we take a different approach.

Here is the output. I changed to "tick-5sec" and "trunc(@, 5)".
No.2 and No.3 are what I care about.

Thanks,
-Aubrey

 21  80536                       :tick-5sec
========== 1 ====================
    genunix`avl_walk+0x6a
    genunix`as_gap_aligned+0x2b7
    unix`map_addr_proc+0x179
    unix`map_addr+0x8e
    genunix`choose_addr+0x9e
    zfs`zfs_map+0x161
    genunix`fop_map+0xc5
    genunix`smmap_common+0x268
    genunix`smmaplf32+0xa2
    genunix`syscall_ap+0x92
    unix`_sys_sysenter_post_swapgs+0x149
   1427

========= 2 ================
    unix`mutex_delay_default+0x7
    unix`mutex_vector_enter+0x2ae
    zfs`zfs_zget+0x46
    zfs`zfs_root+0x55
    genunix`fsop_root+0x2d
    genunix`traverse+0x65
    genunix`lookuppnvp+0x446
    genunix`lookuppnatcred+0x119
    genunix`lookupnameatcred+0x97
    genunix`lookupnameat+0x6b
    genunix`vn_openat+0x147
    genunix`copen+0x493
    genunix`openat64+0x2d
    unix`_sys_sysenter_post_swapgs+0x149
   2645

========== 3 ====================
    unix`mutex_delay_default+0x7
    unix`mutex_vector_enter+0x2ae
    zfs`zfs_zget+0x46
    zfs`zfs_root+0x55
    genunix`fsop_root+0x2d
    genunix`traverse+0x65
    genunix`lookuppnvp+0x446
    genunix`lookuppnatcred+0x119
    genunix`lookupnameatcred+0x97
    genunix`lookupnameat+0x6b
    genunix`cstatat_getvp+0x11e
    genunix`cstatat64_32+0x5d
    genunix`fstatat64_32+0x4c
    unix`_sys_sysenter_post_swapgs+0x149
   3201

======== 4 ==========
    unix`i86_mwait+0xd
    unix`cpu_idle_mwait+0x154
    unix`idle+0x116
    unix`thread_start+0x8
   3559

========= 5 =============
    tmpfs`tmp_readdir+0x138
    genunix`fop_readdir+0xe8
    genunix`getdents64+0xd5
    unix`_sys_sysenter_post_swapgs+0x149
   4589

========= 6 ===============
    unix`strlen+0x3
    genunix`fop_readdir+0xe8
    genunix`getdents64+0xd5
    unix`_sys_sysenter_post_swapgs+0x149
   5005

=========== 7 ================
    tmpfs`tmp_readdir+0xc7
    genunix`fop_readdir+0xe8
    genunix`getdents64+0xd5
    unix`_sys_sysenter_post_swapgs+0x149
   9548

========= 8 ======================
    unix`strlen+0x8
    genunix`fop_readdir+0xe8
    genunix`getdents64+0xd5
    unix`_sys_sysenter_post_swapgs+0x149
  11166

============= 9 ==================
    unix`strlen+0xe
    genunix`fop_readdir+0xe8
    genunix`getdents64+0xd5
    unix`_sys_sysenter_post_swapgs+0x149
  14491

============= 10 =================
    tmpfs`tmp_readdir+0xbe
    genunix`fop_readdir+0xe8
    genunix`getdents64+0xd5
    unix`_sys_sysenter_post_swapgs+0x149
  39540
Richard Elling
2012-Mar-26 03:34 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mar 25, 2012, at 6:51 PM, Aubrey Li wrote:

> Here is the output. I changed to "tick-5sec" and "trunc(@, 5)".
> No.2 and No.3 are what I care about.
>
> [...]
>
> ============= 10 =================
>     tmpfs`tmp_readdir+0xbe
>     genunix`fop_readdir+0xe8
>     genunix`getdents64+0xd5
>     unix`_sys_sysenter_post_swapgs+0x149
>   39540

Well, there you go. tmpfs is not zfs.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
SCALE 10x, Los Angeles, Jan 20-22, 2012
Aubrey Li
2012-Mar-26 05:11 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 11:34 AM, Richard Elling
<richard.elling at gmail.com> wrote:
> On Mar 25, 2012, at 6:51 PM, Aubrey Li wrote:
>> Here is the output. I changed to "tick-5sec" and "trunc(@, 5)".
>> No.2 and No.3 are what I care about.
>>
>> [...]
>
> Well, there you go. tmpfs is not zfs.
>  -- richard

Let me explain below:

The apache document root is in zfs; you see it at No.3 in the above
dtrace report.

tmpfs (/tmp) is the place where PHP places its temporary folders and files.

If you go back to my first post, you'll see the zfs root lock contention
as well.

Hope we are on the same page now.

Thanks,
-Aubrey
Richard Elling
2012-Mar-26 05:19 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
Apologies to the ZFSers, this thread really belongs elsewhere.

On Mar 25, 2012, at 10:11 PM, Aubrey Li wrote:

> [...]
>
> Let me explain below:
>
> The apache document root is in zfs; you see it at No.3 in the above
> dtrace report.

The sort is in reverse order. The large number you see below the stack
trace is the number of times that stack was seen. By far the most
frequently seen is tmpfs`tmp_readdir.

> tmpfs (/tmp) is the place where PHP places its temporary folders and files.

bingo

> If you go back to my first post, you'll see the zfs root lock contention
> as well.
>
> Hope we are on the same page now.
>
> Thanks,
> -Aubrey

 -- richard

-- 
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Fajar A. Nugraha
2012-Mar-26 05:47 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 12:19 PM, Richard Elling
<richard.elling at richardelling.com> wrote:
> Apologies to the ZFSers, this thread really belongs elsewhere.

Some of the info in it is informative for other zfs users as well though :)

> Here is the output. I changed to "tick-5sec" and "trunc(@, 5)".
> No.2 and No.3 are what I care about.

> The sort is in reverse order. The large number you see below the stack
> trace is the number of times that stack was seen. By far the most
> frequently seen is tmpfs`tmp_readdir.

So to summarize, even though there are locks in zfs, they are normal, and
insignificant performance-wise (at least in this case). And if performance
is a problem, the cause is elsewhere. Is that correct?

-- 
Fajar
Aubrey Li
2012-Mar-26 05:51 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 1:19 PM, Richard Elling
<richard.elling at richardelling.com> wrote:
> Apologies to the ZFSers, this thread really belongs elsewhere.
>
>> Let me explain below:
>>
>> The apache document root is in zfs; you see it at No.3 in the above
>> dtrace report.
>
> The sort is in reverse order. The large number you see below the stack
> trace is the number of times that stack was seen. By far the most
> frequently seen is tmpfs`tmp_readdir.

It's true, but I didn't see any potential issues there.

>> tmpfs (/tmp) is the place where PHP places its temporary folders and files.
>
> bingo

I have to paste the lock investigation here again.

Firstly, you can see which lock is spinning:
======================================================
# lockstat -D 10 sleep 2

Adaptive mutex spin: 1862701 events in 3.678 seconds (506499 events/sec)

Count indv cuml rcnt     nsec Lock                       Caller
-------------------------------------------------------------------------------
829064  45%  45% 0.00   33280 0xffffff117624a5d0         rrw_enter_read+0x1b
705001  38%  82% 0.00   30983 0xffffff117624a5d0         rrw_exit+0x1d
140678   8%  90% 0.00   10546 0xffffff117624a6e0         zfs_zget+0x46
 37208   2%  92% 0.00    5403 0xffffff114b136840         vn_rele+0x1e
 33926   2%  94% 0.00    5417 0xffffff114b136840         lookuppnatcred+0xc5
 27188   1%  95% 0.00    1155 vn_vfslocks_buckets+0xd980 vn_vfslocks_getlock+0x3b
 11073   1%  96% 0.00    1639 vn_vfslocks_buckets+0x4600 vn_vfslocks_getlock+0x3b
  9321   1%  96% 0.00    1961 0xffffff114b82a680         dnlc_lookup+0x83
  6929   0%  97% 0.00    1590 0xffffff11573b8f28         zfs_fastaccesschk_execute+0x6a
  5746   0%  97% 0.00    5935 0xffffff114b136840         lookuppnvp+0x566
-------------------------------------------------------------------------------

Then if you look at the caller of lock 0xffffff117624a5d0, you'll see it's
ZFS, not tmpfs:
===========================================================
Count indv cuml rcnt     nsec Lock                       Caller
48494   6%  17% 0.00   145263 0xffffff117624a5d0         rrw_enter_read+0x1b

  nsec ------ Time Distribution ------ count     Stack
   256 |                               17        rrw_enter+0x2c
   512 |                               1120      zfs_root+0x3b
  1024 |@                              1718      fsop_root+0x2d
  2048 |@@                             4834      traverse+0x65
  4096 |@@@@@@@@@@@                    18569     lookuppnvp+0x446
  8192 |@@@@                           6620      lookuppnatcred+0x119
 16384 |@                              2929      lookupnameatcred+0x97
 32768 |@                              1635      lookupnameat+0x6b
 65536 |                               894       cstatat_getvp+0x11e
131072 |                               1249      cstatat64_32+0x5d
262144 |@                              1620      fstatat64_32+0x4c
524288 |@                              2474      _sys_sysenter_post_swapgs+0x149

That's why I posted this subject here. Hope it's clear this time.

Thanks,
-Aubrey
Aubrey Li
2012-Mar-26 10:27 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 4:33 PM, <Casper.Dik at oracle.com> wrote:
>
>> I'm migrating a webserver (apache+php) from RHEL to Solaris. During the
>> stress testing comparison, I found that under the same number of client
>> sessions, CPU% is ~70% on RHEL while CPU% is full on Solaris.
>
> Which version of Solaris is this?

This is Solaris 11.

>> The apache document root is apparently in a zfs dataset. It looks like each
>> file system operation runs into zfs root mutex contention.
>>
>> Is this expected behavior? If so, is there any zfs tunable to
>> address this issue?
>
> The zfs_root() function is the function which is used when a
> mountpoint is traversed. Where is the apache document root
> directory and under which mountpoints?

I have an external storage server and I created a zfs pool on it.

# zpool list
NAME      SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
export1   800G   277G   523G  34%  1.00x  ONLINE  -
rpool     117G   109G  8.44G  92%  1.00x  ONLINE  -

This pool is mounted under /export, and all of the apache documents are in
this pool at /export/webserver/apache2/htdocs.

The php temporary folder is set to /tmp, which is tmpfs.

Thanks,
-Aubrey

> Casper
Jim Klimov
2012-Mar-26 11:28 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
2012-03-26 14:27, Aubrey Li wrote:
> The php temporary folder is set to /tmp, which is tmpfs.

By the way, how much RAM does the box have available?

"tmpfs" in Solaris is backed by "virtual memory". It is like a RAM disk,
although maybe slower than the ramdisk FS as seen in the livecd. As long as
there is enough free RAM it stays in memory, but tmpfs can get swapped out
to disk.

If your installation followed the default wizard, your swap (or part of it)
could be in rpool/swap and backed by ZFS - leading to both many tmpfs
accesses and many ZFS accesses. (No idea about the mutex spinning part
though - if it is normal or not.)

Also I'm not sure if tmpfs gets the benefits of caching, and the ZFS ARC
cache can consume lots of RAM and thus push tmpfs out to swap.

As a random guess, try pointing the PHP tmp directory to /var/tmp (backed
by zfs) and see if any behaviors change?

Good luck,
//Jim
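A few quick checks along those lines - whether swap is on a ZFS zvol, how
full /tmp is, and how big the ARC currently is (standard Solaris commands,
a sketch; exact output varies by release):

# swap -l                        # swap devices; /dev/zvol/dsk/rpool/swap means swap lives on ZFS
# swap -s                        # virtual swap usage summary
# df -h /tmp                     # how much tmpfs is holding
# kstat -p zfs:0:arcstats:size   # current ARC size in bytes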
Jim Mauro
2012-Mar-26 12:24 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
You care about #2 and #3 because you are fixated on a ZFS root lock
contention problem, and not open to a broader discussion about what your
real problem actually is. I am not saying there is not lock contention, and
I am not saying there is - I'll look at the data carefully later when I have
more time.

Your problem statement, which took 20 emails to glean, is that the Solaris
system consumes more CPU than Linux on the same hardware, doing roughly the
same amount of work, and delivering roughly the same level of performance -
is that correct?

Please consider that, in Linux, you have no observability into kernel lock
statistics (at least, none that I know of) - Linux uses kernel locks also,
and for this workload, it seems likely to me that if you could observe those
statistics, you would see numbers that would lead you to conclude you have
lock contention in Linux.

Let's talk about THE PROBLEM - Linux is 15% sys, 55% usr, Solaris is 30%
sys, 70% usr, running the same workload, doing the same amount of work,
delivering the same level of performance. Please validate that problem
statement.

On Mar 25, 2012, at 9:51 PM, Aubrey Li wrote:

> Here is the output. I changed to "tick-5sec" and "trunc(@, 5)".
> No.2 and No.3 are what I care about.
>
> Thanks,
> -Aubrey
>
> [...]
Aubrey Li
2012-Mar-26 13:53 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 7:28 PM, Jim Klimov <jimklimov at cos.ru> wrote:
> 2012-03-26 14:27, Aubrey Li wrote:
>>
>> The php temporary folder is set to /tmp, which is tmpfs.
>
> By the way, how much RAM does the box have available?
> "tmpfs" in Solaris is backed by "virtual memory". It is like a RAM disk,
> although maybe slower than the ramdisk FS as seen in the livecd. As long
> as there is enough free RAM it stays in memory, but tmpfs can get swapped
> out to disk.
>
> If your installation followed the default wizard, your swap (or part of it)
> could be in rpool/swap and backed by ZFS - leading to both many tmpfs
> accesses and many ZFS accesses. (No idea about the mutex spinning part
> though - if it is normal or not.)
>
> Also I'm not sure if tmpfs gets the benefits of caching, and the ZFS ARC
> cache can consume lots of RAM and thus push tmpfs out to swap.
>
> As a random guess, try pointing the PHP tmp directory to /var/tmp (backed
> by zfs) and see if any behaviors change?
>
> Good luck,
> //Jim

Thanks for your suggestions. Actually the default PHP tmp directory was
/var/tmp, and I changed "/var/tmp" to "/tmp". This reduced zfs root lock
contention significantly. However, I still see a bunch of lock contention.
So I'm here asking for help.

Thanks,
-Aubrey
Aubrey Li
2012-Mar-26 14:27 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Mon, Mar 26, 2012 at 8:24 PM, Jim Mauro <james.mauro at oracle.com> wrote:
>
> You care about #2 and #3 because you are fixated on a ZFS root
> lock contention problem, and not open to a broader discussion
> about what your real problem actually is. I am not saying there is
> not lock contention, and I am not saying there is - I'll look at the
> data carefully later when I have more time.
>
> Your problem statement, which took 20 emails to glean, is that the
> Solaris system consumes more CPU than Linux on the same
> hardware, doing roughly the same amount of work, and delivering
> roughly the same level of performance - is that correct?
>
> Please consider that, in Linux, you have no observability into
> kernel lock statistics (at least, none that I know of). Linux uses kernel
> locks also, and for this workload, it seems likely to me that, could
> you observe those statistics, you would see numbers that would
> lead you to conclude you have lock contention in Linux.
>
> Let's talk about THE PROBLEM - Linux is 15% sys, 55% usr,
> Solaris is 30% sys, 70% usr, running the same workload,
> doing the same amount of work, delivering the same level
> of performance. Please validate that problem statement.
>

You're definitely right. I'm running the same workload, doing the same
amount of work, delivering the same level of performance on the same
hardware platform; even the root disk types are exactly the same. So the
userland software stack is exactly the same. The difference is:

  linux is 15% sys, 55% usr.
  solaris is 30% sys, 70% usr.

Basically I agree with Fajar: this is probably not a fair comparison.
A robust system does not only deliver the highest performance; we should
also consider reliability, availability, serviceability, energy efficiency
and other server-related features. No doubt ZFS is the most excellent file
system on the planet.

As Richard pointed out, if we look at the mpstat output:
=================================================================
SET minf mjf xcal  intr  ithr   csw  icsw  migr   smtx  srw  syscl usr sys wt idl sze
  0 35140   0 2380 59742 19476 93056 30906 32919 256336 1104 967806  65  35  0   0  32

Besides smtx 256336 (spins on mutex), I need to look at:
  icsw  30906: involuntary context switches
  migr  32919: thread migrations
  syscl 967806: system calls

And given that Solaris spends 70% of its CPU in userland, I probably need
to break down how much time it spends in apache, libphp, libc, etc.
(What's your approach for the usr% breakdown you mentioned above, BTW?)

I am not open to a broader discussion because this is the zfs mailing list.
The zfs root lock contention is what I have observed so far, and it is what
I can post on this forum. I can take care of the other aspects and may ask
for help somewhere else.

I admit I didn't dig into linux, and I agree there could be lock contention
there as well. But since solaris consumes more CPU% and hence more system
power than linux, I think I have to look at solaris first, to see if any
tuning work needs to be done. Do you agree this is the right way to go
ahead?

Thanks,
-Aubrey
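
On the question of breaking the usr% down by module: a minimal sketch using the DTrace profile provider follows. The sample rate, run length and truncation count are arbitrary choices, not anything prescribed in the thread.
=========================================================
# Sketch only -- attribute sampled user CPU time to the shared object
# (apache, libphp, libc, ...) that owns the sampled PC:
dtrace -n '
profile-997
/arg1/                          /* non-zero arg1 => sampled in user mode */
{
        @[execname, umod(arg1)] = count();
}
tick-30sec
{
        trunc(@, 20);           /* top 20 process/module pairs */
        exit(0);
}'
=========================================================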
Richard Elling
2012-Mar-26 15:41 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
I see nothing unusual in the lockstat data. I think you're barking up the
wrong tree.
 -- richard

On Mar 25, 2012, at 10:51 PM, Aubrey Li wrote:

> On Mon, Mar 26, 2012 at 1:19 PM, Richard Elling
> <richard.elling at richardelling.com> wrote:
>> Apologies to the ZFSers, this thread really belongs elsewhere.
>>
>> Let me explain below:
>>
>> Root documentation path of apache is in zfs, you see
>> it at No.3 in the above dtrace report.
>>
>>
>> The sort is in reverse order. The large number you see below the
>> stack trace is the number of times that stack was seen. By far the
>> most frequently seen is tmpfs`tmp_readdir
>
> It's true, but I didn't see any potential issues there.
>
>>
>>
>> tmpfs(/tmp) is the place where PHP places the temporary
>> folders and files.
>>
>>
>> bingo
>>
>
> I have to paste the lock investigation here again.
>
> Firstly, you can see which lock is spinning:
> ======================================================
> # lockstat -D 10 sleep 2
>
> Adaptive mutex spin: 1862701 events in 3.678 seconds (506499 events/sec)
>
> Count indv cuml rcnt     nsec Lock                       Caller
> -------------------------------------------------------------------------------
> 829064  45%  45% 0.00    33280 0xffffff117624a5d0         rrw_enter_read+0x1b
> 705001  38%  82% 0.00    30983 0xffffff117624a5d0         rrw_exit+0x1d
> 140678   8%  90% 0.00    10546 0xffffff117624a6e0         zfs_zget+0x46
>  37208   2%  92% 0.00     5403 0xffffff114b136840         vn_rele+0x1e
>  33926   2%  94% 0.00     5417 0xffffff114b136840         lookuppnatcred+0xc5
>  27188   1%  95% 0.00     1155 vn_vfslocks_buckets+0xd980 vn_vfslocks_getlock+0x3b
>  11073   1%  96% 0.00     1639 vn_vfslocks_buckets+0x4600 vn_vfslocks_getlock+0x3b
>   9321   1%  96% 0.00     1961 0xffffff114b82a680         dnlc_lookup+0x83
>   6929   0%  97% 0.00     1590 0xffffff11573b8f28         zfs_fastaccesschk_execute+0x6a
>   5746   0%  97% 0.00     5935 0xffffff114b136840         lookuppnvp+0x566
> -------------------------------------------------------------------------------
>
> Then if you look at the caller of lock 0xffffff117624a5d0, you'll see
> it's ZFS, not tmpfs.
> ===========================================================
> Count indv cuml rcnt     nsec Lock                   Caller
> 48494   6%  17% 0.00   145263 0xffffff117624a5d0     rrw_enter_read+0x1b
>
>      nsec ------ Time Distribution ------ count     Stack
>       256 |                               17        rrw_enter+0x2c
>       512 |                               1120      zfs_root+0x3b
>      1024 |@                              1718      fsop_root+0x2d
>      2048 |@@                             4834      traverse+0x65
>      4096 |@@@@@@@@@@@                    18569     lookuppnvp+0x446
>      8192 |@@@@                           6620      lookuppnatcred+0x119
>     16384 |@                              2929      lookupnameatcred+0x97
>     32768 |@                              1635      lookupnameat+0x6b
>     65536 |                               894       cstatat_getvp+0x11e
>    131072 |                               1249      cstatat64_32+0x5d
>    262144 |@                              1620      fstatat64_32+0x4c
>    524288 |@                              2474      _sys_sysenter_post_swapgs+0x149
>
>
> That's why I posted this subject here. Hope it's clear this time.
>
> Thanks,
> -Aubrey

--
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
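
One way to double-check what the hot lock address actually belongs to is to ask mdb. This is only a sketch, using the address reported by lockstat above; the output is of course specific to that particular boot.
=========================================================
# Sketch only -- identify the kernel object containing the hot lock,
# then list mounted filesystems to see which vfs/dataset it maps to:
echo "0xffffff117624a5d0::whatis" | mdb -k
echo "::fsinfo" | mdb -k | grep zfs
=========================================================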
Jim Klimov
2012-Mar-26 17:15 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
> > As a random guess, try pointing the PHP tmp directory to
> > /var/tmp (backed by zfs) and see if any behaviors change?
> >
> > Good luck,
> > //Jim
> >
>
> Thanks for your suggestions. Actually, the default PHP tmp directory
> was /var/tmp, and I changed it from "/var/tmp" to "/tmp". This reduced
> the zfs root lock contention significantly. However, I still see a
> bunch of lock contention, so I'm asking for help here.

Well, as a further attempt down this road, is it possible for you to rule out
ZFS from swapping - i.e. if RAM amounts permit, disable the swap altogether
(swap -d /dev/zvol/dsk/rpool/swap) or relocate it to dedicated slices of the
same, or better yet separate, disks?

If you do have lots of swapping activity (that can be seen in "vmstat 1" as
the si/so columns) going on in a zvol, you're likely to get much fragmentation
in the pool, and searching for contiguous stretches of space can become
tricky (and time-consuming), or larger writes can get broken down into
many smaller random writes and/or "gang blocks", which is also slower.
At least such waiting on disks can explain the overall large kernel times.

You can also see the disk wait time ratio in the "iostat -xzn 1" column "%w"
and the disk busy time ratio in "%b" (second and third from the right).
I don't remember you posting that.

If these are running in the tens, or even close or equal to 100%, then
your disks are the actual bottleneck. Speeding up that subsystem,
including the addition of cache (ARC RAM, L2ARC SSD, maybe ZIL
SSD/DDRDrive) and combatting fragmentation by moving swap and
other scratch spaces to dedicated pools or raw slices, might help.

HTH,
//Jim
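
As a sketch, the checks suggested above gathered in one place; the zvol path is the default rpool layout assumed earlier in the thread, so adjust it to the actual swap device.
=========================================================
# Sketch only -- quick checks for zvol-backed swap activity and disk
# saturation, using standard Solaris tools:
swap -l                                  # list swap devices and usage
vmstat 1 5                               # watch the si/so (swap-in/out) columns
iostat -xzn 1 5                          # watch %w and %b per device
swap -d /dev/zvol/dsk/rpool/swap         # (optional) drop the zvol-backed swap
=========================================================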
Aubrey Li
2012-Mar-27 03:13 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
On Tue, Mar 27, 2012 at 1:15 AM, Jim Klimov <jim at cos.ru> wrote:
> Well, as a further attempt down this road, is it possible for you to rule
> out ZFS from swapping - i.e. if RAM amounts permit, disable the swap
> altogether (swap -d /dev/zvol/dsk/rpool/swap) or relocate it to dedicated
> slices of the same, or better yet separate, disks?
>

Thanks Jim for your suggestion!

> If you do have lots of swapping activity (that can be seen in "vmstat 1" as
> the si/so columns) going on in a zvol, you're likely to get much
> fragmentation in the pool, and searching for contiguous stretches of space
> can become tricky (and time-consuming), or larger writes can get broken
> down into many smaller random writes and/or "gang blocks", which is also
> slower. At least such waiting on disks can explain the overall large
> kernel times.

I took swapping activity into account; even when the CPU% is 100%, "si"
(swap-ins) and "so" (swap-outs) are always ZEROs.

>
> You can also see the disk wait time ratio in the "iostat -xzn 1" column "%w"
> and the disk busy time ratio in "%b" (second and third from the right).
> I don't remember you posting that.
>
> If these are running in the tens, or even close or equal to 100%, then
> your disks are the actual bottleneck. Speeding up that subsystem,
> including the addition of cache (ARC RAM, L2ARC SSD, maybe ZIL
> SSD/DDRDrive) and combatting fragmentation by moving swap and
> other scratch spaces to dedicated pools or raw slices, might help.

My storage system is not very busy, and there are only read operations.
====================================
# iostat -xnz 3
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  112.4    0.0 1691.5    0.0  0.0  0.5    0.0    4.8   0  41 c11t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  118.7    0.0 1867.0    0.0  0.0  0.5    0.0    4.5   0  42 c11t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  127.7    0.0 2121.6    0.0  0.0  0.6    0.0    4.7   0  44 c11t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  141.3    0.0 2158.5    0.0  0.0  0.7    0.0    4.6   0  48 c11t0d0
=============================================

Thanks,
-Aubrey
Phil Harman
2012-Mar-27 07:24 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
One of the glories of Solaris is that it is so very observable. Then there are
the many excellent blog posts, wiki entries, and books - some of which are
authored by contributors to this very thread - explaining how Solaris works.

But these virtues are also a snare to some, and it is not uncommon for even
experienced practitioners to become fixated by this or that.

Aubrey, a 70:30 user to system ratio is pretty respectable. Running at 100%
is not so pretty (e.g. I would be surprised if you DIDN'T see many involuntary
context switches (icsw) in such a scenario). Esteemed experts have assured
you that ZFS lock contention is not your main concern. I would run with that.

You said it was a stress test. That raises many questions for me. I am much
more interested in how systems perform in the real world. In my experience,
many of the issues we find in production are not the ones we found in the
lab. Indeed, I would argue that too many systems are tuned to run simplistic
benchmarks instead of real workloads.

However, I don't think it's helpful to continue this discussion here. There are
some esteemed experienced practitioners "out there" who would be only too
happy to provide holistic systems performance tuning and health check
services to you on a commercial basis (I hear that some may even accept
PayPal).

Phil (phil at harmanholistix.com)
"Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D."
2012-Mar-27 13:42 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
Hi,

You did not answer the questions:
- What is the RAM of the server? How many sockets and cores, etc.?
- What is the block size of zfs?
- What is the cache RAM of your SAN array?
- What is the block size/stripe size of your RAID in the SAN array? RAID 5 or what?
- What is your test program, and how is it driven (from what kind of client)?

Regards

On 3/26/2012 11:13 PM, Aubrey Li wrote:
> On Tue, Mar 27, 2012 at 1:15 AM, Jim Klimov <jim at cos.ru> wrote:
>> Well, as a further attempt down this road, is it possible for you to rule
>> out ZFS from swapping - i.e. if RAM amounts permit, disable the swap
>> altogether (swap -d /dev/zvol/dsk/rpool/swap) or relocate it to dedicated
>> slices of the same, or better yet separate, disks?
>>
> Thanks Jim for your suggestion!
>
>
>> If you do have lots of swapping activity (that can be seen in "vmstat 1" as
>> the si/so columns) going on in a zvol, you're likely to get much
>> fragmentation in the pool, and searching for contiguous stretches of space
>> can become tricky (and time-consuming), or larger writes can get broken
>> down into many smaller random writes and/or "gang blocks", which is also
>> slower. At least such waiting on disks can explain the overall large
>> kernel times.
> I took swapping activity into account; even when the CPU% is 100%, "si"
> (swap-ins) and "so" (swap-outs) are always ZEROs.
>
>> You can also see the disk wait time ratio in the "iostat -xzn 1" column "%w"
>> and the disk busy time ratio in "%b" (second and third from the right).
>> I don't remember you posting that.
>>
>> If these are running in the tens, or even close or equal to 100%, then
>> your disks are the actual bottleneck. Speeding up that subsystem,
>> including the addition of cache (ARC RAM, L2ARC SSD, maybe ZIL
>> SSD/DDRDrive) and combatting fragmentation by moving swap and
>> other scratch spaces to dedicated pools or raw slices, might help.
> My storage system is not very busy, and there are only read operations.
> ====================================
> # iostat -xnz 3
>                     extended device statistics
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>   112.4    0.0 1691.5    0.0  0.0  0.5    0.0    4.8   0  41 c11t0d0
>                     extended device statistics
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>   118.7    0.0 1867.0    0.0  0.0  0.5    0.0    4.5   0  42 c11t0d0
>                     extended device statistics
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>   127.7    0.0 2121.6    0.0  0.0  0.6    0.0    4.7   0  44 c11t0d0
>                     extended device statistics
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>   141.3    0.0 2158.5    0.0  0.0  0.7    0.0    4.6   0  48 c11t0d0
> =============================================
>
> Thanks,
> -Aubrey

--
Hung-Sheng Tsao Ph.D.
Founder & Principal
HopBit GridComputing LLC
cell: 9734950840

http://laotsao.blogspot.com/
http://laotsao.wordpress.com/
http://blogs.oracle.com/hstsao/
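
For what it's worth, most of those questions can be answered with standard tools; this is only a sketch, and "rpool/export/web" is a placeholder name for whichever dataset holds the apache docroot.
=========================================================
# Sketch only -- gather the system and pool facts being asked for:
prtconf | grep -i memory              # physical RAM
psrinfo -vp                           # sockets, cores, threads
zfs list -o name,mountpoint           # find the docroot dataset
zfs get recordsize rpool/export/web   # ZFS record (block) size for it
zpool status                          # pool layout: mirror, raidz, single disk
=========================================================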
Aubrey Li
2012-Mar-28 09:12 UTC
[zfs-discuss] webserver zfs root lock contention under heavy load
It looks like I was posting on the wrong mailing list; I thought this mailing
list included developers. The experiment I did is not for a commercial
purpose. The purpose of the comparison is to find optimization opportunities
in the entire software stack on both linux and solaris.

As for this zfs root lock: currently a special re-entrant rwlock is
implemented here for ZFS only. The interesting thing is, all the http and
https requests of my benchmark perform read ops on the apache server, yet we
still saw lots of mutex spin events on this rwlock in the lockstat report.
I'll continue to investigate whether this is a conflict of design philosophy.
At the least, this lock does not behave well for this kind of read-mostly,
write-rarely case. If anyone is interested in this topic, I can send the
update offline.

Thanks,
-Aubrey

On Tue, Mar 27, 2012 at 9:42 PM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D."
<laotsao at gmail.com> wrote:
> Hi,
> You did not answer the questions:
> - What is the RAM of the server? How many sockets and cores, etc.?
> - What is the block size of zfs?
> - What is the cache RAM of your SAN array?
> - What is the block size/stripe size of your RAID in the SAN array? RAID 5
> or what?
> - What is your test program, and how is it driven (from what kind of client)?
> Regards
>
>
> On 3/26/2012 11:13 PM, Aubrey Li wrote:
>>
>> On Tue, Mar 27, 2012 at 1:15 AM, Jim Klimov <jim at cos.ru> wrote:
>>>
>>> Well, as a further attempt down this road, is it possible for you to rule
>>> out ZFS from swapping - i.e. if RAM amounts permit, disable the swap
>>> altogether (swap -d /dev/zvol/dsk/rpool/swap) or relocate it to dedicated
>>> slices of the same, or better yet separate, disks?
>>>
>> Thanks Jim for your suggestion!
>>
>>
>>> If you do have lots of swapping activity (that can be seen in "vmstat 1"
>>> as the si/so columns) going on in a zvol, you're likely to get much
>>> fragmentation in the pool, and searching for contiguous stretches of space
>>> can become tricky (and time-consuming), or larger writes can get broken
>>> down into many smaller random writes and/or "gang blocks", which is also
>>> slower. At least such waiting on disks can explain the overall large
>>> kernel times.
>> I took swapping activity into account; even when the CPU% is 100%, "si"
>> (swap-ins) and "so" (swap-outs) are always ZEROs.
>>
>>> You can also see the disk wait time ratio in the "iostat -xzn 1" column
>>> "%w" and the disk busy time ratio in "%b" (second and third from the right).
>>> I don't remember you posting that.
>>>
>>> If these are running in the tens, or even close or equal to 100%, then
>>> your disks are the actual bottleneck. Speeding up that subsystem,
>>> including the addition of cache (ARC RAM, L2ARC SSD, maybe ZIL
>>> SSD/DDRDrive) and combatting fragmentation by moving swap and
>>> other scratch spaces to dedicated pools or raw slices, might help.
>> My storage system is not very busy, and there are only read operations.
>> ====================================
>> # iostat -xnz 3
>>                     extended device statistics
>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>>   112.4    0.0 1691.5    0.0  0.0  0.5    0.0    4.8   0  41 c11t0d0
>>                     extended device statistics
>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>>   118.7    0.0 1867.0    0.0  0.0  0.5    0.0    4.5   0  42 c11t0d0
>>                     extended device statistics
>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>>   127.7    0.0 2121.6    0.0  0.0  0.6    0.0    4.7   0  44 c11t0d0
>>                     extended device statistics
>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>>   141.3    0.0 2158.5    0.0  0.0  0.7    0.0    4.6   0  48 c11t0d0
>> =============================================
>>
>> Thanks,
>> -Aubrey
>
> --
> Hung-Sheng Tsao Ph.D.
> Founder & Principal
> HopBit GridComputing LLC
> cell: 9734950840
>
> http://laotsao.blogspot.com/
> http://laotsao.wordpress.com/
> http://blogs.oracle.com/hstsao/
>
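
For anyone continuing the investigation Aubrey describes: the DTrace lockstat provider can keep attributing mutex spin events to kernel stacks without pinning a specific lock address. A minimal sketch, with an arbitrary run length and stack depth:
=========================================================
# Sketch only -- count which kernel code paths spin on adaptive
# mutexes during a 30-second window of the benchmark:
dtrace -n '
lockstat:::adaptive-spin
{
        @[stack(8)] = count();
}
tick-30sec
{
        trunc(@, 10);           /* ten most frequent spin stacks */
        exit(0);
}'
=========================================================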