We have a serious performance problem on our server. Here is some data:

<pre>
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                    1133252              4426   31%
Anon                      1956988              7644   53%
Exec and libs               31104               121    1%
Page cache                 332818              1300    9%
Free (cachelist)            77813               303    2%
Free (freelist)            135815               530    4%

Total                     3667790             14327
Physical                  3593201             14035
</pre>

<pre>
sar -u 5 10:
18:06:58    %usr    %sys    %wio   %idle
18:07:03       8      57       0      35
18:07:08       3      22       0      75
18:07:14       3      66       0      31
18:07:19       3      16       0      81
18:07:24       4      52       0      44
18:07:29       3      20       0      77
18:07:34       2      60       0      38
18:07:39       2      39       0      59
18:07:44       2      50       0      48
18:07:49       2      21       0      77

Average        3      40       0      57
</pre>

A lot of system time is eating up the CPU. vmstat shows:

<pre>
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re   mf  pi po fr de sr s1 s2 s3 s4  in    sy    cs  us sy id
 0 0 0 2593264 373392 182 1013   4 0 0 0 0 0 0 0  23 1483   9112 1862  3  9 88
 0 0 0 2647980 425032 246 1168   2 0 0 0 0 0 0 0  23  896  23589 2229  4 10 86
 0 0 0 2645524 424328 221 1091   3 0 0 0 0 0 0 0  20  872   8795 1870  3  9 88
 0 0 0 2621896 403968 206  969   2 0 0 0 0 0 0 0  23  839   9171 2091  3  9 88
 0 0 0 2601288 382732 217  946   2 0 0 0 0 0 0 0  21  679  98075 1783  3 11 87
 0 0 0 2580244 362876 239 1161   3 0 0 0 0 0 0 0  55 1649 106163 2221  5 13 82
 0 0 0 2651656 420528 225 1181   2 0 0 0 0 0 0 0  16  645   9846 1887  3  9 88
 0 0 0 2697620 428048 268 1449 481 0 0 0 0 0 0 0  35 1339  50362 2453  3 11 85
 3 0 0 2956632 488440 180  712  37 0 0 0 0 0 0 0  78  907  58331 2310  3 13 84
 0 0 0 2643064 382292 339 1884 294 0 0 0 0 0 0 0  45 1893   9649 2282  5 11 84
 0 0 0 2784544 422340 224 1192   8 0 0 0 0 0 0 0  88 1041 112430 4572  7 15 79

(this is when system bogs down)

12 0 0 2815300 406156 292 1451  66 0 0 0 0 0 0 0 282 4993 110489 4649  6 24 70
11 0 0 2596252 370944 304 1910  27 0 0 0 0 0 0 0 223 2404  57232 3445  7 48 45
12 0 0 2654676 423784 199 1016  10 0 0 0 0 0 0 0 203 1470   9183 3672  3 48 49
 6 0 0 2601900 380100 218 1039   7 0 0 0 0 0 0 0 221 2310  10486 4025  4 41 56
10 0 0 2649432 407956 332 1484  16 0 0 0 0 0 0 0 198 3757  10921 4291  6 40 53
 8 0 0 2626320 397504 198 1101  14 0 0 0 0 0 0 0 203 1840  10345 3940  5 40 55
19 0 0 2598156 375780 209 1188   2 0 0 0 0 0 0 0 229 2229   8940 3465 10 48 42
18 0 0 2643936 423656 176  794   9 0 0 0 0 0 0 0 168 1306   8182 3165  8 39 53
 7 0 0 2711160 474176 248  675  22 0 0 0 0 0 0 0  84 1147   8616 2461  2 23 74
</pre>

Notice the run queue. Is there a DTrace script (from the DTT package) that I can use to figure out what is going on?

mpstat shows:

<pre>
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0 3511  48   97   784  197 1067   32  239  365    0  5814    5  43   0  52
  1 1287  28   43   429    0  901   37  215  314    0  2821    3  40   0  57
  2 2954  54  155  1442 1079 1176   26  241  339    0  4927    4  42   0  54
  3 1364  20  886   167   16  655   32  184  299    0  3939    4  41   0  55
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0 3523  14   46   486  197 1129   50  251  411    0  6895    7  52   0  41
  1 1536   8   31   119    0  922   53  220  375    0  4149    4  51   0  45
  2 3160  11   76  1251 1177 1058   56  239  403    0  5987    5  57   0  38
  3 1592   5   38   102    2  725   50  189  363    0  3929    4  51   0  45
</pre>

and when things *appear* to be good:

<pre>
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  355   0   14   680  202  631    5  146   67    0  2225    2  13   0  85
  1   59   0  804    29    0  593   13  173   48    0  1948    2   3   0  95
  2  455   0   13   648  363  675    7  179   43    0  4473    3   8   0  89
  3   96   0    7   293    2  419    6  165   40    0  2434    2   9   0  89
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  379   0   12   610  202  821    7  174   62    0  1594    4   7   0  89
  1  189   0   23   223    0  646   15  182   49    0  1695    3   7   0  90
  2  322   0  582   565  535  695   10  169   45    0  2477   12  14   0  75
  3  216   0    9   221    2  439   11  168   39    0  1845   12   5   0  83
</pre>

(the idle time is much higher)

The only thing that stands out to me is the high smtx.
--
This message posted from opensolaris.org
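Since the question asks for a DTT script: the DTraceToolkit's `hotkernel` profiles where kernel time is spent, which is the natural first step given the high %sys. A minimal sketch of the same idea (assuming a Solaris box with DTrace, run as root; this is an illustration, not the toolkit script itself):

```d
#!/usr/sbin/dtrace -s

/*
 * Sample all CPUs at 997 Hz; arg0 is the kernel PC, so it is
 * non-zero only when the sample landed in kernel code.
 */
profile-997
/arg0/
{
        @stacks[stack()] = count();
}

/* After 30 seconds, print the 10 hottest kernel stacks and exit. */
tick-30s
{
        trunc(@stacks, 10);
        printa(@stacks);
        exit(0);
}
```

The hottest stacks should show whether the system time is going to mutex spins (which the smtx numbers hint at), cross-calls, or something else entirely.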
What does the system do? What apps is it running, etc.?

Jim
----
Anil wrote:
> We have a serious performance problem on our server. Here is some data:
> [::memstat, sar, vmstat and mpstat output quoted in full -- snipped; see the original post above]
System config as well as lockstat output, please. If networking is involved, netstat output would also be of use. Between what Jim has asked for and this, we should have enough information to point you in the proper direction for figuring out what is going on.

Dave Valin

On 07/09/09 01:30, James Litchfield wrote:
> What does the system do? What apps is it running, etc.
>
> Jim
> ----
> Anil wrote:
>> We have a serious performance problem on our server. Here is some data:
>> [::memstat, sar, vmstat and mpstat output quoted in full -- snipped]
>
> _______________________________________________
> dtrace-discuss mailing list
> dtrace-discuss at opensolaris.org
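The lockstat data asked for above can be gathered with invocations along these lines (a sketch, run as root on Solaris; flags per lockstat(1M), and the 30-second window is an arbitrary choice):

```shell
# Adaptive mutex / spin lock contention during a 30s window:
# high smtx in mpstat usually means mutex spins, and lockstat
# names the locks and their callers.
lockstat -C -D 20 sleep 30

# Kernel profiling view (profiling interrupt, coalesced by caller):
# shows which kernel functions the CPUs are actually burning time in.
lockstat -kIW -D 20 sleep 30
```

Capturing one run during a "good" interval and one while the system bogs down would make the comparison much easier.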
Hi,

I'm probably not the biggest expert in performance analysis, but some things struck me as odd in your output.

Anil wrote:
> We have a serious performance problem on our server. Here is some data:
>
> sar -u 5 10:
> 18:06:58    %usr    %sys    %wio   %idle
> 18:07:03       8      57       0      35
> 18:07:08       3      22       0      75
> 18:07:14       3      66       0      31
> 18:07:19       3      16       0      81
> 18:07:24       4      52       0      44
> 18:07:29       3      20       0      77
> 18:07:34       2      60       0      38
> 18:07:39       2      39       0      59
> 18:07:44       2      50       0      48
> 18:07:49       2      21       0      77
>
> Average        3      40       0      57
>
> A lot of system time is eating up the CPU. Using vmstat shows:

Even at peak load you still have 31% CPU idle, so if you hadn't said you were having performance issues, I would assume this machine was simply doing I/O but still had room to grow.

>  kthr      memory            page            disk          faults      cpu
>  r b w   swap  free  re  mf pi po fr de sr s1 s2 s3 s4 in   sy   cs us sy id
>  0 0 0 2593264 373392 182 1013 4 0 0 0 0 0 0 0 23 1483 9112 1862 3 9 88
>  0 0 0 2647980 425032 246 1168 2 0 0 0 0 0 0 0 23 896 23589 2229 4 10 86
>  0 0 0 2645524 424328 221 1091 3 0 0 0 0 0 0 0 20 872 8795 1870 3 9 88
<snip>

The CPUs still have a lot of idle time, and the po column is always zero, so no serious CPU or memory pressure shows up here.

> (this is when system bogs down)
>
> 12 0 0 2815300 406156 292 1451 66 0 0 0 0 0 0 0 282 4993 110489 4649 6 24 70
> 11 0 0 2596252 370944 304 1910 27 0 0 0 0 0 0 0 223 2404 57232 3445 7 48 45
> 12 0 0 2654676 423784 199 1016 10 0 0 0 0 0 0 0 203 1470 9183 3672 3 48 49

The run queue shoots up, and pi (page-in: the rate at which Solaris loads pages into memory) also shoots up, but po (page-out: the rate at which Solaris frees memory pages) stays zero. That is somewhat strange; I would expect po to move away from zero from time to time. Can you tell us what interval you used with vmstat, so we have an idea of the sample size?

> Notice the run queue. Is there a DTrace script (from the DTT package) that I can use to figure out what is going on?
>
> mpstat shows:
>
> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
>   0 3511  48   97   784  197 1067   32  239  365    0  5814    5  43   0  52
<snip -- full mpstat output for the bad and good intervals quoted in the original post>
>
> (the idle time is much higher)
>
> The only thing I see is a high smtx?

vmstat and mpstat have some differences in how they measure things; that's why it's important to use both. By the looks of it, you have an application whose threading configuration is a little too aggressive, and the CPUs spend a lot of time switching context. But it's pretty hard to point at anything specific without knowing:

- What is this machine?
- What applications is it running? When did the problems start, what happened then, etc.
- I hope you already did this, but I'll ask just the same: is the hardware all checked? Is iostat -E output the same when run 24 hours apart? Does /var/adm/messages show any errors (retriable disk errors, for example)?
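To spot the smtx spikes this thread keeps coming back to, the mpstat output can be filtered with a bit of awk (smtx is column 10 of the output quoted above; the threshold of 100 events per interval is an arbitrary illustration, not a Solaris rule -- in real use you would pipe `mpstat 5` into the filter):

```shell
# Print only the per-CPU mpstat lines whose smtx (field 10) exceeds
# a threshold; header lines fail the numeric test on field 1.
printf '%s\n' \
  '  0 3511  48   97   784  197 1067   32  239  365    0  5814    5  43   0  52' \
  '  0  355   0   14   680  202  631    5  146   67    0  2225    2  13   0  85' |
awk -v thresh=100 '$1 ~ /^[0-9]+$/ && $10 > thresh { print "CPU " $1 ": smtx=" $10 }'
# prints: CPU 0: smtx=365
```

The first sample line comes from the "bad" interval (smtx 365) and is flagged; the second comes from the "good" interval (smtx 67) and passes through silently.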