Hi All, I have what I think is a serious performance issue with fbsd 5.3 release. I've read about threading issues, and it seems to me that that is what I'm looking at, but I'm not confident enough to rule out that it might be a hardware issue, a kernel configuration issue, or something to do with the python port. I'd appreciate it if someone would it point out if I'm overlooking something obvious. Otherwise, if it is the problem I think it is, then there seems entirely too little acknowledgement of a major issue. Here's the background. I just got a new (to me) AMD machine and put 5.3 release on it. I'd been very happy with the way my old Intel machine had been performing with 4.10 stable, and I decided to run a simple performance diagnostic on both machines, to wow myself with the amazing performance of the new hardware / kernel combination. However, the result was pretty disappointing. Here are what I think are the pertinent dmesg details. Old rig: FreeBSD 4.10-RELEASE #0: Thu Jul 1 22:47:08 EDT 2004 Timecounter "i8254" frequency 1193182 Hz Timecounter "TSC" frequency 449235058 Hz CPU: Pentium III/Pentium III Xeon/Celeron (449.24-MHz 686-class CPU) New rig: FreeBSD 5.3-RELEASE #0: Fri Nov 5 04:19:18 UTC 2004 Timecounter "i8254" frequency 1193182 Hz quality 0 CPU: AMD Athlon(tm) Processor (995.77-MHz 686-class CPU) Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000 Timecounter "TSC" frequency 995767383 Hz quality 800 Timecounters tick every 10.000 msec The diagnostic I selected was a python program to generate 1 million pseudo-random numbers and then to perform a heap sort on them. That code is included at the foot of this email. I named the file "heapsort.py". I ran it on both machines, using the "time" utility in /usr/bin/ (not the builtin tcsh "time"). So the command line was /usr/bin/time -al -o heapsort.data ./heapsort.py 1000000 A typical result for the old rig was 130.78 real 129.86 user 0.11 sys 22344 maximum resident set size 608 average shared memory size 20528 average unshared data size 128 average unshared stack size 5360 page reclaims 0 page faults 0 swaps 0 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 0 voluntary context switches 2386 involuntary context switches Whereas, the typical result for the new rig looked more like 105.36 real 71.10 user 33.41 sys 23376 maximum resident set size 659 average shared memory size 20796 average unshared data size 127 average unshared stack size 5402 page reclaims 0 page faults 0 swaps 0 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 0 voluntary context switches 10548 involuntary context switches You'll notice that the new rig is indeed a little faster (times in seconds): 105.36 real (new rig) compared with 130.78 real (old rig). However, the new rig spends about 33.41 seconds on system overhead compared with just 0.11 seconds on the old rig. Comparing the rusage stats, the only significant difference is the "involuntary context switches" field, where the old rig has 2386 and the new rig has a whopping 10548. Further, I noticed that the number of context switches on the new rig seems to be more or less exactly one per 10 msec of real time, that is, one per timecounter tick. (I saw this when comparing heapsort.py runs with arguments other than 1000000.) I think the new rig ought to execute this task in about 70 seconds: just over the amount of user time. Assuming that I'm not overlooking something obvious, and that I'm not interpreting a feature as a bug, this business with the context switches strikes me as a bit of a show-stopper. If that's right, it appears to be severely underplayed in the release documentation. I'll be happy if someone would kindly explain to me what's going on here. I'll be even happier to hear of a fix or workaround to remedy the situation. Thanks in advance, -e heapsort.py: #!/usr/local/bin/python -O # $Id: heapsort-python-3.code,v 1.3 2005/04/04 14:56:45 bfulgham Exp $ # # The Great Computer Language Shootout # http://shootout.alioth.debian.org/ # # Updated by Valentino Volonghi for Python 2.4 # Reworked by Kevin Carson to produce correct results and same intent import sys IM = 139968 IA = 3877 IC = 29573 LAST = 42 def gen_random(max) : global LAST LAST = (LAST * IA + IC) % IM return( (max * LAST) / IM ) def heapsort(n, ra) : ir = n l = (n >> 1) + 1 while True : if l > 1 : l -= 1 rra = ra[l] else : rra = ra[ir] ra[ir] = ra[1] ir -= 1 if ir == 1 : ra[1] = rra return i = l j = l << 1 while j <= ir : if (j < ir) and (ra[j] < ra[j + 1]) : j += 1 if rra < ra[j] : ra[i] = ra[j] i = j j += j else : j = ir + 1; ra[i] = rra; def main() : if len(sys.argv) == 2 : N = int(sys.argv[1]) else : N = 1 ary = [None]*(N + 1) for i in xrange(1, N + 1) : ary[i] = gen_random(1.0) heapsort(N, ary) print "%.10f" % ary[N] main()
Ewan Todd wrote:> Hi All, > > I have what I think is a serious performance issue with fbsd 5.3 > release. I've read about threading issues, and it seems to me that > that is what I'm looking at, but I'm not confident enough to rule out > that it might be a hardware issue, a kernel configuration issue, or > something to do with the python port. I'd appreciate it if someone > would it point out if I'm overlooking something obvious. Otherwise, > if it is the problem I think it is, then there seems entirely too > little acknowledgement of a major issue. > > Here's the background. I just got a new (to me) AMD machine and put > 5.3 release on it. I'd been very happy with the way my old Intel > machine had been performing with 4.10 stable, and I decided to run a > simple performance diagnostic on both machines, to wow myself with the > amazing performance of the new hardware / kernel combination. > However, the result was pretty disappointing. > > Here are what I think are the pertinent dmesg details. > > Old rig: > > FreeBSD 4.10-RELEASE #0: Thu Jul 1 22:47:08 EDT 2004 > Timecounter "i8254" frequency 1193182 Hz > Timecounter "TSC" frequency 449235058 Hz > CPU: Pentium III/Pentium III Xeon/Celeron (449.24-MHz 686-class CPU) > > New rig: > > FreeBSD 5.3-RELEASE #0: Fri Nov 5 04:19:18 UTC 2004 > Timecounter "i8254" frequency 1193182 Hz quality 0 > CPU: AMD Athlon(tm) Processor (995.77-MHz 686-class CPU) > Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000 > Timecounter "TSC" frequency 995767383 Hz quality 800 > Timecounters tick every 10.000 msec > > The diagnostic I selected was a python program to generate 1 million > pseudo-random numbers and then to perform a heap sort on them. That > code is included at the foot of this email. I named the file > "heapsort.py". I ran it on both machines, using the "time" utility in > /usr/bin/ (not the builtin tcsh "time"). So the command line was > > /usr/bin/time -al -o heapsort.data ./heapsort.py 1000000 > > A typical result for the old rig was > > 130.78 real 129.86 user 0.11 sys > 22344 maximum resident set size > 608 average shared memory size > 20528 average unshared data size > 128 average unshared stack size > 5360 page reclaims > 0 page faults > 0 swaps > 0 block input operations > 0 block output operations > 0 messages sent > 0 messages received > 0 signals received > 0 voluntary context switches > 2386 involuntary context switches > > Whereas, the typical result for the new rig looked more like > > 105.36 real 71.10 user 33.41 sys > 23376 maximum resident set size > 659 average shared memory size > 20796 average unshared data size > 127 average unshared stack size > 5402 page reclaims > 0 page faults > 0 swaps > 0 block input operations > 0 block output operations > 0 messages sent > 0 messages received > 0 signals received > 0 voluntary context switches > 10548 involuntary context switches > > You'll notice that the new rig is indeed a little faster (times in > seconds): 105.36 real (new rig) compared with 130.78 real (old rig). > > However, the new rig spends about 33.41 seconds on system overhead > compared with just 0.11 seconds on the old rig. Comparing the rusage > stats, the only significant difference is the "involuntary context > switches" field, where the old rig has 2386 and the new rig has a > whopping 10548. Further, I noticed that the number of context > switches on the new rig seems to be more or less exactly one per 10 > msec of real time, that is, one per timecounter tick. (I saw this > when comparing heapsort.py runs with arguments other than 1000000.) > > I think the new rig ought to execute this task in about 70 seconds: > just over the amount of user time. Assuming that I'm not overlooking > something obvious, and that I'm not interpreting a feature as a bug, > this business with the context switches strikes me as a bit of a > show-stopper. If that's right, it appears to be severely underplayed > in the release documentation. > > I'll be happy if someone would kindly explain to me what's going on > here. I'll be even happier to hear of a fix or workaround to remedy > the situation. > > Thanks in advance, > > -e > >First of all, make sure that you have WITNESS and INVARIANTS off in your kernel. You might also want to recompile your kernel with the SMP option turned off. Scott
At 11:00 AM 09/05/2005, Ewan Todd wrote:>Here's the background. I just got a new (to me) AMD machine and put >5.3 release on it. I'd been very happy with the way my old IntelThere have been quite a few changes since 5.3. If you are starting fresh, I would strongly recommend going to 5.4 RC4. There have been a number of improvements that might help the problems you are seeing. ---Mike
> Whereas, the typical result for the new rig looked more like > > 105.36 real 71.10 user 33.41 sys...> 10548 involuntary context switchesNow I just ran this test myself. This machine is a 2.4 gig P4 with hyperthreading enabled. Much as I am an AMD fan, I would expect a 1gig Athlon to be significantly slower than a 2.4 gig Pentium 4. but: 93.45 real 56.55 user 36.85 sys 1857 involuntary context switches Uhhh... so it takes almost the same time to do the calculation, but spends actually *more* of it in system space. Does far less context switches though, but I am assuming thats due to HTT. Numbers look very odd to me. So I then ran it on another P4 system we have round here which is still running 4.11. This is a 2.66 gig P4, not a 2.4 so it should be a bit faster, but: 33.77 real 33.49 user 0.07 sys 711 involuntary context switches Over two and a half times faster ?! Thats not right at all! All the new systems I have tried are 5.4-RC4, so should be the latest and greatest. When my colleague finishes on his machine I can try a GENERIC 5.4-RC4 kernel on another P4 and see what that gives. -pcf.
Same test on a 5.3-STABLE from 31.01.2005: 81,90 real 77,05 user 3,51 sys 22908 maximum resident set size 620 average shared memory size 20083 average unshared data size 128 average unshared stack size 5379 page reclaims 26 page faults 0 swaps 36 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 62 voluntary context switches 10623 involuntary context switches This is a on a slow dual-processor system: Timecounter "i8254" frequency 1193182 Hz quality 0 CPU: Intel Pentium III (732.98-MHz 686-class CPU) Origin = "GenuineIntel" Id = 0x683 Stepping = 3 Features=0x387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CM OV,PAT,PSE36,PN,MMX,FXSR,SSE> real memory = 2281635840 (2175 MB) avail memory = 2232012800 (2128 MB) ACPI APIC Table: <PTLTD APIC > FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs cpu0 (BSP): APIC ID: 0 cpu1 (AP): APIC ID: 1 (if this is of any help). Scheduler is 4BSD. Regards, Holger Kipp -----Original Message----- From: owner-freebsd-stable@freebsd.org on behalf of Pete French Sent: Mon 09.05.2005 18:10 To: ewan@mathcode.net; freebsd-stable@freebsd.org Subject: Re: Performance issue> Whereas, the typical result for the new rig looked more like > > 105.36 real 71.10 user 33.41 sys...> 10548 involuntary context switchesNow I just ran this test myself. This machine is a 2.4 gig P4 with hyperthreading enabled. Much as I am an AMD fan, I would expect a 1gig Athlon to be significantly slower than a 2.4 gig Pentium 4. but: 93.45 real 56.55 user 36.85 sys 1857 involuntary context switches Uhhh... so it takes almost the same time to do the calculation, but spends actually *more* of it in system space. Does far less context switches though, but I am assuming thats due to HTT. Numbers look very odd to me. So I then ran it on another P4 system we have round here which is still running 4.11. This is a 2.66 gig P4, not a 2.4 so it should be a bit faster, but: 33.77 real 33.49 user 0.07 sys 711 involuntary context switches Over two and a half times faster ?! Thats not right at all! All the new systems I have tried are 5.4-RC4, so should be the latest and greatest. When my colleague finishes on his machine I can try a GENERIC 5.4-RC4 kernel on another P4 and see what that gives. -pcf. _______________________________________________ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
> > > >Whereas, the typical result for the new rig looked more like > > > > 105.36 real 71.10 user 33.41 sys > > ... > > 10548 involuntary context switches > > > > > > First of all, make sure that you have WITNESS and INVARIANTS off in your > kernel. You might also want to recompile your kernel with the SMP > option turned off. > > ScottFirst of all, thanks to Mike Tancsa for suggesting 5.4 RC4 and to Pete French for running the test independently on the higher spec machines with 5.4 RC4 on them, confirming the system time thing, ruling out an AMD problem, dissociating the system time result from the context switching, and saving me the trouble of rediscovering the same problem on 5.4 RC4. This is my first foray into the public world of FreeBSD discussion lists, and I am encouraged by the helpfulness of the response. Scott, the 5.3 kernel I had was a essentially a GENERIC release kernel, with about 100 options commented out. WITNESS and INVARIANTS are off by default, which I confirmed by looking through `sysctl -a`. However, I was curious to see what I would get if I switched them on, so I added these options and recompiled the kernel: options KDB options DDB options INVARIANTS options INVARIANT_SUPPORT options WITNESS options WITNESS_SKIPSPIN The result, below, has essentially the same user time (or just less, if that makes any sense), but tripled system time. The context switches are consistent with the one-per-10msec I saw before. Is there anything useful I can do while I still have the kernel debug options on? -e 172.29 real 67.53 user 103.07 sys 23376 maximum resident set size 659 average shared memory size 20805 average unshared data size 127 average unshared stack size 5402 page reclaims 0 page faults 0 swaps 0 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 0 voluntary context switches 17234 involuntary context switches
Ewan Todd wrote:>>>Whereas, the typical result for the new rig looked more like >>> >>> 105.36 real 71.10 user 33.41 sys >>> ... >>> 10548 involuntary context switches >>> >>> >> >>First of all, make sure that you have WITNESS and INVARIANTS off in your >>kernel. You might also want to recompile your kernel with the SMP >>option turned off. >> >>Scott > > > First of all, thanks to Mike Tancsa for suggesting 5.4 RC4 and to Pete > French for running the test independently on the higher spec machines > with 5.4 RC4 on them, confirming the system time thing, ruling out an > AMD problem, dissociating the system time result from the context > switching, and saving me the trouble of rediscovering the same problem > on 5.4 RC4. > > This is my first foray into the public world of FreeBSD discussion > lists, and I am encouraged by the helpfulness of the response. > > Scott, the 5.3 kernel I had was a essentially a GENERIC release > kernel, with about 100 options commented out. WITNESS and INVARIANTS > are off by default, which I confirmed by looking through `sysctl -a`. > However, I was curious to see what I would get if I switched them on, > so I added these options and recompiled the kernel: > > options KDB > options DDB > options INVARIANTS > options INVARIANT_SUPPORT > options WITNESS > options WITNESS_SKIPSPIN > > The result, below, has essentially the same user time (or just less, > if that makes any sense), but tripled system time. The context > switches are consistent with the one-per-10msec I saw before. Is > there anything useful I can do while I still have the kernel debug > options on? > > -e > > > 172.29 real 67.53 user 103.07 sys > 23376 maximum resident set size > 659 average shared memory size > 20805 average unshared data size > 127 average unshared stack size > 5402 page reclaims > 0 page faults > 0 swaps > 0 block input operations > 0 block output operations > 0 messages sent > 0 messages received > 0 signals received > 0 voluntary context switches > 17234 involuntary context switches > >5.3 ships with SMP turned on, which makes lock operations rather expensive on single-processor machines. 4.x does not have SMP turned on by default. Would you be able to re-run your test with SMP turned off? Scott
Scott Long wrote:> First of all, make sure that you have WITNESS and INVARIANTS off in your > kernel. You might also want to recompile your kernel with the SMP > option turned off.I can confirm this. I just rerun it on RELENG_5_4 as of yesterday and got 136.52 real 80.29 user 50.16 sys 23212 maximum resident set size 674 average shared memory size 20961 average unshared data size 128 average unshared stack size 5419 page reclaims 0 page faults 0 swaps 0 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 0 voluntary context switches 25738 involuntary context switches No debugging or SMP in kernel. -- Best regards, Alexander.
On Mon, May 9, 2005 1:06 pm, Scott Long said:> 5.3 ships with SMP turned on, which makes lock operations rather > expensive on single-processor machines. 4.x does not have SMP turned on by > default. Would you be able to re-run your test with SMP turned off?This is what i get on my system, which has debugging and smp off in the kernel. FreeBSD 6.0-CURRENT #0: Tue May 3 23:55:43 EDT 2005 root@fbsd.wettoast.net:/usr/obj/usr/src/sys/DP Timecounter "i8254" frequency 1193182 Hz quality 0 CPU: AMD Athlon(tm) Processor (1410.21-MHz 686-class CPU) Origin = "AuthenticAMD" Id = 0x644 Stepping = 4 Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR> AMD Features=0xc0440000<RSVD,AMIE,DSP,3DNow!> --- 76.89 real 49.33 user 22.87 sys 23116 maximum resident set size 686 average shared memory size 20795 average unshared data size 127 average unshared stack size 5380 page reclaims 0 page faults 0 swaps 0 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 1 voluntary context switches 10018 involuntary context switches --- As we can see, it is still spending a lot of time in system, and there are a lot of context switches being done.
On Mon, 2005-May-09 11:00:18 -0400, Ewan Todd wrote:>I have what I think is a serious performance issue with fbsd 5.3 >release. I've read about threading issues, and it seems to me that >that is what I'm looking at, but I'm not confident enough to rule out >that it might be a hardware issue, a kernel configuration issue, or >something to do with the python port.There does appear to be a problem in FreeBSD. Python is built with threading enabled by default, the threading libraries play with the signal mask and there have been extensive changes there. My suggestions on things you could check are: 1) Rebuild python with threading disabled (add '-DWITHOUT_THREADS' to the 'make' command line and see if that makes any difference 2) Re-write the sample program in a non-threaded language - eg C or perl and see if the high system time goes away. Unfortunately, I can't think of a solution at present. -- Peter Jeremy