dnetc is an open-source program from http://www.distributed.net/.  It
tries a brute-force approach to cracking RC4 puzzles and also computes
optimal Golomb rulers.  It starts up one process per CPU and runs at
nice 20 and is, for all intents and purposes, 100% compute bound.

Here is what happens on my system, running 9.0-PRERELEASE, with and
without dnetc running, with SCHED_ULE and SCHED_4BSD, when I run the
command:

time make buildkernel KERNCONF=WONDERLAND

(I get similar results on 8.x as well.)

SCHED_4BSD, dnetc not running:
1329.715u 123.739s 24:47.95 97.6%       6310+1987k 11233+11098io 419pf+0w

SCHED_4BSD, dnetc running:
1329.364u 115.158s 26:14.83 91.7%       6325+1987k 10912+11060io 393pf+0w

SCHED_ULE, dnetc not running:
1357.457u 121.526s 25:20.64 97.2%       6326+1990k 11234+11149io 419pf+0w

SCHED_ULE, dnetc running:
Still going after seven and a half hours of clock time, up to
compiling netgraph/bluetooth.  (Completed in another five minutes
after stopping dnetc so I could write this message in a reasonable
amount of time.)

Not everybody runs this sort of program, but there are plenty of
similar projects out there, and people who try to participate in
them will be mightily displeased with their FreeBSD systems when
they do.  Is there some case where SCHED_ULE exhibits significantly
better performance than SCHED_4BSD?  If not, I think SCHED_4BSD
should remain the default GENERIC configuration until this is fixed.

-- George Mitchell
09.12.2011 13:03, George Mitchell wrote:
> dnetc is an open-source program from http://www.distributed.net/.  It
> tries a brute-force approach to cracking RC4 puzzles and also computes
> optimal Golomb rulers.  It starts up one process per CPU and runs at
> nice 20 and is, for all intents and purposes, 100% compute bound.

nice 20 doesn't mean it should give time to just any other program.
Have you tried setting dnetc_idprio?

> Not everybody runs this sort of program, but there are plenty of
> similar projects out there, and people who try to participate in
> them will be mightily displeased with their FreeBSD systems when
> they do.  Is there some case where SCHED_ULE exhibits significantly
> better performance than SCHED_4BSD?  If not, I think SCHED_4BSD
> should remain the default GENERIC configuration until this is fixed.

Not fully right: boinc defaults to run at idprio 31, so this isn't an
issue there.  And yes, there are cases where SCHED_ULE shows much
better performance than SCHED_4BSD.  You have incidentally found a
rare misbehavior of SCHED_ULE, and I think it will be addressed.

-- 
Sphinx of black quartz judge my vow.
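[For reference, what the dnetc_idprio suggestion amounts to -- a sketch
of /etc/rc.conf, assuming the dnetc port's rc script honors the idprio
knob named above:

    dnetc_enable="YES"
    # 31 is the weakest claim on the CPU: run only when it is otherwise idle
    dnetc_idprio="31"

]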
2011/12/9 George Mitchell <george+freebsd@m5p.com>:
> dnetc is an open-source program from http://www.distributed.net/.  It
> tries a brute-force approach to cracking RC4 puzzles and also computes
> optimal Golomb rulers.  It starts up one process per CPU and runs at
> nice 20 and is, for all intents and purposes, 100% compute bound.
>
> Here is what happens on my system, running 9.0-PRERELEASE, with and
> without dnetc running, with SCHED_ULE and SCHED_4BSD, when I run the
> command:
>
> time make buildkernel KERNCONF=WONDERLAND
>
> (I get similar results on 8.x as well.)
>
> SCHED_4BSD, dnetc not running:
> 1329.715u 123.739s 24:47.95 97.6%       6310+1987k 11233+11098io 419pf+0w
>
> SCHED_4BSD, dnetc running:
> 1329.364u 115.158s 26:14.83 91.7%       6325+1987k 10912+11060io 393pf+0w
>
> SCHED_ULE, dnetc not running:
> 1357.457u 121.526s 25:20.64 97.2%       6326+1990k 11234+11149io 419pf+0w
>
> SCHED_ULE, dnetc running:
> Still going after seven and a half hours of clock time, up to
> compiling netgraph/bluetooth.  (Completed in another five minutes
> after stopping dnetc so I could write this message in a reasonable
> amount of time.)
>
> Not everybody runs this sort of program, but there are plenty of
> similar projects out there, and people who try to participate in
> them will be mightily displeased with their FreeBSD systems when
> they do.  Is there some case where SCHED_ULE exhibits significantly
> better performance than SCHED_4BSD?  If not, I think SCHED_4BSD
> should remain the default GENERIC configuration until this is fixed.

Hi George,
are you interested in exploring the SCHED_ULE and dnetc case further?
More precisely, I'd be interested in KTR traces.  To be even more
precise:

With a completely stable GENERIC configuration (or otherwise please
post your kernel config) please add the following:

options KTR
options KTR_ENTRIES=262144
options KTR_COMPILE=(KTR_SCHED)
options KTR_MASK=(KTR_SCHED)

While you are in the middle of the slow-down (so once it is well
established) please do:

# sysctl debug.ktr.cpumask=""

In the end go with:

# ktrdump -ctf > ktr-ule-problem.out

and send the file to this mailing list.

Thanks,
Attilio

-- 
Peace can only be achieved by understanding - A. Einstein
On Fri, Dec 9, 2011 at 6:03 AM, George Mitchell <george+freebsd@m5p.com> wrote:
> dnetc is an open-source program from http://www.distributed.net/.  It
> tries a brute-force approach to cracking RC4 puzzles and also computes
> optimal Golomb rulers.  It starts up one process per CPU and runs at
> nice 20 and is, for all intents and purposes, 100% compute bound.

Try idprio as well (at the moment it requires root to use, though).
nice only means "play nice".  idprio means "only run when nothing else
wants to run".

-- 
Eitan Adler
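[For reference, idprio(1) from the base system can also be applied by
hand -- a minimal sketch, assuming dnetc is on root's PATH:

    # idle priorities range from 0 (strongest) to 31 (weakest); an
    # idprio process runs only when no timesharing thread is runnable
    idprio 31 dnetc

]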
2011/12/10 Eitan Adler <lists@eitanadler.com>:
> On Fri, Dec 9, 2011 at 8:15 PM, George Mitchell <george@m5p.com> wrote:
>> Hope the attached helps.                        -- George Mitchell
>
> You attached dmesg, not a patch.

This is what is needed for a schedgraph analysis, along with KTR
points collection.

Attilio

-- 
Peace can only be achieved by understanding - A. Einstein
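[For reference, the schedgraph workflow being referred to -- a sketch,
assuming KTR was compiled into the kernel as described in the earlier
message:

    # dump the scheduler trace from the running kernel
    ktrdump -ctf > ktr-ule-problem.out

    # visualize it with the tool shipped in the source tree
    python /usr/src/tools/sched/schedgraph.py ktr-ule-problem.out

]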
> Not fully right: boinc defaults to run at idprio 31, so this isn't an
> issue there.  And yes, there are cases where SCHED_ULE shows much
> better performance than SCHED_4BSD. [...]

Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD?  Whenever the subject comes up, it is
mentioned that SCHED_ULE has better performance on boxes with ncpu > 2.
But in the end I see contradictory statements here: people complain
about poor performance (especially in scientific environments), while
others counter that this is not the case.

Within our department, we developed a highly scalable code for
planetary science purposes on imagery.  It utilizes GPUs via OpenCL if
present; otherwise it grabs as many cores as it can.  By the end of
this year I'll get a new desktop box based on Intel's new Sandy
Bridge-E architecture with plenty of memory.  If the colleague who
developed the code is willing to perform some benchmarks on the same
hardware platform, we'll benchmark both FreeBSD 9.0/10.0 and the most
recent Suse.  For FreeBSD I also intend to look at performance with
both of the available schedulers.

O.
On Tue, Dec 13, 2011 at 3:39 PM, Ivan Klymenko <fidaj@ukr.net> wrote:
> On Wed, 14 Dec 2011 00:04:42 +0100
> Jilles Tjoelker <jilles@stack.nl> wrote:
>
>> On Tue, Dec 13, 2011 at 10:40:48AM +0200, Ivan Klymenko wrote:
>> > If the algorithm ULE does not contain problems - it means the
>> > problem has Core2Duo, or in a piece of code that uses the ULE
>> > scheduler. I already wrote in a mailing list that specifically in
>> > my case (Core2Duo) partially helps the following patch:
>> > --- sched_ule.c.orig        2011-11-24 18:11:48.000000000 +0200
>> > +++ sched_ule.c     2011-12-10 22:47:08.000000000 +0200
>> > @@ -794,7 +794,8 @@
>> >      * 1.5 * balance_interval.
>> >      */
>> >     balance_ticks = max(balance_interval / 2, 1);
>> > -   balance_ticks += random() % balance_interval;
>> > +// balance_ticks += random() % balance_interval;
>> > +   balance_ticks += ((int)random()) % balance_interval;
>> >     if (smp_started == 0 || rebalance == 0)
>> >             return;
>> >     tdq = TDQ_SELF();
>>
>> This avoids a 64-bit division on 64-bit platforms but seems to have no
>> effect otherwise. Because this function is not called very often, the
>> change seems unlikely to help.
>
> Yes, this section does not apply to this problem :)
> I just posted the latest patch, which I am using now...
>
>> > @@ -2118,13 +2119,21 @@
>> >     struct td_sched *ts;
>> >
>> >     THREAD_LOCK_ASSERT(td, MA_OWNED);
>> > +   if (td->td_pri_class & PRI_FIFO_BIT)
>> > +           return;
>> > +   ts = td->td_sched;
>> > +   /*
>> > +    * We used up one time slice.
>> > +    */
>> > +   if (--ts->ts_slice > 0)
>> > +           return;
>>
>> This skips most of the periodic functionality (long term load
>> balancer, saving switch count (?), insert index (?), interactivity
>> score update for long running thread) if the thread is not going to
>> be rescheduled right now.
>>
>> It looks wrong but it is a data point if it helps your workload.
>
> Yes, I did it to delay, for as long as possible, the execution of the
> code in this section:
> ...
> #ifdef SMP
>        /*
>         * We run the long term load balancer infrequently on the first cpu.
>         */
>        if (balance_tdq == tdq) {
>                if (balance_ticks && --balance_ticks == 0)
>                        sched_balance();
>        }
> #endif
> ...
>
>> >     tdq = TDQ_SELF();
>> >  #ifdef SMP
>> >     /*
>> >      * We run the long term load balancer infrequently on the
>> > first cpu. */
>> > -   if (balance_tdq == tdq) {
>> > -           if (balance_ticks && --balance_ticks == 0)
>> > +   if (balance_ticks && --balance_ticks == 0) {
>> > +           if (balance_tdq == tdq)
>> >                     sched_balance();
>> >     }
>> >  #endif
>>
>> The main effect of this appears to be to disable the long term load
>> balancer completely after some time. At some point, a CPU other than
>> the first CPU (which uses balance_tdq) will set balance_ticks = 0, and
>> sched_balance() will never be called again.
>
> That is, for the same reason as above in the text...
>
>> It also introduces a hypothetical race condition because the access to
>> balance_ticks is no longer restricted to one CPU under a spinlock.
>>
>> If the long term load balancer may be causing trouble, try setting
>> kern.sched.balance_interval to a higher value with unpatched code.
>
> I checked that in the first place - but it did not help fix the
> situation...
>
> It gives the impression that rebalancing malfunctions...
> It seems that a thread is handed back to the same core that is
> already loaded, and so on.
> Perhaps this is a consequence of an incorrect definition of the CPU
> topology?
>
>> > @@ -2144,9 +2153,6 @@
>> >             if (TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
>> >                     tdq->tdq_ridx = tdq->tdq_idx;
>> >     }
>> > -   ts = td->td_sched;
>> > -   if (td->td_pri_class & PRI_FIFO_BIT)
>> > -           return;
>> >     if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE) {
>> >             /*
>> >              * We used a tick; charge it to the thread so
>> > @@ -2157,11 +2163,6 @@
>> >             sched_priority(td);
>> >     }
>> >     /*
>> > -    * We used up one time slice.
>> > -    */
>> > -   if (--ts->ts_slice > 0)
>> > -           return;
>> > -   /*
>> >      * We're out of time, force a requeue at userret().
>> >      */
>> >     ts->ts_slice = sched_slice;
>>
>> > and refusal to use options FULL_PREEMPTION
>> > But no one has replied to my letter saying whether my patch helps
>> > or not in the case of Core2Duo...
>> > There is a suspicion that the problems stem from the sections of
>> > code associated with the SMP...
>> > Maybe I'm doing something wrong, but I want to help in solving this
>> > problem ...

Has anyone experiencing problems tried to set sysctl
kern.sched.steal_thresh=1 ?

I don't remember what our specific problem at $WORK was, perhaps it
was just interrupt threads not getting serviced fast enough, but we've
hard-coded this to 1 and removed the code that sets it in
sched_initticks().  The same effect should be had by setting the
sysctl after a box is up.

Thanks,
matthew
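[For reference, trying matthew's suggestion requires nothing beyond
sysctl(8):

    # show the current value (ULE derives it from the CPU count at boot)
    sysctl kern.sched.steal_thresh

    # let an idle CPU steal a thread from a neighbor's run queue even
    # when only one thread is queued there
    sysctl kern.sched.steal_thresh=1

Adding the same line to /etc/sysctl.conf makes it persistent across
reboots.]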
Michael Larabel, 2011-Dec-15 14:28 UTC
Subject: Benchmark (Phoronix): FreeBSD 9.0-RC2 vs. Oracle Linux 6.1 Server
On 12/15/2011 08:26 AM, Sergey Matveychuk wrote:
> 15.12.2011 17:36, Michael Larabel wrote:
>> On 12/15/2011 07:25 AM, Stefan Esser wrote:
>>> Am 15.12.2011 11:10, schrieb Michael Larabel:
>>>> No, the same hardware was used for each OS.
>>>>
>>>> In terms of the software, the stock software stack for each OS was
>>>> used.
>>> Just curious: Why did you choose ZFS on FreeBSD, while UFS2 (with
>>> journaling enabled) should be an obvious choice since it is more
>>> similar in concept to ext4 and since that is what most FreeBSD users
>>> will use with FreeBSD?
>>
>> I was running some ZFS vs. UFS tests as well and this happened to have
>> ZFS on when I was running some other tests.
>
> Can we look at the tests?
> My opinion is ZFS without tuning is much slower than UFS2.

http://www.phoronix.com/scan.php?page=news_item&px=MTAyNjg
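[For reference, the kind of minimal ZFS tuning alluded to above -- a
sketch of /boot/loader.conf settings commonly adjusted at the time; the
values are illustrative, not recommendations:

    # cap the ARC so it does not crowd out the benchmark's working set
    vfs.zfs.arc_max="4G"

    # file-level prefetch often hurts more than it helps on
    # small-memory machines
    vfs.zfs.prefetch_disable="1"

]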
On Thu, Dec 15, 2011 at 10:32 AM, Steven Hartland
<killing@multiplay.co.uk> wrote:
> Lars Engels wrote:
>> 9.0 ships with gcc and clang which both need to be compiled, 8.2 only
>> has gcc.
>
> Ahh, any reason we need both, and is it possible to disable clang?

man src.conf

Add WITHOUT_CLANG=yes to /etc/src.conf.

-- 
Eitan Adler
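[For reference, a minimal sketch -- this assumes /etc/src.conf does not
already exist; the file is read by make(1) during buildworld and
buildkernel:

    # /etc/src.conf -- skip building clang; gcc remains the base compiler
    WITHOUT_CLANG=yes

Note this only affects source builds; it does not remove an already
installed clang.]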
2011/12/9 George Mitchell <george+freebsd@m5p.com>:
> dnetc is an open-source program from http://www.distributed.net/.  It
> tries a brute-force approach to cracking RC4 puzzles and also computes
> optimal Golomb rulers.  It starts up one process per CPU and runs at
> nice 20 and is, for all intents and purposes, 100% compute bound.

[Posting on the first message of the thread]

I basically went through all the e-mails just sent, identified four
real reports we could work on, and summarized them in the attached
Excel file.

I'd like George, Steve, Doug, Andrey and Mike to review the data there
and add more, if they want, or make more important clarifications, in
particular about the presence (or absence) of Xorg in their workload.
I've read a couple of messages in the thread pointing the finger at
Xorg as excessively CPU-intensive, and I think they are right; we
might try to find a solution for that at some point, but it is really
a very edge case.

George's and Steve's cases, instead, look very different from this,
and I want to analyze them in detail.  George already provided
schedgraph traces; for the others, if they cannot provide traces
directly, I'd really appreciate at least a detailed description of the
workload so that I get a chance to reproduce it.

If someone else thinks he has a specific problem that is not
characterized by one of the cases above, please let me know and I will
put it in the chart.

Thanks for the hard work you guys put into pointing out ULE's
problems; I think we will get to the bottom of this if we keep sharing
thoughts and reports.

Attilio

-- 
Peace can only be achieved by understanding - A. Einstein
On Mon Dec 19 11, Nathan Whitehorn wrote:
> On 12/18/11 04:34, Adrian Chadd wrote:
>> The trouble is that there's lots of anecdotal evidence, but noone's
>> really gone digging deep into _their_ example of why it's broken. The
>> developers who know this stuff don't see anything wrong. That hints to
>> me it may be something a little more creepy - as an example, the
>> interplay between netisr/swi/taskqueue/callbacks and such. It may be
>> that something is being starved that isn't obviously obvious. It's
>> just a stab in the dark, but it sounds somewhat plausible based on
>> what I've seen ULE do in my network throughput hacking.
>>
>> I applaud reppie for trying to make it as easy as possible for people
>> to use KTR to provide scheduler traces for him to go digging with, so
>> please, if you have these issues and you can absolutely reproduce
>> them, please follow his instructions and work with him to get him
>> what he needs.
>
> The thing I've seen is that ULE is substantially more enthusiastic
> about migrating processes between cores than 4BSD. Often, this is a
> good thing, but it can increase the rate of cache misses, hurting
> performance for cache-bound processes (I see this particularly in
> HPC-type scientific workloads). It might be interesting to add some
> kind of tunable here.

Does r228718 have any impact regarding this behaviour?

cheers.
alex

> Another more interesting and slightly longer-term possibility, if
> someone wants a project, would be to integrate scheduling decisions
> with hwpmc counters, to accumulate statistics on cache hits at each
> context switch and preferentially keep processes with a high
> hits/misses ratio on the same thread/cache domain relative to
> processes with a low one.
> -Nathan
>
> P.S. The other thing that could be very interesting from a research
> and scheduling standpoint would be to integrate heterogeneous SMP
> support into the operating system, with a FreeBSD-4 "Application
> Processor" syscall model. We seem to be going down the road where
> GPGPU computing has MMUs, timer interrupts, IPIs, etc. (the next AMD
> Fusions, IBM Cell). This is something that no operating system
> currently supports well, and would be a place for BSD to shine. If
> anyone has a free graduate student...
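[For reference, one stopgap for cache-bound HPC jobs under either
scheduler is explicit pinning with cpuset(1) from the base system -- a
sketch, where ./solver stands in for the compute binary:

    # start the job restricted to CPUs 0-3, so the scheduler cannot
    # migrate it off that set and its cache working set stays warm
    cpuset -l 0-3 ./solver

    # or pin an already-running process (pid 12345 is illustrative)
    cpuset -l 4 -p 12345

]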