On 12/03/15 00:38, John Baldwin wrote:
>>> It sounds like this issue might be the one fixed in r272566: if the
>>> KERN_PROC_ALL sysctl is read with an insufficiently large buffer, an
>>> sbuf error return value could bubble up and be treated as ERESTART,
>>> resulting in a loop.
>>>
>>> This can be confirmed with something like
>>>
>>>   dtrace -n 'syscall:::entry/pid == $target/{@[probefunc] = count();} tick-3s {exit(0);}' -p <pid of looping proc>
>>>
>>> If the output consists solely of __sysctl, this bug is likely the
>>> culprit.
>>
>> Unfortunately, I accidentally killed fstat this morning before I could
>> do any further debug.
>>
>> I ran truss -p on it yesterday and it was spinning solely on __sysctl.
>>
>> I'll try compiling with debug symbols in case it happens again. I
>> haven't been able to reproduce the problem in a reasonable time frame
>> so it could be days or weeks before we see it happen again.
>
> The truss output is consistent with Mark's suggestion, so I would try
> his suggested fix of 272566.

I patched the 10.1 kernel with r272566 and it appears to have fixed the
issue. Is this patch likely to be MFCed back to 10-stable?

Our RC script forks off about 200 processes when starting our software,
and I wrote a small script to repeatedly stop/start the software, which
fairly reliably reproduces the issue about 1 in 10 times. I've been
running the script with the patched kernel for an hour now and I haven't
seen the issue appear.

Thanks for your help.

-Nick

--
Founder, CTO
www.akips.com

On Thu, Mar 12, 2015 at 02:05:32PM +1000, Nick Frampton wrote:
> On 12/03/15 00:38, John Baldwin wrote:
>>>> It sounds like this issue might be the one fixed in r272566: if the
>>>> KERN_PROC_ALL sysctl is read with an insufficiently large buffer, an
>>>> sbuf error return value could bubble up and be treated as ERESTART,
>>>> resulting in a loop.
>>>>
>>>> This can be confirmed with something like
>>>>
>>>>   dtrace -n 'syscall:::entry/pid == $target/{@[probefunc] = count();} tick-3s {exit(0);}' -p <pid of looping proc>
>>>>
>>>> If the output consists solely of __sysctl, this bug is likely the
>>>> culprit.
>>>
>>> Unfortunately, I accidentally killed fstat this morning before I could
>>> do any further debug.
>>>
>>> I ran truss -p on it yesterday and it was spinning solely on __sysctl.
>>>
>>> I'll try compiling with debug symbols in case it happens again. I
>>> haven't been able to reproduce the problem in a reasonable time frame
>>> so it could be days or weeks before we see it happen again.
>>
>> The truss output is consistent with Mark's suggestion, so I would try
>> his suggested fix of 272566.
>
> I patched the 10.1 kernel with r272566 and it appears to have fixed the
> issue. Is this patch likely to be MFCed back to 10-stable?

I can't see any reason it shouldn't be, and there was an MFC reminder in
the commit log entry for that revision. I've cc'ed kib@, who might have a
reason.

> Our RC script forks off about 200 processes when starting our software,
> and I wrote a small script to repeatedly stop/start the software, which
> fairly reliably reproduces the issue about 1 in 10 times. I've been
> running the script with the patched kernel for an hour now and I haven't
> seen the issue appear.

On Wed, Mar 11, 2015 at 9:05 PM, Nick Frampton <nick.frampton at akips.com> wrote:
> On 12/03/15 00:38, John Baldwin wrote:
>>>> It sounds like this issue might be the one fixed in r272566: if the
>>>> KERN_PROC_ALL sysctl is read with an insufficiently large buffer, an
>>>> sbuf error return value could bubble up and be treated as ERESTART,
>>>> resulting in a loop.
>>>>
>>>> This can be confirmed with something like
>>>>
>>>>   dtrace -n 'syscall:::entry/pid == $target/{@[probefunc] = count();} tick-3s {exit(0);}' -p <pid of looping proc>
>>>>
>>>> If the output consists solely of __sysctl, this bug is likely the
>>>> culprit.
>>>
>>> Unfortunately, I accidentally killed fstat this morning before I could
>>> do any further debug.
>>>
>>> I ran truss -p on it yesterday and it was spinning solely on __sysctl.
>>>
>>> I'll try compiling with debug symbols in case it happens again. I
>>> haven't been able to reproduce the problem in a reasonable time frame
>>> so it could be days or weeks before we see it happen again.
>>
>> The truss output is consistent with Mark's suggestion, so I would try
>> his suggested fix of 272566.
>
> I patched the 10.1 kernel with r272566 and it appears to have fixed the
> issue. Is this patch likely to be MFCed back to 10-stable?

A followup to the thread: the change has been merged to stable/10 in
r279926.