thr3ads.net - freebsd stable - Deadlock in state 'sysctl lock' [Feb 2007]

Rink Springer

2007-Feb-20 11:43 UTC

Deadlock in state 'sysctl lock'

Hi people,

At work, one of our SpamAssassin/ClamAV filtering machines just entered
a deadlock state:

FreeBSD/i386 (xxx.qsp.nl) (cuad0)

login: root
load: 0.00  cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
load: 0.00  cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
load: 0.00  cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
load: 0.00  cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
load: 0.00  cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k

After inspection, I believe the following code in
kern/kern_sysctl.c:userland_sysctl() is the culprit:

        SYSCTL_LOCK();

        do {
                req.oldidx = 0;
                req.newidx = 0;
                error = sysctl_root(0, name, namelen, &req);
        } while (error == EAGAIN);

        if (req.lock == REQ_WIRED && req.validlen > 0)
                vsunlock(req.oldptr, req.validlen);

        SYSCTL_UNLOCK();

Clearly, should sysctl_root() always return EAGAIN, this will cause a
serious deadlock condition. It appears this is possible.

The only plausible source of a sysctl handler returning EAGAIN seems to
be kern/kern_proc.c:sysctl_out_proc(). However, that code returns ESRCH
if the process could not be found in the first place. Since sysctl_root()
calls the complete handler function on every iteration, each retry does
a fresh pfind() and should return ESRCH if the lookup fails, rather than
the EAGAIN it returns later on in the code path.

The machine is an SMP machine running 6.0-STABLE from 30-Mar-2006. No
debugging options are compiled into the kernel, as the machine carries
quite some load. The only console messages were a lot of 'calcru' messages.

Any help is very much appreciated. For now, I'd like to propose a change
to kern/kern_sysctl.c:userland_sysctl() to ensure it never keeps looping
on EAGAIN (preferably, it should trigger a panic, or at least a KASSERT,
should such a condition occur). I know this is a band-aid for a problem
we don't really understand yet, but it may ease debugging later on,
especially as it will help us understand where exactly things go wrong.
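
To make the proposal concrete, here is a minimal userland sketch of a
bounded retry loop. It is not the kernel code: bounded_retry(),
MAX_EAGAIN_RETRIES and the two toy handlers are hypothetical names made
up for illustration, and the cap of 100 is an arbitrary assumption. In
the kernel, the give-up case is where a KASSERT or panic could go.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical cap; an assumption, not a value taken from the kernel. */
#define MAX_EAGAIN_RETRIES 100

/* Stand-in for a sysctl handler: returns 0 or an errno value. */
typedef int (*handler_fn)(void *arg);

/*
 * Call the handler until it returns something other than EAGAIN, but
 * give up after MAX_EAGAIN_RETRIES retries instead of spinning forever.
 * Returns the handler's last result; *retries_out (if non-NULL) receives
 * the number of EAGAIN results that were seen.
 */
static int
bounded_retry(handler_fn handler, void *arg, int *retries_out)
{
        int error, retries = 0;

        assert(handler != NULL);
        do {
                error = handler(arg);
        } while (error == EAGAIN && ++retries < MAX_EAGAIN_RETRIES);

        if (retries_out != NULL)
                *retries_out = retries;
        return (error);
}

/* Toy handler that always asks to be retried: the suspected runaway case. */
static int
always_eagain(void *arg)
{
        (void)arg;
        return (EAGAIN);
}

/* Toy handler that asks for a retry *countdown times, then succeeds. */
static int
eagain_then_ok(void *arg)
{
        int *countdown = arg;

        if (*countdown > 0) {
                (*countdown)--;
                return (EAGAIN);
        }
        return (0);
}
```

With always_eagain(), bounded_retry() stops once the cap is reached and
hands EAGAIN back to the caller, so a lock taken around the loop would
still be released. With eagain_then_ok(), it behaves like the existing
loop and returns 0 once the handler stops asking for a retry.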

Any comments? It looks to me like this deadlock is quite rare (in fact,
I've never seen it before), but I believe it is serious enough to be
addressed, even with such a band-aid, until the real solution is presented
by someone who knows the sysctl internals better than I do.

Thanks,

-- 
Rink P.W. Springer                                - http://rink.nu
"It is such a quiet thing, to fall. But yet a far more terrible thing,
 to admit it."                                    - Darth Traya

Guy Helmer

2007-Feb-22 22:21 UTC

Deadlock in state 'sysctl lock'

Rink Springer wrote:
> Hi people,
>
> At work, one of our SpamAssassin/ClamAV filtering machines just entered
> a deadlock state:
>
> FreeBSD/i386 (xxx.qsp.nl) (cuad0)
>
> login: root
> load: 0.00  cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
> load: 0.00  cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
> load: 0.00  cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
> load: 0.00  cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
> load: 0.00  cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
>
> After inspection, I believe the following code in
> kern/kern_sysctl.c:userland_sysctl() is the culprit:
>
>         SYSCTL_LOCK();
>
>         do {
>                 req.oldidx = 0;
>                 req.newidx = 0;
>                 error = sysctl_root(0, name, namelen, &req);
>         } while (error == EAGAIN);
>
>         if (req.lock == REQ_WIRED && req.validlen > 0)
>                 vsunlock(req.oldptr, req.validlen);
>
>         SYSCTL_UNLOCK();
>
> Clearly, should sysctl_root() always return EAGAIN, this will cause a
> serious deadlock condition. It appears this is possible.
>
> The only plausible reference to sysctl's returning EAGAIN seems to be in
> kern/kern_proc.c:sysctl_out_proc(). However, this code returns ESRCH
> if the process couldn't have been found in the fast place, and since the
> complete handler function will be called by sysctl_root() every
> iteration, and thus will do a pfind() and return ESRCH if it failed and
> not EAGAIN as it will later on in the code path.
>
> The machine is a 6.0-STABLE SMP machine of 30-Mar-2006. No debugging
> options are in the kernel as the machine has quite some load. The only
> console messages were a lot of 'calcru' messages.
>
> Any help is very much appreciated. For now, I'd like to propose a change
> to kern/kern_sysctl.c:userland_sysctl(), to ensure this will never keep
> looping on EAGAIN states (preferably, it should trigger a panic or at
> least a KASSERT should such a condition occur). I know this is a
> bandaid for a problem we don't really quite understand yet, but this may
> ease debugging later on (especially as it will help us understand where
> exactly it is going bad)
>
> Any comments? It looks to me this deadlock is quite rare (in fact, I've
> never seen it before), but I believe it is serious enough to be
> addressed, even with such a bandaid until the real solution is presented
> by someone who knows the sysctl internals better than I do.
>

Interesting.  Twice I have had a 6.2 system stuck where sendmail was
holding the sysctl lock while another process was holding the proctree
and/or allproc lock, if I remember correctly.
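
A toy model of the lock-order reversal this would suggest, assuming (it
is not confirmed above) that each process was waiting for the lock the
other one held. The toy_lock type and the toy_trylock() and both_stuck()
helpers are hypothetical and are not kernel primitives; the point is only
that two threads taking the same pair of locks in opposite order can each
end up waiting on the other forever.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock: owner is a thread id, or 0 when free. Not a real mutex. */
struct toy_lock {
        int owner;
};

/* Try to take the lock for thread tid; true on success, false if held. */
static bool
toy_trylock(struct toy_lock *l, int tid)
{
        assert(l != NULL && tid != 0);
        if (l->owner != 0)
                return (false);
        l->owner = tid;
        return (true);
}

/*
 * Thread 1 wants first1 then second1; thread 2 wants first2 then second2.
 * Both attempt their first lock before either attempts its second.
 * Returns true if each thread holds its first lock and cannot get its
 * second one, i.e. neither can ever make progress.
 */
static bool
both_stuck(struct toy_lock *first1, struct toy_lock *second1,
    struct toy_lock *first2, struct toy_lock *second2)
{
        bool t1_has_first = toy_trylock(first1, 1);
        bool t2_has_first = toy_trylock(first2, 2);
        bool t1_blocked = t1_has_first && !toy_trylock(second1, 1);
        bool t2_blocked = t2_has_first && !toy_trylock(second2, 2);

        return (t1_blocked && t2_blocked);
}
```

Calling both_stuck() with the two locks in opposite order for the two
threads reports a deadlock; calling it with the same order for both does
not, since the second thread simply waits for the first to finish.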

Guy
