Hello,
I may have found something weird while running FreeBSD 9.2-RC3
(#0 r255393, ZFS-only setup).
Quick history of the problem:
- Lately, using a very recent -STABLE, the host would hang randomly while
building ports with poudriere (-J2) and using X11, without producing a
core dump (solid deadlock, apparently). It works perfectly when using the
console only, and it can run a large build overnight without hanging.
Being on X11 I could not find out what was happening on the console;
this desktop PC does not have a proper serial port, so there's not much I
can see. In any case it does not reboot automatically.
- To rule out recent -STABLE changes I moved to 9.2-RC3 using SVN, but the
system kept hanging under the same conditions.
- I also enabled DDB to get a minidump, but still I could only get solid
locks.
- I downgraded the nvidia-driver port, just in case it has something to do
with the crashes, but the crashes continued.
- I downgraded to a known-safe -STABLE from July, then June, but the host
would still crash. The very weird thing is that I have always been
building stuff while using X11, and it never hung. After downgrading
both the OS and nvidia-driver I effectively got back a configuration that
did not hang at the time, but the issue persisted.
- However, this time I managed to get a minidump from the old -STABLE. I
saved it here:
http://olgeni.olgeni.com/~olgeni/core.txt.0
- After seeing the reference to kqueue, I remembered another thing that
changed when the crashes started: gio-fam-backend went away, and glib20
uses kqueue (r324037).
- I tried the same workload while using X11 with openbox only, and it
worked fine.
- Then I went back to GNOME, but made sure that anything related to gvfsd
was periodically killed by a script, and the system returned to normal
(i.e. flawless builds).
- I remember that the gamin implementation used to open and poll a lot of
files, even files that were not used by the X11 environment or Nautilus
specifically, and the gamin daemon could steal a good 5% of CPU for
polling; restarting it brought it back to 0%.
- Not sure if it is related in any way, but running a standard
"buildworld" does not crash the host. The only difference that I could
think of is that poudriere uses jails.
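
For anyone who wants to reproduce the gvfsd workaround mentioned above, it
can be sketched roughly as follows. This is only an illustration, not the
exact script: it assumes pkill(1) is in PATH, and the pattern "gvfsd", the
30-second interval and all function names are made up for the example.

```python
import subprocess
import time

PATTERN = "gvfsd"       # assumption: matches gvfsd and its gvfsd-* helpers
INTERVAL_SECONDS = 30   # assumption: arbitrary polling interval


def build_kill_command(pattern=PATTERN):
    """Return the pkill(1) argv that signals processes whose name matches."""
    return ["pkill", pattern]


def kill_once(pattern=PATTERN, run=subprocess.call):
    """Run pkill once; True if it reported at least one matching process."""
    # pkill exits with 0 when one or more processes matched, 1 when none did.
    return run(build_kill_command(pattern)) == 0


def watchdog(rounds=None, interval=INTERVAL_SECONDS,
             run=subprocess.call, sleep=time.sleep):
    """Kill matching processes every `interval` seconds.

    Loops forever when `rounds` is None; otherwise stops after `rounds`
    iterations and returns how many of them actually killed something.
    """
    hits = 0
    done = 0
    while rounds is None or done < rounds:
        if kill_once(run=run):
            hits += 1
        sleep(interval)
        done += 1
    return hits
```

Calling watchdog() with no arguments keeps running until interrupted; the
`run` and `sleep` parameters exist only so the loop can be exercised
without actually killing anything.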
Unfortunately I'm not able to get a minidump for the latest RC, but at this
point I suspect that something is going on with glib20 and kqueue on both
-STABLE and -RC.
If anybody has any ideas, I can test them easily, as it usually takes only
a few minutes to hang everything.
--
jimmy