We're trying to debug a problem: a server reboots spontaneously while one
user's large, multithreaded program is running. Sometimes it goes for hours
without rebooting; other times it reboots literally every 10 minutes. I've
run iostat and netstat, have top running, and am watching
tail -f /var/log/dmesg: *nada*. Nothing out of the ordinary.
One thing that's constant: as the system's coming back up, we see a segv
of pbs_mom (we're using torque for clustering). Every time, it saves
the core dump, and then a second or so later we get:
Jun 26 14:29:58 <servername> abrtd: Package 'torque-mom' isn't signed with proper key
Jun 26 14:29:58 <servername> abrtd: Corrupted or bad dump /var/spool/abrt/ccpp-2012-06-26-14:29:57-3086 (res:2), deleting
I've changed /etc/abrt/abrtd.conf to tell it to save cores from programs
not associated with packages (his program's not), and it still deletes the
dump; neither the man page nor googling has turned up anything.
So, does anyone know either why abrtd thinks the dump is corrupt, or how I
can make it *stop* deleting it?
       mark