On Sun, Jul 6, 2008 at 7:44 PM, John Hanks <griznog at gmail.com> wrote:
> Hello,
>
> I have several systems which I recently updated with
>
> yum -y update
>
> to all the latest packages. These systems use yum-priorities and use
> the CentOS (priority 1) EPEL (priority 5) and rpmforge (priority 10)
> repositories. After the updates, dhcpd stopped working with a SIGPIPE
> error which occurs shortly after it attempts to fork into the
> background. I worked around that problem by building a new server with
> no additional repos, only CentOS, and dhcpd works fine on that system.
> Since then I have found the problem, or similar problems with a few
> more applications. Here is what the tail of an strace of pbs_mom as it
> attempts to fork into the background:
>
> listen(5, 512) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6
> setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(6, {sa_family=AF_INET, sin_port=htons(15003),
> sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> listen(6, 512) = 0
> fcntl(4, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
> clone(Process 23938 attached (waiting for parent)
> Process 23938 resumed (parent 23937 ready)
> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x2aaaaad30db0) = 23938
> [pid 23937] exit_group(0) = ?
> getsockname(3, 0x7fff6b7728a0, [128]) = -1 ENOTSOCK (Socket
> operation on non-socket)
> fcntl(3, F_GETFD) = 0
> dup(3) = 7
> fcntl(7, F_SETFD, 0) = 0
> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
> close(3) = 0
> fcntl(8, F_GETFD) = 0
> dup2(8, 3) = 3
> fcntl(3, F_SETFD, 0) = 0
> close(8) = 0
> write(3, "\25\3\1\0\22\334\362\36\233\253\205\2633\323\322q\4\3T\rxK\210",
> 23) = -1 EPIPE (Broken pipe)
> --- SIGPIPE (Broken pipe) @ 0 (0) ---
> Process 23938 detached
>
>
> This is pretty much the same thing that happened to dhcpd. In both
> cases the applications work fine in debug mode when they don't
> attempt to fork, but quietly die when run normally. A third set of
> apps, wrappers for the client part of torque (pbs_mom) do this:
>
> stat("/usr/local/sbin/pbs_iff", {st_mode=S_IFREG|S_ISUID|0755,
> st_size=21412, ...}) = 0
> pipe([5, 6]) = 0
> clone(Process 24068 attached (waiting for parent)
> Process 24068 resumed (parent 24067 ready)
> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x2aaaaad31ce0) = 24068
> [pid 24067] close(6) = 0
> [pid 24067] fcntl(5, F_GETFL) = 0 (flags O_RDONLY)
> [pid 24067] read(5, <unfinished ...>
> [pid 24068] getsockname(3, {sa_family=AF_INET, sin_port=htons(41855),
> sin_addr=inet_addr("129.123.148.49")}, [1164321820984213520]) = 0
> [pid 24068] getpeername(3, {sa_family=AF_INET, sin_port=htons(636),
> sin_addr=inet_addr("129.123.20.92")}, [68719476752]) = 0
> [pid 24068] fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
> [pid 24068] dup(3) = 7
> [pid 24068] fcntl(7, F_SETFD, FD_CLOEXEC) = 0
> [pid 24068] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
> [pid 24068] close(3) = 0
> [pid 24068] fcntl(8, F_GETFD) = 0
> [pid 24068] dup2(8, 3) = 3
> [pid 24068] fcntl(3, F_SETFD, 0) = 0
> [pid 24068] close(8) = 0
> [pid 24068] write(3,
> "\25\3\1\0\22\346h\357n\r\17x\374B\312\217\374x\276\311\217\342%", 23)
> = -1 EPIPE (Broken pipe)
> [pid 24068] --- SIGPIPE (Broken pipe) @ 0 (0) ---
> Process 24068 detached
> <... read resumed> "", 4) = 0
> --- SIGCHLD (Child exited) @ 0 (0) ---
> close(5) = 0
> wait4(24068, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGPIPE}], 0, NULL)
> = 24068
> close(4) = 0
> write(2, "No Permission.\n", 15No Permission.
> ) = 15
> write(2, "qstat: cannot connect to server "..., 63qstat: cannot
> connect to server moab.hpc.usu.edu (errno=15007)
> ) = 63
> exit_group(-1) = ?
>
> Once again, the app dies after it attempts to fork into the
> background. There are other things running on these systems that can
> successfully fork and I have been unable to figure out any pattern,
> other than if I don't use additional repos then it doesn't seem to
> break. That may be coincidental though, I haven't repeated it enough
> yet to be certain.
>
> Any hints or suggestions would be appreciated. Unfortunately I noticed
> this after deciding it was "safe" to update *all* my machines and so
> I'm suffering through a lot of rebuilds/restores because of this.
>
> Thanks,
>
> jbh
>
Just found yet another system demonstrating pipe-related weirdness.
Here's the tail of an strace where this app (qsub, another part of
Torque) hangs after the SIGPIPE:
write(5, "\3\34\177\25\4\32", 6) = 6
write(5, "WINSIZE 36,137,822,504\0\0R(A\240:\0\0\0"..., 80) = 80
write(1, "qsub: job 7.jobs.hpc.usu.edu rea"..., 36qsub: job
7.jobs.hpc.usu.edu ready
) = 36
rt_sigaction(SIGINT, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGTERM, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGALRM, {SIG_IGN}, NULL, 8) = 0
rt_sigaction(SIGTSTP, {SIG_IGN}, NULL, 8) = 0
clone(Process 3149 attached (waiting for parent)
Process 3149 resumed (parent 3143 ready)
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x2b43da67f770) = 3149
[pid 3149] getsockname(4, <unfinished ...>
[pid 3143] rt_sigaction(SIGCHLD, {0x402c00, [], SA_RESTORER,
0x3aa08301b0}, NULL, 8) = 0
[pid 3143] fcntl(0, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
[pid 3143] read(0, <unfinished ...>
[pid 3149] <... getsockname resumed> {sa_family=AF_INET,
sin_port=htons(52700), sin_addr=inet_addr("129.123.148.50")}, [16]) = 0
[pid 3149] getpeername(4, {sa_family=AF_INET, sin_port=htons(636),
sin_addr=inet_addr("129.123.20.92")}, [68719476752]) = 0
[pid 3149] fcntl(4, F_GETFD) = 0x1 (flags FD_CLOEXEC)
[pid 3149] dup(4) = 6
[pid 3149] fcntl(6, F_SETFD, FD_CLOEXEC) = 0
[pid 3149] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
[pid 3149] close(4) = 0
[pid 3149] fcntl(7, F_GETFD) = 0
[pid 3149] dup2(7, 4) = 4
[pid 3149] fcntl(4, F_SETFD, 0) = 0
[pid 3149] close(7) = 0
[pid 3149] write(4,
"\25\3\1\0\22%\341U\3202\323i\207\240Z\220iTL\202\'\264\t", 23) = -1
EPIPE (Broken pipe)
[pid 3149] --- SIGPIPE (Broken pipe) @ 0 (0) ---
Process 3149 detached
<... read resumed> 0x7fffd0682aff, 1) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGPIPE}],
WNOHANG|WSTOPPED, NULL) = 3149
kill(3149, SIGTERM) = -1 ESRCH (No such process)
ioctl(0, SNDCTL_TMR_START or TCSETS, {B9600 opost isig icanon echo ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B9600 opost isig icanon echo ...}) = 0
exit_group(0) = ?
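
For what it's worth, the 23-byte buffer in each failing write() above
starts with the same five bytes, \25\3\1\0\22. Below is a minimal Python
sketch that decodes them, assuming (I have not confirmed this) that they
are a TLS-style record header: one byte of content type, two bytes of
version, and a 16-bit big-endian length. HEADER and parse_record_header
are names made up for this illustration, not anything from Torque or
dhcpd.

```python
import struct

# First five bytes of each buffer passed to the failing write() calls,
# transcribed from strace's octal escapes (\25 = 0x15, \22 = 0x12).
HEADER = b"\x15\x03\x01\x00\x12"


def parse_record_header(data):
    """Split a 5-byte header into (content type, (major, minor), length).

    Assumption: the layout is type (1 byte), version (2 bytes),
    length (2 bytes, network byte order).
    """
    content_type, major, minor, length = struct.unpack("!BBBH", data[:5])
    return content_type, (major, minor), length


content_type, version, length = parse_record_header(HEADER)
total = 5 + length  # header plus the payload length it announces
print(content_type, version, length, total)
```

Under that assumption the header announces an 18-byte payload, and
5 + 18 equals the 23 bytes each write() tried to send. The peer port in
the getpeername() lines is 636, so the descriptor being replaced in the
child may well have been an SSL connection, but that is a guess on my
part.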
jbh