thr3ads.net - CentOS - [CentOS] another bizarre thing... [Aug 2019]

Young, Gregory

2019-Aug-08 17:06 UTC

[CentOS] another bizarre thing...

Is this on both EL6 and EL7? If only EL7, control groups could be causing the
issue. The idea of cgroups is to prevent stray leftover ("zombie") processes,
but if you need your program to spawn another process and then restart itself
while the other process continues to run, you need to launch that process in a
different control group, or the shutdown of the parent process will also kill
the child. In my case, we have an upgrade script which needs to get called and
then shut down the calling process in order to upgrade it. For example:

# Clear any errors in the upgrade control group.
/bin/systemctl reset-failed upgrade-trigger

# Launch the upgrader in its own control group.
/bin/systemd-run --unit=upgrade-trigger --slice=upgrade-trigger \
    /bin/bash /opt/myapp/Upgrade.sh "$1" "$2"

If we don't do this, the upgrade fails because the upgrader gets terminated
when the parent application is shut down.

Gregory Young 

-----Original Message-----
From: CentOS <centos-bounces at centos.org> On Behalf Of Fred Smith
Sent: August 7, 2019 1:39 PM
To: centos at centos.org
Subject: Re: [CentOS] another bizarre thing...

On Mon, Aug 05, 2019 at 08:57:45PM -0400, Fred Smith wrote:
> Hi all!
>
> I'm stuck on something really bizarre that is happening to a product I
> "own" at work. It's a C program, built on CentOS, runs on CentOS or
> RHEL, has been in circulation since the early 00's, and is in use at
> hundreds of sites.
>
> Recently, at multiple customer sites, it has started just going away:
> no core file (yes, ulimit is configured), nothing in any of its
> (several) log files. It's just gone.
>
> Running it under strace until it dies reveals that every thread has
> been given a SIGKILL.
>
> How does one figure out who delivered a SIGKILL? For other, non-fatal
> signals it is possible to glean the PID of the sending process in a
> signal handler, but obviously you can't do that for SIGKILL because
> the app doesn't survive the signal.
>
> I'm grasping at straws here, and am open to almost any kind of
> suggestion that can be followed up (as compared to "beats me", which is
> where I am now).
OK, more information.

Found a recipe to cause systemtap to emit a line of text identifying the sender
of the SIGKILL.

probe signal.send {
  if (sig_name == "SIGKILL")
    printf("%s was sent to %s (pid:%d) by %s uid:%d\n",
           sig_name, pid_name, sig_pid, execname(), uid())
}

Unfortunately, it says the program is killing itself:

	SIGKILL was sent to myprog (pid:12269) by myprog uid:1000

So... now I'm wondering how one figures that out. Nowhere in my source code
does it explicitly raise any signal, much less SIGKILL, so there must be some
underlying library or system call or something doing it.

--
---- Fred Smith -- fredex at fcshome.stoneham.ma.us -----------------------------
                       I can do all things through Christ 
                              who strengthens me.
------------------------------ Philippians 4:13 -------------------------------
_______________________________________________
CentOS mailing list
CentOS at centos.org
https://lists.centos.org/mailman/listinfo/centos

Fred Smith

2019-Aug-08 23:48 UTC

[CentOS] another bizarre thing...

On Thu, Aug 08, 2019 at 05:06:06PM +0000, Young, Gregory wrote:
> Is this on both EL6 and EL7? If only EL7, it could be control groups causing
> the issue. The idea of cgroups is to prevent zombie processes, but if you
> need your program to spawn another process then restart itself while the
> other process continues to run, you need to launch it in a different control
> group, or the shutdown of the parent process will also kill the child. In my
> case, we have an upgrade script which needs to get called, then shut down the
> calling process in order to upgrade it. For example:
>
> # Clear any errors in the upgrade control group.
> /bin/systemctl reset-failed upgrade-trigger
>
> # Launch the upgrader in its own control group.
> /bin/systemd-run --unit=upgrade-trigger --slice=upgrade-trigger /bin/bash /opt/myapp/Upgrade.sh "$1" "$2"
>
> If we don't do this, the upgrade fails as the upgrader gets terminated
> when the parent application is shut down.
>
Well, we aren't INTENTIONALLY using control groups. Do we get put
into one by the very act of launching a program which then creates
threads, all of which coexist until they're told to stop?

I don't think it's the scenario you describe. The main program launches
from an init script, does some sanity checks, loads some config files,
then spawns the number of threads defined by its configuration. Then
all the threads, including the main prog, hang around doing stuff until
they're told to stop, which happens all at once for all of them.
On a good day, anyway. What is happening now is that they all run
fine for some time (an hour or twelve), then they all receive a SIGKILL.

According to a systemtap script I found online, the program
is killing itself, but as the guy who wrote it, I don't think so.
The script can be seen in my earlier mail.

As for whether it also fails on C6, I don't know. I've asked our support
team to see if they have a C6/EL6 customer who will let them install
the latest version for 6 and see what happens, but so far, no joy.

Fred
-- 
---- Fred Smith -- fredex at fcshome.stoneham.ma.us -----------------------------
  "And he will be called Wonderful Counselor, Mighty God, Everlasting Father,
  Prince of Peace. Of the increase of his government there will be no end. He
 will reign on David's throne and over his kingdom, establishing and upholding
      it with justice and righteousness from that time on and forever."
------------------------------- Isaiah 9:7 (niv) ------------------------------

Young, Gregory

2019-Aug-09 14:24 UTC

[CentOS] another bizarre thing...

Hi Fred,

Yep, that's exactly how control groups work in CentOS 7. You don't (normally)
need to define them; they get assigned when the init script or systemd service
launches the program. As I mentioned, the idea is to ensure none of those child
threads become zombies if the parent dies, crashes, or gets killed. For
troubleshooting, you could try moving the child threads into their own cgroup,
which might help reduce the noise when the parent process gets killed. Of
course, you will have to manually kill the child processes during this testing,
but it might clear enough of the strace logging for you to see where the parent
process is getting killed. Don't forget to undo this debugging step when
done, or you will end up with zombies when you legitimately want to shut down
the process.
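
One quick way to confirm which control group a running process landed in is to read /proc/<pid>/cgroup. The following is only a minimal sketch, assuming a Linux /proc filesystem; the helper name show_cgroup is made up for illustration and is not part of any tool:

```shell
#!/bin/bash
# Print the control group membership of a process.
# show_cgroup is a hypothetical helper name, not a standard command.
# Usage: show_cgroup [PID]   (defaults to the current shell)
show_cgroup() {
    local pid="${1:-$$}"
    local file="/proc/${pid}/cgroup"
    if [ -r "$file" ]; then
        # One line per hierarchy, in the form ID:controllers:path
        cat "$file"
    else
        echo "cannot read $file" >&2
        return 1
    fi
}

# Example: show the cgroups of the shell running this script.
show_cgroup "$$" || true
```

For a daemon started by systemd, the path column typically sits under system.slice, which would show that the process was placed in a cgroup without anyone asking for it explicitly.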

Also, if you haven't already, you may want to convert it to launch from a
systemd ".service" file. It gives you a lot of control over startup
timeouts, restarts, shutdown commands, process branching, etc. If nothing else,
it might help you identify when the process dies, and restart it without
intervention...
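
As a rough illustration of the ".service" suggestion, the sketch below prints a small unit file. Everything specific in it is an assumption: the unit name myprog, the path /opt/myapp/myprog, the timeout values, and Type=simple (which only fits if the program stays in the foreground instead of daemonizing). The helper name print_unit is hypothetical too.

```shell
#!/bin/bash
# Print a minimal systemd unit for a long-running daemon.
# print_unit is a hypothetical helper; name and path are placeholders.
# Usage: print_unit NAME /path/to/executable
print_unit() {
    local name="$1" exe="$2"
    cat <<EOF
[Unit]
Description=${name} daemon (example only)
After=network.target

[Service]
# Assumes the program does not fork into the background.
Type=simple
ExecStart=${exe}
# Restart after a crash or an unclean signal such as SIGKILL.
Restart=on-failure
RestartSec=5
TimeoutStartSec=60

[Install]
WantedBy=multi-user.target
EOF
}

print_unit myprog /opt/myapp/myprog
```

The output could be saved as /etc/systemd/system/myprog.service and picked up with "systemctl daemon-reload". With Restart=on-failure, systemd logs each unexpected exit in the journal and starts the program again, which gives a timestamped record of every kill.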

Gregory Young 

