Hello ssh list,
I'm a long-time user of OpenSSH, but relatively new to the concept of ssh
multiplexing. I'm experiencing some issues and I haven't figured out how
to troubleshoot them just yet. I would appreciate some help if possible.
I'm using ssh as a communications mechanism to pass text-file-based
messages between 2 hosts. There are programs on each side that send and
receive these messages. When I found out about ssh multiplexing, I was
excited to use it because we were seeing several hundred ssh connections
going back and forth between the 2 hosts. When I tried ssh multiplexing,
the message latency dropped dramatically, by a factor of 10! However, now
that this mechanism has been in use for a week, I'm starting to see some
problems.
First, these are the contents of .ssh/config:
Host *
ControlPath ~/.ssh/cm-%r@%h:%p
ControlMaster auto
ControlPersist 10m
Everything seems to work for a few days, but then ssh starts to hang, and
we start seeing several hundred ssh processes all trying, and failing, to
send their messages. When I try to run ssh by hand, this is what I get:
$ ssh -vvv boss@ui1
OpenSSH_6.6.1, OpenSSL 1.0.1e-fips 11 Feb 2013
debug1: Reading configuration data /var/lib/worker/.ssh/config
debug1: /var/lib/worker/.ssh/config line 1: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 56: Applying options for *
debug1: auto-mux: Trying existing master
And it hangs at that point indefinitely until Ctrl-C.
At this point in time, we do see the ssh mux process still running:
$ ps -eo pid,user,args | awk '$2=="worker" && $3=="ssh:" && $5=="[mux]" {print}'
29305 worker ssh: /var/lib/worker/.ssh/cm-boss@ui1:22 [mux]
I attached strace to the ssh mux process, and this is what I see
when the problem is happening:
select(1024, [3 5 9], [], NULL, {0, 11336}) = 0 (Timeout)
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778030739}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778085461}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778109973}) = 0
select(1024, [3 4 5 9], [], NULL, NULL) = 1 (in [4])
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778186890}) = 0
accept(4, 0x7ffe26b34360, [128]) = -1 EMFILE (Too many open files)
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778263743}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778298340}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778343707}) = 0
select(1024, [3 5 9], [], NULL, {1, 0}) = 0 (Timeout)
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778457543}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778518096}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778546349}) = 0
select(1024, [3 4 5 9], [], NULL, NULL) = 1 (in [4])
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778627517}) = 0
accept(4, 0x7ffe26b34360, [128]) = -1 EMFILE (Too many open files)
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778693493}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778725395}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778749417}) = 0
select(1024, [3 5 9], [], NULL, {1, 0}) = 0 (Timeout)
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 778904087}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 778963540}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 778988943}) = 0
select(1024, [3 4 5 9], [], NULL, NULL) = 1 (in [4])
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 779072887}) = 0
accept(4, 0x7ffe26b34360, [128]) = -1 EMFILE (Too many open files)
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 779158255}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 779191597}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 779216201}) = 0
select(1024, [3 5 9], [], NULL, {1, 0}) = 0 (Timeout)
clock_gettime(0x7 /* CLOCK_??? */, {17873816, 779334945}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873816, 779393178}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873816, 779418473}) = 0
Does this indicate that the open-file limit for this user is being hit? Or
is this something else? This is ulimit -a for that user:
-bash-4.2$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2062375
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
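Since the ulimit output above is for a login shell and not for the mux
process itself, one further check might be to compare the number of
descriptors the mux process has open against that process's own limit.
This is only a rough sketch, assuming Linux with /proc mounted; the helper
names count_fds and soft_nofile_limit are made up for illustration:

```shell
#!/bin/bash
# Hypothetical helpers; assumes Linux with /proc mounted and permission
# to read the target process's /proc entries.

# Count the file descriptors currently open in the process with PID $1.
count_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Print the soft "Max open files" limit of the process with PID $1.
# In /proc/<pid>/limits the soft limit is the 4th whitespace-separated field.
soft_nofile_limit() {
    awk '/^Max open files/ {print $4}' "/proc/$1/limits"
}

# Demo on the current shell; for the mux process, pass its PID instead.
pid=$$
echo "pid $pid: $(count_fds "$pid") of $(soft_nofile_limit "$pid") descriptors in use"
```

Running it with pid=29305 (the mux process shown earlier) should show
whether the count is sitting at the limit while accept() returns EMFILE.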
Any advice on how to troubleshoot this further? Thanks in advance...