Baptiste Jonglez
2022-Aug-31 13:24 UTC
Race condition when using ControlMaster=auto with simultaneous connections
Hello,

I'm trying to multiplex many simultaneous SSH connections through a single
master connection, and I'm hitting a race condition while doing this.
This is not a bug; I'm either hitting a limit in the design of OpenSSH or
misusing it.

The use-case is to use Ansible to configure many hosts simultaneously,
while all connections need to go through a single "SSH bastion" via
ProxyJump. For efficiency and to avoid hitting MaxStartups limits, I would
like to use a control master for the connection to the bastion, via the
following client configuration:

    Host bastion.example.com
        ControlMaster auto
        ControlPath /dev/shm/ssh-%h
        ControlPersist 30

    Host !bastion.example.com *.example.com
        ProxyJump bastion.example.com

However, this does not work when making simultaneous connections: all SSH
connections create a new, separate connection to the bastion. Here is a
simple way to reproduce:

    $ for i in {1..3}; do ssh myhost.example.com "sleep 1" & done
    ControlSocket /dev/shm/ssh-bastion.example.com already exists, disabling multiplexing
    ControlSocket /dev/shm/ssh-bastion.example.com already exists, disabling multiplexing

What happens is the following:

1) each SSH process tries to connect to the control socket and fails
   (this is expected, the control socket is not yet bound)

2) each SSH process then creates a new SSH connection

3) once connected, each process tries to bind to the control socket

4a) one process successfully binds the control socket
4b) all other processes fail to bind the control socket (error message above)

5) in both cases, each process is now using its own separate SSH connection
   to the bastion

The window for the race condition is between 1) and 4), so it's rather
large: it includes the time to establish a new SSH connection.

I believe that taking a lock between steps 1) and 4) could solve the issue:

1.1) each process tries to take an exclusive lock related to the control socket
1.1a) one process gets the lock and can continue creating a SSH connection
1.1b) all other processes wait on the lock; when the lock is released, they
      go back to step 1) to connect to the control socket

4.1) once the control socket has been bound, the "lucky process" releases
     the lock

Does it make sense? Would the project accept a patch implementing this as
an additional option?

Thanks,
Baptiste

--
Baptiste Jonglez
Research Engineer, Inria <https://www.inria.fr/>
STACK team <https://stack-research-group.gitlabpages.inria.fr/web/>
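[Editor's note: for readers hitting the same race today, the serialization
Baptiste proposes can be roughly approximated from the shell by taking an
flock(1) lock around master creation before launching the parallel jobs.
This is only an illustrative workaround under the configuration quoted
above, not an OpenSSH feature and not the proposed patch; the lock file
path is arbitrary, as long as all callers agree on it.]

    #!/bin/bash
    # Illustrative workaround: serialize creation of the master connection
    # so only one ssh process races to bind the control socket.
    LOCK=/dev/shm/ssh-bastion.example.com.lock

    (
        flock 9
        # Under the lock, make sure a master to the bastion exists.
        # "ssh -O check" exits non-zero when no master is running at the
        # configured ControlPath.
        ssh -O check bastion.example.com 2>/dev/null \
            || ssh -M -N -f bastion.example.com
    ) 9>"$LOCK"

    # All subsequent connections find the bound control socket and multiplex.
    for i in {1..3}; do ssh myhost.example.com "sleep 1" & done
    wait

[The difference from the proposal above is that the lock lives outside ssh,
so every caller has to opt in; a lock taken by ssh itself between steps 1)
and 4) would cover unmodified callers such as Ansible.]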
Demi Marie Obenour
2022-Sep-01 03:37 UTC
Race condition when using ControlMaster=auto with simultaneous connections
On 8/31/22 09:24, Baptiste Jonglez wrote:
> [...]
> I believe that taking a lock between steps 1) and 4) could solve the issue:
> [...]
> Does it make sense? Would the project accept a patch implementing this as
> an additional option?

Not sure if this is related, but I would like to have an option to *only*
use the control socket.

--
Sincerely,
Demi Marie Obenour (she/her/hers)
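[Editor's note: as far as I know there is no ControlMaster setting that
refuses to fall back to a direct connection, which seems to be what is
being asked for. A rough approximation with existing options, assuming a
ControlPath is configured for the host (host name below is just an
example), is to probe for a live master first:

    # Illustrative approximation of "only use the control socket":
    # bail out unless a master connection is already running.
    host=bastion.example.com
    if ssh -O check "$host" 2>/dev/null; then
        ssh "$host" true
    else
        echo "no control master for $host, refusing to fall back" >&2
    fi

This is inherently racy (the master could exit between the check and the
second ssh, which would then silently open a direct connection), which is
why a real "control socket only" option would still be useful.]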