Jeremy Guthrie
2012-Oct-17 19:11 UTC
SuSE Linux Enterprise Server OpenSSH 5.1p1 nagle issue?
I have a system in place where it appears that TCP will make a massive change in behavior mid-stream with existing SSH sessions. We noticed the issue first with an application using an SSH forward. However, we were able to rule that out by generating the same TCP characteristics by having a perl script dump text out to a terminal simulating a large data flow from the far end(ssh server) back to us(ssh client). The issue manifests roughly as follows: 1. Generate a bunch of terminal output(500k) 2. Sleep 15 seconds 3. Go back to step 1 After repeating steps 1-3 for some random amount of time(sometimes 3 minutes, sometimes 50+), the SSH server will go from streaming the output back to the client @ 4-4.5 mbps(normal-behavior.png), down to 30-40kbps(bad-behavior.png). Most of the time, SSH stays in this 30-40kbps state for as long as their is data in the TCP queue. ie. during peaks, netstat will show the queue having 90-100k of data waiting to be transmitted. We think that Nagle may be taking effect randomly for some reason. When I 'strace -f ssh user at hostname', I don't see the TCP_NODELAY flag being set so that could certainly be true. I look in the ssh docs and I don't see anything about NoDelay but there use to be something according to O'Reilly docs. When I examine the source code, it looks like setting the TCP_NODELAY is some kind of default. The odd thing is that I have hundreds of boxes running this same release of software and no one else is exhibiting this issue. Does anyone have any ideas? -- *Jeremy Guthrie*
Jeremy Guthrie
2012-Oct-17 19:14 UTC
SuSE Linux Enterprise Server OpenSSH 5.1p1 nagle issue?
Correction, when looking at debug, I see: debug2: channel 0: request shell confirm 1 debug2: fd 3 setting TCP_NODELAY debug2: callback done Nagle isn't our issue. Any other ideas on what might be causing the different behavior out of SSH/TCP? On 10/17/12 2:11 PM, Jeremy Guthrie wrote:> I have a system in place where it appears that TCP will make a massive > change in behavior mid-stream with existing SSH sessions. We noticed > the issue first with an application using an SSH forward. However, we > were able to rule that out by generating the same TCP characteristics > by having a perl script dump text out to a terminal simulating a large > data flow from the far end(ssh server) back to us(ssh client). > > The issue manifests roughly as follows: > 1. Generate a bunch of terminal output(500k) > 2. Sleep 15 seconds > 3. Go back to step 1 > > After repeating steps 1-3 for some random amount of time(sometimes 3 > minutes, sometimes 50+), the SSH server will go from streaming the > output back to the client @ 4-4.5 mbps(normal-behavior.png), down to > 30-40kbps(bad-behavior.png). Most of the time, SSH stays in this > 30-40kbps state for as long as their is data in the TCP queue. ie. > during peaks, netstat will show the queue having 90-100k of data > waiting to be transmitted. > > We think that Nagle may be taking effect randomly for some reason. > When I 'strace -f ssh user at hostname', I don't see the TCP_NODELAY flag > being set so that could certainly be true. I look in the ssh docs and > I don't see anything about NoDelay but there use to be something > according to O'Reilly docs. When I examine the source code, it looks > like setting the TCP_NODELAY is some kind of default. > > The odd thing is that I have hundreds of boxes running this same > release of software and no one else is exhibiting this issue. > > Does anyone have any ideas? > > -- > *Jeremy Guthrie*-- *Jeremy Guthrie* Technical Architect - Orchestration | *CDW* 5520 Research Park | Madison, WI 53711 Phone: 608.298.1061 | Fax: 608.288.3007 | NOC: 608.298.1102 | Toll Free: 866.202.1807
Scott Neugroschl
2012-Oct-17 19:47 UTC
SuSE Linux Enterprise Server OpenSSH 5.1p1 nagle issue?
[redacted]> The odd thing is that I have hundreds of boxes running this same > release of software and no one else is exhibiting this issue. > > Does anyone have any ideas?Flaky NIC? I had two identical boxes and one would crawl on the network. Turned out it was a bad NIC.
Ángel González
2012-Oct-17 19:57 UTC
SuSE Linux Enterprise Server OpenSSH 5.1p1 nagle issue?
I guess you attached two graphics of the behavior. The mailing list stripped them. You can post a link if you wish. Providing the script you used may help other people to reproduce it and, hopefully, find/fix the source. Regards
Jeremy Guthrie
2012-Oct-17 20:20 UTC
SuSE Linux Enterprise Server OpenSSH 5.1p1 nagle issue?
I do not think so. I have two boxes at the site and both boxes exhibit the same issue. I forgot to point out that the problem appears to be on a per-flow basis. One flow can be performing poorly, and another ssh session/flow to/from the same host will be just fine and flying at 4mbps. On 10/17/12 2:47 PM, Scott Neugroschl wrote:> [redacted] >> The odd thing is that I have hundreds of boxes running this same >> release of software and no one else is exhibiting this issue. >> >> Does anyone have any ideas? > Flaky NIC? I had two identical boxes and one would crawl on the > network. > Turned out it was a bad NIC. > > > _______________________________________________ > openssh-unix-dev mailing list > openssh-unix-dev at mindrot.org > https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev > >-- *Jeremy Guthrie* Technical Architect - Orchestration | *CDW* 5520 Research Park | Madison, WI 53711 Phone: 608.298.1061 | Fax: 608.288.3007 | NOC: 608.298.1102 | Toll Free: 866.202.1807
Jeremy Guthrie
2012-Oct-17 20:22 UTC
SuSE Linux Enterprise Server OpenSSH 5.1p1 nagle issue?
I will see what I can do about getting the text from the screen shots out there. As for the test script, it is below: #!/usr/bin/perl use Time::HiRes qw( gettimeofday tv_interval ); open(TIMER,">timer.txt") ; select(TIMER) ; $|=1 ; select(STDERR) ; $|=1 ; select(STDOUT) ; $|=1 ; while (1) { my $t0 = [gettimeofday]; print "f"x1000000; my $get_interval = tv_interval($t0); my $block = scalar(localtime(time())) . " " . $get_interval . "\n" ; print TIMER $block ; print STDERR "\n$block\n" ; sleep 15; } BTW, thanks for the help on this everyone! On 10/17/12 2:57 PM, ?ngel Gonz?lez wrote:> I guess you attached two graphics of the behavior. The mailing list > stripped them. You can post a link if you wish. > > Providing the script you used may help other people to reproduce it and, > hopefully, find/fix the source. > > Regards > > > >-- *Jeremy Guthrie* Technical Architect - Orchestration | *CDW* 5520 Research Park | Madison, WI 53711 Phone: 608.298.1061 | Fax: 608.288.3007 | NOC: 608.298.1102 | Toll Free: 866.202.1807
Jeremy Guthrie
2012-Oct-17 20:28 UTC
SuSE Linux Enterprise Server OpenSSH 5.1p1 nagle issue?
One node: packet lengths > 1500 are TSO taking affect, if I turn off TSO, I see under normal conditions the appropriate size/sequence of packets. Example of the connection performing correctly: 450 1.746243 SSH 2642 Encrypted response packet len=2576 clientend -> headend 451 1.746347 TCP 66 43269 > 22 [ACK] Seq=555192630 Ack=3074656266 Win=16666 Len=0 TSval=1456827748 TSecr=798139682 headend -> clientend 452 1.746369 SSH 3930 Encrypted response packet len=3864 clientend -> headend 453 1.746386 TCP 66 43269 > 22 [ACK] Seq=555192630 Ack=3074657554 Win=16666 Len=0 TSval=1456827748 TSecr=798139682 headend -> clientend 454 1.746396 TCP 66 43269 > 22 [ACK] Seq=555192630 Ack=3074654978 Win=16666 Len=0 TSval=1456827748 TSecr=798139682 headend -> clientend 455 1.791951 TCP 66 43269 > 22 [ACK] Seq=555192630 Ack=3074660130 Win=16666 Len=0 TSval=1456827760 TSecr=798139694 headend -> clientend 456 1.791973 SSH 5218 Encrypted response packet len=5152 clientend -> headend 457 1.792501 TCP 66 43269 > 22 [ACK] Seq=555192630 Ack=3074662706 Win=16666 Len=0 TSval=1456827760 TSecr=798139694 headend -> clientend 458 1.792692 TCP 66 43269 > 22 [ACK] Seq=555192630 Ack=3074665282 Win=16666 Len=0 TSval=1456827760 TSecr=798139694 headend -> clientend 459 1.792718 SSH 5218 Encrypted response packet len=5152 clientend -> headend 460 1.792741 TCP 66 43269 > 22 [ACK] Seq=555192630 Ack=3074667858 Win=16666 Len=0 TSval=1456827760 TSecr=798139694 headend -> clientend 461 1.849577 TCP 66 43269 > 22 [ACK] Seq=555192630 Ack=3074670434 Win=16666 Len=0 TSval=1456827774 TSecr=798139706 headend -> clientend 462 1.849600 SSH 5218 Encrypted response packet len=5152 clientend -> headend 463 1.849620 TCP 66 43269 > 22 [ACK] Seq=555192630 Ack=3074673010 Win=16666 Len=0 TSval=1456827774 TSecr=798139706 headend -> clientend 464 1.849746 TCP 66 43269 > 22 [ACK] Seq=555192630 Ack=3074675586 Win=16666 Len=0 TSval=1456827774 TSecr=798139706 headend -> clientend 465 1.849764 SSH 6506 Encrypted response packet len=6440 clientend -> headend Example of the connection showing a ping-ponging effect using only 1288 byte packets 1288 41.112341 TCP 66 43269 > 22 [ACK] Seq=555193638 Ack=3075980138 Win=16666 Len=0 TSval=1456837590 TSecr=798149524 headend -> clientend 1289 41.386131 SSH 1354 Encrypted response packet len=1288 clientend -> headend 1290 41.432684 TCP 66 43269 > 22 [ACK] Seq=555193638 Ack=3075981426 Win=16666 Len=0 TSval=1456837670 TSecr=798149604 headend -> clientend 1291 41.702157 SSH 1354 Encrypted response packet len=1288 clientend -> headend 1292 41.753245 TCP 66 43269 > 22 [ACK] Seq=555193638 Ack=3075982714 Win=16666 Len=0 TSval=1456837750 TSecr=798149683 headend -> clientend 1293 42.022128 SSH 1354 Encrypted response packet len=1288 clientend -> headend -- *Jeremy Guthrie* Technical Architect - Orchestration | *CDW* 5520 Research Park | Madison, WI 53711 Phone: 608.298.1061 | Fax: 608.288.3007 | NOC: 608.298.1102 | Toll Free: 866.202.1807