I''m working with a colleague on a network latency problem he is having
on a T5220 server. Basically, his server reads some data off the
network interface via sockets from a different hardware device, and then
the T5220 server responds with some data written out to the same the
network interface.
There is a timing "requirement", where the reads happen about every 80
milliseconds, and the responding writes need to happen and be complete
in about 240 milliseconds. On other hardware (and even on Solaris 8),
typical responses are in the 20 msec range, and had no latency issues at
all.
What he is seeing is the write response happens OK for a while, then it
slows down where the response (socket write from the T5220) takes longer
than 240 msecs.
He used the socketsnoop.d script from the DTrace Toolkit, and had the
following observations:
> The problem in a nutshell is that there is a message (controls for
> hardware) sent about every 80 msec from the T5220/Solaris v10 to an
> MVME6100/VxWorks v5.5.1 via TCP/IP over the e1000g1 interface. The
> message needs to get to the MVME6100 within 240 msec of being sent or
> the signal processor will run out of commands. The process sending the
> message is running at a real time priority and only one other controller
> process has a higher real time priority. From a dtrace that monitored
> fbt:ip:tcp_output:entry, I see 5 socket writes each time my message is
> sent:
>
>
>
> TIME WALLTIME UID PID DIR SIZE CMD (wrapped
on next line)
>
> 499194.679681840 2008 Oct 20 14:23:35.61 0 21542 W 14600
> /home/cm/block3/v66_block3/20081015/rdp/bin/rc_radar_sched
>
> 499194.680047812 2008 Oct 20 14:23:35.62 0 21542 W 14600
> /home/cm/block3/v66_block3/20081015/rdp/bin/rc_radar_sched
>
> 499194.680408073 2008 Oct 20 14:23:35.62 0 21542 W 14600
> /home/cm/block3/v66_block3/20081015/rdp/bin/rc_radar_sched
>
> 499194.680613575 2008 Oct 20 14:23:35.62 0 21542 W 14600
> /home/cm/block3/v66_block3/20081015/rdp/bin/rc_radar_sched
>
> 499194.680732307 2008 Oct 20 14:23:35.62 0 21542 W 4160
> /home/cm/block3/v66_block3/20081015/rdp/bin/rc_radar_sched.
>
>
>
> When the problem occurred I saw only 3 write followed by a 2 references to
> sched:
>
>
>
> 499194.761630757 2008 Oct 20 14:23:35.70 0 21542 W 14600
> /home/cm/block3/v66_block3/20081015/rdp/bin/rc_radar_sched
>
> 499194.761992335 2008 Oct 20 14:23:35.70 0 21542 W 14600
> /home/cm/block3/v66_block3/20081015/rdp/bin/rc_radar_sched
>
> 499194.762465836 2008 Oct 20 14:23:35.70 0 21542 W 14600
> /home/cm/block3/v66_block3/20081015/rdp/bin/rc_radar_sched
>
> 499194.763077839 2008 Oct 20 14:23:35.70 0 0 W 14600 sched
> 499194.763134514 2008 Oct 20 14:23:35.70 0 0 W 4160 sched
>
>
>
The 5 socket write()''s above, (four at 14600 bytes, and the fifth @
4160
bytes) from the rc_radar_sched process, are normal, and within the time
window we need. But when it takes longer than 240 msecs, we only see 3
socket write()''s at 14600 bytes from rc_radar_sched, followed by two
socket write()''s from the sched process (the first is 14600 bytes, and
the second 4160). This is when the delay happens for subsequent
write()''s.
What could be going on here? Any more data we should dtrace?
Thanks,
Pat