Hello,
Recently I've received some complaints of excessive but
intermittent latency of network traffic to domUs on some of my
servers.
Upon investigation it seems that on some servers traffic is indeed
occasionally delayed by up to 140ms where something like 5ms RTT
would be expected. The average RTT is not unusual; since only
occasional packets are affected, it shows up only in the worst case
and the standard deviation.
What seems likely is that these servers are overloaded on CPU. I
have tried the various tweaks of the credit scheduler but the fact
remains that the credit scheduler has a 30ms time slice, so I
believe that when the server is so loaded that domUs are competing
for CPU time, I could expect that a CPU hog gets the CPU for 30ms
before handing over to a non-hog, which then sees a 30ms penalty on
an RTT measurement.
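
As a rough back-of-envelope check of that reasoning, here is a small
Python sketch. It assumes a deliberately simplified model in which
every runnable vCPU ahead of the waiting one consumes a full time
slice; the real credit scheduler is more subtle than this, and the
function names are my own, not anything provided by Xen.

```python
import math

def worst_case_wait_ms(runnable_ahead, timeslice_ms=30.0):
    """Upper bound on queueing delay in the simplified model:
    each vCPU ahead in the run queue uses one full time slice."""
    return runnable_ahead * timeslice_ms

def slices_to_explain(delay_ms, timeslice_ms=30.0):
    """Smallest number of full time slices that covers an observed delay."""
    return math.ceil(delay_ms / timeslice_ms)

if __name__ == "__main__":
    # One hog ahead of us costs up to one slice.
    print(worst_case_wait_ms(1))
    # How many full slices would it take to account for a 140ms delay?
    print(slices_to_explain(140))
```

Under that assumption, a 140ms delay would need about five full
slices of waiting, i.e. several hogs queued ahead rather than one.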
Clearly the answer is to not overload the servers, and it's one I
completely agree with. However, I am not sure how best to measure
this. I do not have control over what the domUs do, so their CPU
usage profile can change. I need to monitor this in order to know
when I need to move or restrict a domU. I need to know when there is
actual overloading taking place, without having to measure network
traffic RTT.
For a long while I've been measuring CPU usage for the entirety of a
physical piece of hardware by watching the CPU time counters as
displayed by "xm list --long". Feeding those into stats software
like MRTG or Cacti gives me the CPU time used per polling period,
which in turn gives me the percentage of CPU used by every domU and
the dom0.
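
For anyone unfamiliar with the counter-to-percentage step, this is
roughly the arithmetic involved, sketched in Python. It assumes you
have already extracted a cumulative CPU time in seconds for a domain
at two polls; the function name, the physical_cpus parameter and the
reset handling are my own illustration, not what MRTG or Cacti do
internally.

```python
def cpu_percent(prev_cpu_seconds, curr_cpu_seconds, interval_seconds,
                physical_cpus=1):
    """Percentage of host CPU capacity used between two polls of a
    cumulative CPU time counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_cpu_seconds - prev_cpu_seconds
    if delta < 0:
        # Assumption: a counter that went backwards means the domain
        # was restarted, so only the time since restart is counted.
        delta = curr_cpu_seconds
    return 100.0 * delta / (interval_seconds * physical_cpus)

if __name__ == "__main__":
    # 261 CPU-seconds consumed during a 300 second poll on one CPU.
    print(cpu_percent(1000.0, 1261.0, 300.0))
```

Summing the per-domain results (dom0 included) gives the figure for
the whole machine.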
Using the above method, one of my possibly overloaded servers shows
about 87% average CPU usage. Up until now, I thought that was
acceptable. I think the problem is that this is based on 5 minute
averages.
Notably the problems most often happen at the top of the hour and on
5 minute intervals, and also at 4am. Sounds like typical cron job
frequencies, right? It's not caused by cron jobs on the dom0 (there
are almost none, and I disabled them all to verify).
I'm thinking that at the top of the hour and sometimes at 5 minute
intervals there are several domUs competing for CPU for a short
amount of time, and not getting it. Over the 5 minute span this is
averaged away, so the figure appears reasonable even though the
contention is actually causing problems.
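
To illustrate how a long averaging window can hide this, here is a
Python sketch using made-up numbers: 300 one-second samples, of which
30 are fully saturated and the rest sit at 85%. The data and the
summarize helper are hypothetical, not measurements from my servers.

```python
def summarize(samples, saturation_threshold=100.0):
    """Return (average, peak, number of saturated samples)."""
    average = sum(samples) / len(samples)
    peak = max(samples)
    saturated = sum(1 for s in samples if s >= saturation_threshold)
    return average, peak, saturated

# Hypothetical 5 minute window sampled once per second:
# a 30 second burst at 100% followed by 270 seconds at 85%.
samples = [100.0] * 30 + [85.0] * 270
average, peak, saturated = summarize(samples)

if __name__ == "__main__":
    print("5 minute average: %.1f%%" % average)   # 86.5%
    print("peak: %.1f%%" % peak)                   # 100.0%
    print("seconds saturated: %d" % saturated)     # 30
```

The average looks much like my 87% figure, yet the machine spent 30
seconds with no spare CPU at all, which only shorter polling
intervals or a peak measurement would reveal.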
So, how are other people monitoring their CPU usage? Are you doing
more frequent polls such as every minute or even more frequent than
that?
Is there a better way to read a domU's CPU time counter than parsing
the output of "xm list --long"? Is it available cheaply from
somewhere in /sys or is there an API or anything?
Cheers,
Andy