thr3ads.net - Ovirt devel - [Ovirt-devel] Re: List of performance stats to monitor [Apr 2008]

Hugh O. Brock
2008-Apr-22 22:37 UTC
[Ovirt-devel] Re: List of performance stats to monitor

On Fri, Apr 18, 2008 at 04:47:05PM -0400, mark wagner wrote:

Mark, this looks great. I'm forwarding it out to ovirt-devel with
comments in-line (and a few edits).
>
> Here is a crack at the performance stats to monitor and a "prioritized"
> list for implementation (which is what Tim was looking for). The assumption
> I'm making is that this list is for the beta release and that we are only
> interested in performance statistics of the host at this point in time.
> The main goal as I see it is to use the performance stats as a basic health
> check of the data center.  The ability to dig into certain areas is also 
> important, but not the main priority.
>
> To come up with the list for basic monitoring, it took a bit of
> thinking differently from the performance tuning we normally do.
> Basically we tend to drill into very specific things, while a basic health
> check takes a broader view.
>
> The main things involve number of VMs, CPU, Disk and memory consumption.  
> Networking would be fourth on the list.
>
> The other aspect of this involves the proposed aggregate level monitoring.
> Rolling the stats up to a resource pool level is useful for capacity
> planning type of activities but doesn't always reflect potential resource
> issues within the pool. It is sort of analogous to looking at the average
> CPU consumption on a 16 CPU system, seeing a 6% average utilization on the
> box, and wondering why a single threaded app is not performing well.
> However, after poking around a bit, the aggregate model does seem to work
> in many places and seems appropriate for this level of work.
Sure... I think ultimately we're going to want both a ton of different
ways to aggregate *and* the ability to drill all the way down
to individual VMs. But understood, we don't have to have all that in
place for the beta.
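
To make the aggregation concern concrete, here is a minimal Python sketch of the 16 CPU example from the quoted text. The per-CPU readings are invented for illustration; the point is only that a rolled-up average can sit at 6% while one CPU is saturated, which is why a drill-down is wanted alongside the rollup.

```python
from statistics import mean

def summarize(utilization):
    """Return (average, peak) utilization in percent for per-CPU readings."""
    return mean(utilization), max(utilization)

# Hypothetical 16-CPU host: one CPU pinned by a single-threaded app,
# the other fifteen idle.
per_cpu = [96.0] + [0.0] * 15

avg, peak = summarize(per_cpu)
print(f"average: {avg:.1f}%  peak: {peak:.1f}%")
```

Reporting the peak (or a per-CPU breakdown) next to the average is one simple way to keep the rollup from hiding this kind of problem.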
> So I've grouped things into three levels of priority. These should apply
> equally across the aggregate levels unless indicated otherwise. I am also
> using the data that we appear to be able to get from collectd and
> potentially libvirt, although I have yet to set it up and try it.
>
> Group 1
> ------------
> Load average
>    the 1, 5, and 15 min averages like top provides. We should consider
> using a "stacked view" to show individual machines within as well (lower
> priority)
>
> Storage Space
>    used and available
>
> Memory
>    In use and available (like top)
>    Allocated and unassigned (so if you have 16GB on a host and only have
> 4GB allocated to VMs, you'd have 12GB unassigned)
>
> VMs
>    not sure if we get this out of libvirt and stored, but: number of
> configured VMs, number of running VMs (number of zombies?)
Yes, this is all critical, and I think easily doable with current
libvirt + collectd combo.
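
As a rough sketch of the Group 1 memory and VM numbers, the following Python reproduces the 16GB/4GB example from the quoted list. The `VM` record and `group1_summary` function are hypothetical names for illustration only; how the values would actually be pulled from libvirt or collectd is not shown and is not settled in this thread.

```python
from dataclasses import dataclass

@dataclass
class VM:
    name: str
    memory_gb: int   # memory allocated to this VM
    running: bool

def group1_summary(host_memory_gb, vms):
    """Compute VM counts and allocated/unassigned memory for one host."""
    allocated = sum(vm.memory_gb for vm in vms)
    return {
        "configured_vms": len(vms),
        "running_vms": sum(1 for vm in vms if vm.running),
        "allocated_gb": allocated,
        "unassigned_gb": host_memory_gb - allocated,
    }

# Hypothetical host with 16GB of memory and 4GB handed out to two VMs.
vms = [VM("vm1", 2, True), VM("vm2", 2, False)]
summary = group1_summary(16, vms)
print(summary)
```

Summing these per-host dictionaries would give the resource pool rollup, under the assumption that the same fields apply at every aggregate level.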
> Group 2
> -------------
> CPU utilization
>    display the normal user, nice, sys, idle, wait type of stats for a 
> single host
>
> Network stats
>    Throughput rates, error rollups (note: rolling up all errors makes it
> easier to spot things)
>
> Disk Stats
>    IO/sec, bytes/sec, wait times (queues?)
>
Mmmm, I am salivating... However, I see all of this kind of data as
drill-down (and, without diving too far into implementation, as
generated outside of the WUI and handed over to it in .png form or
something of that nature).
> Group 3
> ----------------
> Load Average
>    we should consider using a "stacked view" to show individual machines
> within as well - move up priority list if easy to do
>
> Network
>    specific error info broken out by type - maybe limited to host by host 
> w/no rollup
>
>
> That is the high-level breakdown of what I think we should try.  Most of
> these stats use plugins to collectd. We can also potentially write our own 
> if needed.  For instance, when doing DB tuning one of the things I use is 
> iostat to get data on the disk queues and latencies.
>
> I'll dig into this more over the weekend as well.  I saw several sites
> that use collectd and have some stuff on the web, for instance
> http://csg.sph.umich.edu/docs/cluster/stats/ (click on the graphs)
This looks great. We'll need to coordinate between you, Jay, Ian, and
Tim on the Group 1 bits to figure out how we're going to get them onto
the screen in the UI. Group 2 bits I'm happy pulling pngs out with
rrdtool for the beta, I think -- does anyone else have thoughts on
this?
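
For the Group 2 graphs, one possible shape for the rrdtool step is sketched below in Python. It only builds the `rrdtool graph` argument list; the RRD path, the output file name, and the assumption that the data source inside the file is named `value` (as collectd's rrdtool plugin typically writes for single-value types) are all illustrative and would need checking against a real collectd setup.

```python
import shlex

def rrd_graph_command(rrd_file, png_file, label, start="-1h"):
    """Build an `rrdtool graph` command that renders one RRD file to a PNG.

    Assumes the data source in the RRD file is called "value" and that
    neither the path nor the label contains a colon (rrdtool would need
    those escaped).
    """
    return [
        "rrdtool", "graph", png_file,
        "--start", start,
        "--title", label,
        f"DEF:v={rrd_file}:value:AVERAGE",   # read the averaged series
        f"LINE1:v#0000FF:{label}",           # draw it as a thin blue line
    ]

# Hypothetical collectd output path for user CPU time on one host.
cmd = rrd_graph_command(
    "/var/lib/collectd/rrd/host1/cpu-0/cpu-user.rrd",
    "cpu-user.png",
    "cpu-user",
)
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine with rrdtool installed, and the PNG then served to the UI.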

Thanks, and let me know,
--Hugh