Hugh O. Brock
2008-Apr-22 22:37 UTC
[Ovirt-devel] Re: List of performance stats to monitor
On Fri, Apr 18, 2008 at 04:47:05PM -0400, mark wagner wrote:

Mark, this looks great, I'm forwarding it out to ovirt-devel with
comments in-line (and a few edits).

> Here is a crack at the performance stats to monitor and a "prioritized"
> list for implementation (which is what Tim was looking for). The assumption
> I'm making is that the list is for the beta release and that we are only
> interested in performance statistics of the host at this point in time.
> The main goal as I see it is to use the performance stats as a basic health
> check of the data center. The ability to dig into certain areas is also
> important, but not the main priority.
>
> In order to come up with the list for basic monitoring, it took a bit of
> thinking differently from the performance tuning tools that we do.
> Basically we tend to drill into very specific things while a basic health
> check is a different aspect.
>
> The main things involve number of VMs, CPU, Disk and memory consumption.
> Networking would be fourth on the list.
>
> The other aspect of this involves the proposed aggregate level monitoring.
> Rolling the stats up to a resource pool level is useful for capacity
> planning type of activities but doesn't always reflect potential resource
> issues within the pool. Sort of analogous to looking at the average CPU
> consumption on a 16 CPU system and seeing a 6% average utilization on the
> box and wondering why a single threaded app is not performing well. However,
> after poking around a bit, the aggregate model does seem to work in many
> places and seems appropriate for this level of work.

Sure... I think ultimately we're going to want both a ton of different
ways to aggregate, as *well* as the ability to drill all the way down to
individual VMs. But understood we don't have to have all that in place
for the beta.

> So I've grouped things into three levels of priority. These should apply
> equally across the aggregate levels unless indicated otherwise.
> I am also using the data that we appear to be able to get from collectd
> and potentially libvirt, although I have yet to set it up and try it.
>
> Group 1
> ------------
> Load average
>     the 1, 5, and 15 min averages like top provides. we should consider
>     using a "stacked view" to show individual machines within as well
>     (lower priority)
>
> Storage Space
>     used and available
>
> Memory
>     In use and available (like top)
>     Allocated and unassigned (so if you have 16GB on a host and only have
>     4GB allocated to VMs, you'd have 12 unassigned)
>
> VMs
>     not sure if we get this out of libvirt and stored, but: number of
>     configured VMs, number of running VMs, (number of zombies?)

Yes, this is all critical, and I think easily doable with current
libvirt + collectd combo.

> Group 2
> -------------
> CPU utilization
>     display the normal user, nice, sys, idle, wait type of stats for a
>     single host
>
> Network stats
>     Throughput rates, error rollups (note rolling up all errors makes it
>     easier to spot things)
>
> Disk Stats
>     io/sec, bytes/sec, wait times, (Q's?)
>

Mmmm I am salivating... However I see all of this kind of data as
drill-down (and, without diving too far into implementation, as
generated outside of the WUI and handed over to it in .png form or
something of that nature)

> Group 3
> ----------------
> Load Average
>     we should consider using a "stacked view" to show individual machines
>     within as well - move up priority list if easy to do
>
> Network
>     specific error info broken out by type - maybe limited to host by host
>     w/no rollup
>
>
> That is the high level breakdown of what I think we should try. Most of
> these stats use plugins to collectd. We can also potentially write our own
> if needed. For instance, when doing DB tuning one of the things I use is
> iostat to get data on the disk queues and latencies.
>
> I'll dig into more as well over the weekend.
> There were several sites I saw that use collectd and have some stuff on
> the web, for instance http://csg.sph.umich.edu/docs/cluster/stats/
> (click on the graphs)

This looks great. We'll need to coordinate between you, Jay, Ian, and
Tim on the Group 1 bits to figure out how we're going to get them onto
the screen in the UI. Group 2 bits I'm happy pulling pngs out with
rrdtool for the beta, I think -- does anyone else have thoughts on this?

Thanks, and let me know,
--Hugh
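
For illustration only, here is a rough sketch of two points from the
thread: the "allocated and unassigned" memory figure in Group 1, and the
way a pool-level average can hide one busy host. Everything in it is an
assumption made for the example: the HostSample fields, the host names,
and the sample numbers are hypothetical and do not come from collectd,
libvirt, or the oVirt code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class HostSample:
    """One hypothetical per-host reading; field names are illustrative only."""
    name: str
    load1: float             # 1-minute load average
    mem_total_gb: float      # physical memory on the host
    mem_allocated_gb: float  # memory assigned to VMs on the host


def unassigned_gb(host):
    """Memory on the host that is not assigned to any VM."""
    return host.mem_total_gb - host.mem_allocated_gb


def pool_rollup(hosts):
    """Roll per-host samples up to pool level.

    The maximum and the busiest host are kept next to the average so
    that a single overloaded machine is not hidden by the rollup.
    """
    loads = [h.load1 for h in hosts]
    return {
        "load1_avg": mean(loads),
        "load1_max": max(loads),
        "busiest_host": max(hosts, key=lambda h: h.load1).name,
        "mem_unassigned_gb": sum(unassigned_gb(h) for h in hosts),
    }


# Made-up sample data: two quiet hosts and one busy one.
hosts = [
    HostSample("node1", 0.5, 16, 4),
    HostSample("node2", 0.5, 16, 8),
    HostSample("node3", 8.0, 16, 12),
]
summary = pool_rollup(hosts)
print(summary)
```

With these made-up numbers the pool average load is 3.0 while node3 sits
at 8.0, which is the same effect as the 16-CPU example in Mark's mail.
Showing the maximum beside the average is one cheap way to keep that
visible at the aggregate level.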