Chris Lalancette wrote:
> All (but especially those working on the UI),
>      One thing that we are woefully weak on right now is showing the state of
> the managed nodes in the datacenter/collection (Iain and I had something of a
> conversation about this on Friday).  In fact, we have no state at all, once the
> node has contacted us initially.  This obviously needs to change; we need to
> know the state of the host.  On the backend, we actually already have a daemon
> to periodically check machines we manage, called host-status.  We need to
> display this data on the UI.
>      However, we actually need something further.  Take the following situation:
>
> 1.  3 virtual machines are started on some node, node X
> 2.  node X crashes for whatever reason
> 3.  Admin reboots node X
>
> At this point, you would think it would be safe to restart the 3 VMs that were
> on node X when it crashed.  However, it is actually not; we can't be sure
> whether node X actually crashed, or we couldn't contact it at the moment due to
> some (transient) network failure.
>   
Um, I think I know where you are going, but in the example you give, the
admin reboots the node, so we know that it has been rebooted, correct?
Or are you saying that the admin initiated a reboot via the wui and we
can't tell if the wui actually rebooted the system?

If we have reestablished connectivity to the host, we could check the
"uptime" of the box to determine how long it has been up. I'm not sure
whether we can get this from libvirt or collectd, but there are other
ways to get the data. In a properly implemented system we could also
look for other signs of activity; for instance, is the host reading
from or writing to storage?
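
If we can reach the host at all, the uptime check itself is trivial.
Here's a minimal sketch, assuming passwordless ssh to the node (the
host-status daemon would presumably do the equivalent internally; the
function names are made up for illustration):

    import subprocess
    import time

    def node_uptime_seconds(host):
        """Read the node's uptime (in seconds) from /proc/uptime over ssh."""
        out = subprocess.check_output(["ssh", host, "cat", "/proc/uptime"])
        return float(out.decode().split()[0])

    def rebooted_since(host, lost_contact_at):
        """True if the node's current boot happened after we lost contact."""
        boot_time = time.time() - node_uptime_seconds(host)
        return boot_time > lost_contact_at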
Also, I'm not sure that it has to have been just a "transient" network
failure. The mechanisms we are currently implementing to get this data
are based on UDP, which is unreliable. If there is heavy network usage,
there is no guarantee of the data getting through.
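
To make that concrete: a UDP heartbeat is fire-and-forget, so on the
collector's side a few packets dropped under load look exactly like a
dead host. A toy sender (the collector address and interval are made
up):

    import socket
    import time

    COLLECTOR = ("monitor.example.com", 5544)  # hypothetical collector

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        # sendto() "succeeds" as soon as the datagram leaves this box;
        # nothing tells the sender whether it actually arrived.
        sock.sendto(("heartbeat %f" % time.time()).encode(), COLLECTOR)
        time.sleep(5)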
 
There are also lots of other pieces to solving this lack of contact. Is
this the only host that we don't see, or are there others "missing" as
well? Being able to include some knowledge of network topology and power
"grids" at some point in the future will help determine if a circuit
breaker popped or if a switch is down.

> The result of this is that we need some sort of fence, that will *really*
> shoot the node in the head, and make sure we don't corrupt guest disks.
Not sure that resetting a host running guests is the best way to avoid
corrupting the data on the disks, especially if we are not sure whether
the guests / system are really down or we just can't reach the host over
the network.

> Since we don't currently have the code for that, Dan suggested the
> "manual fence"; that is, the admin walks over and has to manually power
> cycle the box.  I think that's the right short-term solution.  This
> requires a UI to move a managed node from the "unknown/can't be
> contacted" state, back to "alive", once the admin has rebooted the box.
>
> I know this is a long e-mail, so the short of it is:
> 1)  We need to display host health/status on the UI
> 2)  We need the ability in the UI to move a host from one (arbitrary?) state to
> another (arbitrary?) state.
>   
And if some of these states include things like "reboot initiated, boot
initiated", etc., we can refine our need to shoot things in the head. As
an example, if we tell a host to reboot, then when it comes back up we
should be able to check "uptime" to make sure that it really did reboot.
Also, by logging the command to reboot and the time contact is
reestablished, we can build a "map" of reboot times for the hosts going
forward. Establishing baselines for individual hosts would allow us to
trigger warnings in the future once a host has exceeded its "typical"
reboot time. A rough sketch of that bookkeeping is below.
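
Something like the following is about all it takes (every name here is
invented for the example; the threshold factor is a guess):

    reboot_history = {}  # host -> list of past reboot durations, in seconds

    def record_reboot(host, initiated_at, contact_at):
        """Log how long a host took from 'reboot initiated' to contact."""
        reboot_history.setdefault(host, []).append(contact_at - initiated_at)

    def reboot_overdue(host, initiated_at, now, slack=2.0):
        """Warn once a reboot runs well past this host's own baseline."""
        history = reboot_history.get(host)
        if not history:
            return False  # no baseline yet, nothing to compare against
        typical = sum(history) / len(history)
        return (now - initiated_at) > slack * typical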
> Thoughts?
> Chris Lalancette
>
>   
I understand that there are valid concerns about being able to determine
the state of guests. I do think that this can be mitigated by also
monitoring the state of the guests. I think that people assume that we
can't do that because collectd and libvirt can't get us info on things
like a windoze guest. However, I would propose that we look at the
issues to be solved and then pick an implementation, instead of picking
an implementation and saying what can't be done.
So, one piece of this puzzle that hasn't been really clear to me is how
we are controlling the host. Are there plans to try using an IPMI
interface? Will we be able to tap into remote-controlled power strips?
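
If the answer to the IPMI question is yes, that would give us exactly
the "really shoot it in the head" primitive Chris wants, since the BMC
cuts power regardless of what the OS is doing. A sketch wrapping
ipmitool (the BMC address and credentials are placeholders):

    import subprocess

    def ipmi_power(bmc_addr, user, password, action):
        """Issue a chassis power command ("status", "cycle", "off", "on")
        against a node's BMC via ipmitool."""
        subprocess.check_call([
            "ipmitool", "-I", "lanplus",
            "-H", bmc_addr, "-U", user, "-P", password,
            "chassis", "power", action,
        ])

    # e.g. ipmi_power("bmc-nodex.example.com", "admin", "secret", "cycle")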
Closing thought I'd like to share:
When I break it down, I view oVirt as a network management system. It
provisions, monitors, and manages systems over the network. I'm not
saying it's not specialized, but if you compare it to the management
systems out there, it provides the same basic functionality, just on a
different set of problems. I think that if we approach it that way, we
will find that there are a lot of solutions to these issues available
to us.
-mark