Chris Lalancette wrote:
> All (but especially those working on the UI),
> One thing that we are woefully weak on right now is showing the state of
> the managed nodes in the datacenter/collection (Iain and I had something of a
> conversation about this on Friday). In fact, we have no state at all, once the
> node has contacted us initially. This obviously needs to change; we need to
> know the state of the host. On the backend, we actually already have a daemon
> to periodically check machines we manage, called host-status. We need to
> display this data on the UI.
> However, we actually need something further. Take the following situation:
>
> 1. 3 virtual machines are started on some node, node X
> 2. node X crashes for whatever reason
> 3. Admin reboots node X
>
> At this point, you would think it would be safe to restart the 3 VMs that were
> on node X when it crashed. However, it is actually not; we can't be sure
> whether node X actually crashed, or we couldn't contact it at the moment due
> to some (transient) network failure.
>
Um, I think I know where you are going, but in the example you give, the
admin reboots the node, so we know that it has been rebooted, correct?
Or are you saying that the admin initiated a reboot via the WUI and we
can't tell if the WUI actually rebooted the system?
If we have reestablished connectivity to the host, we could check the
"uptime" of the box to determine how long it has been up. I'm not sure
if we can get this from libvirt or collectd, but there are other ways to
get the data. In a properly implemented system we could also look for
other signs of activity; for instance, is the host reading from or
writing to storage?
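For what it's worth, the uptime check itself is simple to sketch. Something
like the following (helper names are made up; it assumes we can read
/proc/uptime off the host somehow, e.g. over ssh or via collectd):

```python
def parse_uptime(proc_uptime_text):
    """Parse the contents of /proc/uptime: "<seconds up> <seconds idle>"."""
    return float(proc_uptime_text.split()[0])

def host_really_rebooted(uptime_seconds, outage_started, now):
    """If the box has been up for less time than we've been unable to
    reach it, it really did go down and come back. Otherwise we most
    likely just lost contact (network failure, dropped packets, etc.)."""
    return uptime_seconds < (now - outage_started)
```

The transport (ssh, collectd, whatever) doesn't matter much; the comparison
is the useful part.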
Also, I'm not sure that it was just a "transient" network failure. The
mechanisms we are currently implementing to get this data are based on
UDP, which is unreliable. If there is heavy network usage, there is no
guarantee of the data getting through.
There are also lots of other pieces to solving this lack of contact. Is
this the only host that we don't see, or are there others "missing" as
well? Being able to include some knowledge of network topology and
power "grids" at some point in the future will help determine if a
circuit breaker popped or if a switch is down.

> The result of this is that we need some sort of fence, that will *really*
> shoot the node in the head, and make sure we don't corrupt guest disks.
I'm not sure that resetting a host running guests is the best thing to
do in an attempt to not corrupt the data on the disks, especially if we
are not sure that the guests / system are really down or that we just
can't get to it via the network.

> Since we don't currently have the code for that, Dan suggested the
> "manual fence"; that is, the admin walks over and has to manually power
> cycle the box. I think that's the right short-term solution. This
> requires a UI to move a managed node from the "unknown/can't be
> contacted" state, back to "alive", once the admin has rebooted the box.
>
> I know this is a long e-mail, so the short of it is:
> 1) We need to display host health/status on the UI
> 2) We need the ability in the UI to move a host from one (arbitrary?)
> state to another (arbitrary?) state.
>
And if some of these states include things like "reboot initiated, boot
initiated", etc., we can refine our need to shoot things in the head. As
an example, if we tell a host to reboot, when it comes back up, we
should be able to check "uptime" to make sure that it really did reboot.
Also, by logging the command to reboot and the time contact is
reestablished, we can build a "map" of reboot times for the hosts going
forward. Establishing baselines for individual hosts would allow us to
trigger warnings in the future once a host has exceeded its "typical"
reboot time.

> Thoughts?
> Chris Lalancette
>
>
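To make the reboot-time "map" idea concrete, here's a rough sketch (the
names are invented; the real inputs would be the logged reboot command
time and the time contact is reestablished):

```python
import statistics

class RebootBaseline:
    """Track observed reboot durations per host and warn when a host
    exceeds its "typical" reboot time. A sketch, not existing code."""

    def __init__(self):
        self.history = {}  # host -> list of observed reboot durations (s)

    def record(self, host, reboot_sent_at, contact_restored_at):
        """Log one completed reboot: command time to contact-restored time."""
        self.history.setdefault(host, []).append(
            contact_restored_at - reboot_sent_at)

    def is_overdue(self, host, waited_seconds, slack=2.0):
        """True once we've waited more than `slack` times the host's
        median historical reboot time; no opinion without a baseline."""
        samples = self.history.get(host)
        if not samples:
            return False
        return waited_seconds > slack * statistics.median(samples)
```

The median and the slack factor are arbitrary choices here; the point is
just that per-host history turns "it's been a while" into an actual alarm.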
I understand that there are valid concerns about being able to determine
the state of guests. I do think that this can be mitigated by also
monitoring the state of the guests. I think that people assume that we
can't do that because collectd and libvirt can't get us info on things
like a windoze guest. However, I would propose that we look at the
issues to be solved and then pick an implementation instead of picking
an implementation and saying what can't be done.

> _______________________________________________
> Ovirt-devel mailing list
> Ovirt-devel at redhat.com
> https://www.redhat.com/mailman/listinfo/ovirt-devel
>
So, one piece of this puzzle that hasn't really been clear to me is how
we are controlling the host. Are there plans to try using an IPMI
interface? Will we be able to tap into remote-controlled power strips?
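If IPMI is on the table, fencing a node is mostly a matter of driving the
BMC; with ipmitool the invocation would look roughly like this (sketch
only, hostnames and credentials are placeholders):

```python
import subprocess

def ipmi_power_cmd(bmc_host, user, password, action="cycle"):
    """Build the ipmitool command line to control a node's power via
    its BMC over IPMI-over-LAN. Returning the argv makes it easy to
    log the fencing action before running it."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host,
            "-U", user, "-P", password, "chassis", "power", action]

def fence_node(bmc_host, user, password):
    """Actually shoot the node in the head (assumes ipmitool is
    installed and the BMC is reachable)."""
    subprocess.run(ipmi_power_cmd(bmc_host, user, password), check=True)
```

A remote-controlled power strip would slot in the same way: swap the
command builder, keep the same fencing interface.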
Closing thought I'd like to share:
When I break it down, I view oVirt as a network management system. It
provisions, monitors, and manages systems over the network. I'm not
saying it's not specialized, but if you compare it to the management
systems out there, it provides the same basic functionality, just on a
different set of problems. I think that if we approach it that way, we
will find that there are a lot of solutions to these issues available to
us.
-mark