You can look at /var/log/ovirt-server/db-omatic.log. The node probably
times out because it stops answering heartbeats.
To get more detail, you can run the db-omatic script in non-daemon mode
(/usr/share/ovirt-server/db-omatic/db_omatic.rb -n).
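For example, something along these lines on the server (I'm assuming the
init script is called ovirt-db-omatic here, adjust to whatever your
install uses; stop the daemonized instance first so the two don't fight
over the same queue):

    # check the last entries in the log
    tail -n 100 /var/log/ovirt-server/db-omatic.log
    # stop the daemonized instance, then run db-omatic in the foreground
    service ovirt-db-omatic stop
    /usr/share/ovirt-server/db-omatic/db_omatic.rb -n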
I see that very often on Fedora 13, a bit less often on Fedora 12.
This happens because the Ruby AMQP bindings get stuck when they have to
handle too many threads.
There's no fix for this yet, but there is a workaround: whenever that
happens, restart everything on the node and the server with these scripts:
http://ovirt.pastebin.com/JjNpEDak
http://ovirt.pastebin.com/tPAPJBpB
You can put them in a cron job.
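For example, a cron entry like this (the path and interval are just an
example, point it at wherever you saved the script; note this restarts
everything periodically whether the node is stuck or not):

    # /etc/cron.d/ovirt-restart -- restart ovirt services every 15 minutes
    */15 * * * * root /usr/local/sbin/ovirt-restart.sh >> /var/log/ovirt-restart.log 2>&1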
On 08/17/2010 05:35 AM, Justin Clacherty wrote:
> After running for a while the node becomes "unavailable" in the server
> UI. All VMs running on that node also become unavailable. The node is
> still running fine as are all the VMs, they're just no longer
> manageable.
>
> I looked on the node and everything appeared to be running fine. Looked
> on the server and ovirt-taskomatic was stopped (this seems to happen
> quite a bit). Restarted it but that didn't help. Restarting Matahari
> on the node sends information to the server but the node does not become
> available. The only way I've been able to get it back is to shutdown
> all the VMs and reboot the node and management server. Is anyone else
> seeing this happen? What else can I look at when it happens again?
>
> Cheers,
> Justin.
>
> _______________________________________________
> Ovirt-devel mailing list
> Ovirt-devel at redhat.com
> https://www.redhat.com/mailman/listinfo/ovirt-devel
>