Hi all, this is going to be a long explanation, I beg your pardon.
I'm a happy ... I mean REALLY happy, Xen user, and have been for about a year.
I have two production servers running some 8-10 VMs.
The two hosts run Debian as Dom0, whereas the DomUs are an assortment of
Linux distributions and Windows. No issues with the DomUs.
The two hosts share an iSCSI SAN where the DomUs images are stored.
In this configuration, the two hosts allow hot/live, warm and cold VM
migration between one another, just great!!!
Now, a few days ago, one of the two servers crashed. It rebooted with no
noticeable problems: no events in the system and Xen log files, no issues
with iSCSI and LVM, no data corruption, all VMs running happily.
Nevertheless, since then it's been the end of the VM world as we know it.
What happens is that the networking subsystem appears to be badly
damaged, i.e. ping latency to xenbr0 from the LAN has increased by
several orders of magnitude: from a normal 0.2-0.3 ms up to 300 ms.
Given this latency, DomU network performance and accessibility from the
LAN/WAN are degraded to an unacceptable level; a simple file transfer
pushes latency to the order of THOUSANDS of ms.
I could imagine that some piece of HW, such as the NIC, broke during the
crash, but this is not the case. Hold on, the best is yet to come.
This degradation is not permanent; it follows a predictable pattern. Here
is what I discovered:
Let me first state the names of the two servers, for clarity's sake: one is
called "villano" (the one that crashed), the other is "rocciamelone" ...
I chose the names after two conspicuous mountains in my valley :-)
- latency on villano is degraded, so I move all domUs to rocciamelone,
  whose latency is OK
    rocciamelone latency OK, villano KO, domUs net OK
- shut down villano
    rocciamelone latency still OK
- reboot villano, domUs still on rocciamelone
    villano latency OK
    rocciamelone latency KO !!!!!! useless domUs :-(
- move domUs to villano
    villano latency OK, rocciamelone latency still KO (as expected)
- reboot rocciamelone, domUs still on villano
    rocciamelone latency OK after reboot
    villano latency KO after rocciamelone reboot, so long to domUs
The pattern is clear, and it drives me mad: it seems that every time one of
the two hosts reboots, it takes the good latency away from the other and
renders it useless.
So now I have to keep one host disconnected in order for the other to be
operational. Forget about fancy hot and warm standby, we're out in the
cold.
Diagnostics: nothing wrong in syslog and xend.log, and ethtool on the NIC
looks OK too.
LAN and SAN use different NICs and subnets, no issue on the SAN network.
So I tried some analysis of system behavior, like monitoring
traffic on the NICs and vifs ... and here something weird happens.
On the healthy system, I scripted a periodic "netstat" check (every 5 s) to
keep track of exchanged data volumes. Well, it looks like this probe is
rather intrusive (it shouldn't be): while my scripts are running, I
notice a partial degradation of latency, let's say up to around 2-8 ms.
This degradation is reversible: it disappears as soon as I kill the scripts.
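
A minimal sketch of such a periodic traffic probe follows. It is not the
script actually used here. It assumes a Linux Dom0 that exposes
per-interface byte counters in /proc/net/dev, and it reads that file
directly instead of spawning netstat every 5 s. The function names
(parse_net_dev, deltas, sample, monitor) are made up for illustration.

```python
import time


def parse_net_dev(text):
    """Parse /proc/net/dev content into {iface: (rx_bytes, tx_bytes)}."""
    counters = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip this line
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def deltas(before, after):
    """Bytes received/sent per interface between two samples."""
    return {
        iface: (after[iface][0] - before[iface][0],
                after[iface][1] - before[iface][1])
        for iface in after
        if iface in before
    }


def sample(path="/proc/net/dev"):
    """Read the current counters (assumes a Linux host)."""
    with open(path) as f:
        return parse_net_dev(f.read())


def monitor(interval=5.0, rounds=3):
    """Print per-interface traffic for a few sampling windows."""
    prev = sample()
    for _ in range(rounds):
        time.sleep(interval)
        cur = sample()
        for iface, (rx, tx) in sorted(deltas(prev, cur).items()):
            print(f"{iface}: rx {rx} B, tx {tx} B in {interval:.0f} s")
        prev = cur
```

Calling monitor() on the Dom0 prints the bytes received and sent on each
interface (physical NICs, xenbr0, vifs) for every 5 s window. If latency
still degrades while something this light is running, that would suggest
the probe itself is not the real cause.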
Any idea? Anything I could check/troubleshoot?
Help will be GREATLY appreciated,
Ezio
--
Ezio Ostorero, Catania
Seltz e limone col sale. Arriminatu, non annacatu