thr3ads.net - Xen users - Xen networking degrade [Jul 2013]

Ezio Ostorero
2013-Jul-08 15:01 UTC
Xen networking degrade

Hi All, this is going to be a long explanation, I beg your pardon.

  I'm a happy ... I mean REAL happy, Xen user for about a year now.

  I have two production servers running some 8-10 VMs.
  The two hosts run Debian as Dom0, whereas the DomUs are assorted
Linux distributions and Windows. No issues with the DomUs.

  The two hosts share an iSCSI SAN where the DomUs images are stored.

  In this configuration, the two hosts allow hot/live, warm and cold VM
migration from one to the other, just great!!!

  Now, a few days ago, one of the two servers crashed, it rebooted with no
noticeable problems, no events in the system and Xen log files, no issues
with iSCSI and LVM, no data corruption, all VMs running happily.

  Nevertheless, since then it's been the end of VM world as we know it.

  What happens is that the networking subsystem appears to be badly
damaged, i.e. ping latency on xenbr0 from the LAN increased by
several orders of magnitude: from a normal 0.2-0.3 ms up to 300 ms.
  Given this latency, the DomUs' network performance and accessibility
from LAN/WAN are degraded to an unacceptable level; a simple file transfer
sends latency to the order of THOUSANDS of ms.
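
  For anyone who wants to log these figures over time rather than watch
ping by hand, here is a minimal Python sketch. It is only an illustration,
not something used in the measurements above: it assumes the usual Linux
ping reply format ("time=0.253 ms") and the "-c" count option, and the
function names are made up for the example.

```python
import re
import statistics
import subprocess

# Matches the per-reply RTT in typical Linux ping output, e.g. "time=0.253 ms".
RTT_RE = re.compile(r"time[=<]([0-9.]+) ?ms")


def parse_rtts(ping_output):
    """Return the per-reply round-trip times (in ms) found in ping's output."""
    return [float(m.group(1)) for m in RTT_RE.finditer(ping_output)]


def summarize(rtts):
    """Return min/avg/max of the RTTs, or None if no reply was seen."""
    if not rtts:
        return None
    return {"min": min(rtts), "avg": statistics.mean(rtts), "max": max(rtts)}


def ping_summary(host, count=5):
    """Run ping against host (hypothetical helper) and summarize the RTTs."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
    )
    return summarize(parse_rtts(result.stdout))
```

  Calling ping_summary() with the bridge address from a LAN machine before
and after each reboot would give comparable numbers for both hosts.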

  I could imagine that some piece of HW, such as the NIC, broke during the
crash, but this is not the case. Hold on, the best is yet to come.

  This degradation is not permanent; it follows a predictable pattern.
Here is what I discovered:

  Let me state first the names of the two servers, for clarity's sake: one
is called "villano" (the one that crashed), the other is "rocciamelone" ...
I chose the names after two conspicuous mountains in my valley :-)

   - latency on villano is degraded, so I move all domUs to rocciamelone,
   whose latency is OK
   rocciamelone latency OK, villano KO, domUs net OK

   - shutdown villano
   rocciamelone latency still OK

   - rebooting villano, domUs still on rocciamelone
   villano latency OK
   rocciamelone latency KO !!!!!! useless domUs :-(

   - Move domUs to villano
   villano latency OK, rocciamelone latency still KO (as expected)

   - reboot rocciamelone, domUs still on villano
   rocciamelone latency OK after reboot
   villano latency KO after rocciamelone reboot, so long to domUs

  The pattern is clear and it drives me mad: it seems that every time one
of the two hosts reboots, it ruins the latency of the other and renders it
useless.
  So now I have to keep one host disconnected in order for the other to be
operational. Forget about fancy hot and warm standby, we're out in the
cold.

  Diagnostics: nothing wrong in syslog and xend.log, and ethtool on the NIC
looks OK too.
  LAN and SAN use different NICs and subnets, no issue on the SAN network.
  So I tried some kind of analysis of system behavior, like monitoring
traffic on NICs and vifs ... here something weird happens.

  On the healthy system, I scripted a periodic "netstat" check (every 5 s)
to keep track of exchanged data volumes. Well, it looks like this probe is
rather intrusive (it shouldn't be): while my scripts are running, I
notice a partial degradation of latency, let's say up to around 2-8 ms.
  This degradation is reversible, it disappears as soon as I kill the
scripts.
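
  To illustrate the kind of probe meant here, below is a minimal Python
sketch. It is NOT the script referred to above; it is an assumption-laden
alternative that reads the per-interface byte counters from /proc/net/dev
(Linux only) instead of spawning netstat, which may or may not be less
intrusive. The function names are hypothetical.

```python
import time


def parse_net_dev(text):
    """Map interface name -> (rx_bytes, tx_bytes) from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) < 9:
            continue  # skip anything that is not a counter line
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def byte_deltas(prev, cur):
    """Bytes received/sent per interface between two samples."""
    return {
        iface: (cur[iface][0] - prev[iface][0], cur[iface][1] - prev[iface][1])
        for iface in cur
        if iface in prev
    }


def watch(interval=5.0, path="/proc/net/dev"):
    """Print per-interface byte deltas every `interval` seconds, forever."""
    with open(path) as f:
        prev = parse_net_dev(f.read())
    while True:
        time.sleep(interval)
        with open(path) as f:
            cur = parse_net_dev(f.read())
        for iface, (rx, tx) in sorted(byte_deltas(prev, cur).items()):
            print(f"{iface}: rx +{rx} B, tx +{tx} B")
        prev = cur
```

  Running watch() in Dom0 would list the NICs, xenbr0 and the vifs side by
side; counter wrap-around is not handled in this sketch.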

  Any idea? Anything I could check/troubleshoot?

  Help will be GREATLY appreciated,

            Ezio


-- 
Ezio Ostorero, Catania
Seltz e limone col sale. Arriminatu, non annacatu


_______________________________________________
Xen-users mailing list
Xen-users@lists.xen.org
http://lists.xen.org/xen-users