Hi all!
I've been experimenting with Xen on CentOS 5 and RHAS 5 for a while now with mixed success. I thought I'd describe my latest challenge. I'll be writing this from memory since all the equipment is at work and not reachable from here.
I think I've described this config to the list before, but here it is again: I have 2 x HP DL585 servers, each with 4 dual-core Opterons (non-vmx) and 16GB RAM, configured as Xen servers. These run CentOS 5.1 with the latest updates applied. Both systems attach to an iSCSI target, which is an HP DL385 running ietd and serving SAN-based storage.
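For what it's worth, the target side is a vanilla ietd setup; the relevant part of ietd.conf looks roughly like this (the IQN and backing device here are made up for illustration, not the real ones):

    Target iqn.2008-01.com.example:xen.lun0
        Lun 0 Path=/dev/cciss/c0d1p1,Type=fileio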
Everything runs fine if I do no migration. I was having a "soft lockup" problem which has been solved by installing the latest test kernel from Red Hat. They've adjusted a timer as described in this report (comment 5):
https://bugzilla.redhat.com/show_bug.cgi?id=250994
Anyway, things are pretty stable now, but... if I do a series of migrations, live or not, from one server to the other, eventually I will get the process to fail. This can take up to an hour with the migrations happening every 5 minutes, or it can happen on the first try. The message in the xend.log file says that it is unable to find the device number for the virtualised storage, i.e. sda. In my configuration dom0 connects to the LUNs used by the VMs, so the domU's are not doing iSCSI themselves. I'm passing the "/dev/disk/by-path/iscsixxx:sda1" info in via the xen config file.
If I mount and unmount the same LUN's filesystem on the dom0 over and over again it works every time, so there's no fundamental problem with the iSCSI connection itself (a quick loop for that test is further down). Each server is using 3 gige interfaces: one for normal LAN access, a dedicated network for the iSCSI, and a crossover cable between the third gige interfaces on the servers for the migration channel.
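The mount test is nothing fancy, just a loop along these lines (mount point made up, same abbreviated by-path name as above):

    while true; do
        mount /dev/disk/by-path/iscsixxx:sda1 /mnt/luntest
        ls /mnt/luntest > /dev/null
        umount /mnt/luntest
        sleep 5
    done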
I have xentop running on both dom0's and can tell that it's failed when the VM appears on the target but the memory used by the VM doesn't start incrementing. The end result is that the VM is hung and has to be destroyed and re-created. The error happens immediately after the failed migration is initiated.
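Recovery at that point is just the usual

    xm destroy testvm
    xm create /etc/xen/testvm

(same made-up domain name as in the sketch above).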
I've tried doing a read of the first few blocks of the LUN on the target immediately before initiating the migration; the read succeeds but makes no difference.
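That pre-read is just something like

    dd if=/dev/disk/by-path/iscsixxx:sda1 of=/dev/null bs=512 count=8

run on the target dom0 right before starting the migration.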
I realize this is pretty sketchy information, but I was wondering if others are seeing similar problems. The ability to do reliable migration is basically the prime motivation for us to do virtualization at all; otherwise we'd just run the required services on the main machine.
Anyway, any help would be appreciated.
Brett