Artur Linhart - Linux communication
2007-Nov-15 15:37 UTC
[Xen-users] Problem with a DomU crash after one part of raid1 reported a read error
Hello,
I have the following problem on our server, which runs W2K3SRVx64 DomUs
on Debian Etch under Xen 3.1.0.
The storage configuration is as follows:
omega:~# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sdc2[0] sde2[2](S) sdd2[1]
488287552 blocks [2/2] [UU]
md2 : active raid1 sdc1[0] sde1[2](S) sdd1[1]
96256 blocks [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[1]
488287552 blocks [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[1]
96256 blocks [2/2] [UU]
The arrays md1-md3 are used in the volume group that provides the
LVM-managed logical volumes, which serve as block devices for the virtual
instances.
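For completeness, the layout was created roughly like this (the VG/LV names
below are only placeholders, not the real ones):

omega:~# pvcreate /dev/md1 /dev/md2 /dev/md3
omega:~# vgcreate vg_xen /dev/md1 /dev/md2 /dev/md3
omega:~# lvcreate -L 50G -n w2k3-disk vg_xen

and each logical volume is then handed to its DomU as a physical device in
the domain config file, e.g.:

disk = [ 'phy:/dev/vg_xen/w2k3-disk,hda,w' ]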
Today I ran into a problem with one physical disk in the raid array, which
caused one virtual domain to crash. The relevant output from kern.log:
Nov 15 14:14:19 omega kernel: sd 0:0:1:0: SCSI error: return code 0x08000002
Nov 15 14:14:19 omega kernel: sdb: Current: sense key: Medium Error
Nov 15 14:14:19 omega kernel: Additional sense: Unrecovered read error
Nov 15 14:14:19 omega kernel: Info fld=0x12832f4d
Nov 15 14:14:19 omega kernel: end_request: I/O error, dev sdb, sector 310587213
Nov 15 14:14:19 omega kernel: raid1: sdb2: rescheduling sector 310394432
Nov 15 14:14:19 omega kernel: raid1: sdb2: rescheduling sector 310394440
Nov 15 14:14:24 omega kernel: raid1: sda2: redirecting sector 310394432 to another mirror
Nov 15 14:14:28 omega kernel: raid1: sda2: redirecting sector 310394440 to another mirror
Nov 15 14:14:28 omega kernel: qemu-dm[6305]: segfault at 0000000000000000 rip 0000000000000000 rsp 0000000041000ca8 error 14
Nov 15 14:14:28 omega kernel: xenbr0: port 4(tap0) entering disabled state
Nov 15 14:14:28 omega kernel: device tap0 left promiscuous mode
Nov 15 14:14:28 omega kernel: audit(1195132468.260:16): dev=tap0 prom=0 old_prom=256 auid=4294967295
Nov 15 14:14:28 omega kernel: xenbr0: port 4(tap0) entering disabled state
The question is: even if the disk /dev/sdb fails, why does the virtual
instance die with a segfault?
In xend.log there is nothing logged about this problem...
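Perhaps the device-model log says more - as far as I know, qemu-dm under
Xen 3.x writes its own log below /var/log/xen/ (the exact file name depends
on the domain name), so something like this should show it:

omega:~# ls /var/log/xen/
omega:~# tail -50 /var/log/xen/qemu-dm.<domain-name>.log

I will check there as well...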
The instance was still reported by xm list and xm top, but it used 0 CPU
time, there was no way to connect to it over VNC, and it was also
unreachable by ping...
After xm shutdown it took some time, but then it was possible to destroy
the instance... After xm create the instance continued to work as
usual...
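For the record, the recovery was essentially this sequence (the domain and
config file names here are placeholders):

omega:~# xm shutdown w2k3srv
omega:~# xm destroy w2k3srv        (only after the shutdown hung for a while)
omega:~# xm create /etc/xen/w2k3srv.cfg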
What can I do to make this run more stably? I think there may have been a
read timeout on the failing device, so that the instance segfault occurred
before the raid subsystem could fetch the data from the disk mirror... I
always thought the virtual instance should survive such a problem when
running from an md device...
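As a workaround, I suppose the flaky disk could also be failed out of the
arrays manually, so that md stops touching it at all - according to the
mdstat above, sdb is a member of md0 and md1:

omega:~# mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
omega:~# mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2
omega:~# smartctl -a /dev/sdb        (to see the SMART state of the disk)

But that would not explain the qemu-dm segfault, of course...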
Any help or advice is appreciated...
With best regards
Archie