Artur Linhart - Linux communication
2007-Nov-15 15:37 UTC
[Xen-users] Problem with a DomU crash after one part of a RAID1 reported a read error
Hello,

I have the following problem on our server, which runs W2K3SRVx64 DomUs on Debian Etch under Xen 3.1.0. The storage configuration is:

omega:~# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sdc2[0] sde2[2](S) sdd2[1]
      488287552 blocks [2/2] [UU]
md2 : active raid1 sdc1[0] sde1[2](S) sdd1[1]
      96256 blocks [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[1]
      488287552 blocks [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[1]
      96256 blocks [2/2] [UU]

The arrays md1-md3 are used in a volume group, and the LVM-managed logical volumes on it serve as block devices for the virtual instances (a sketch of how such a stack is put together is in the P.S. below).

Today I ran into a problem with one physical disk in the RAID array, which caused one virtual domain to crash. The relevant output from kern.log looks like this:

Nov 15 14:14:19 omega kernel: sd 0:0:1:0: SCSI error: return code 0x08000002
Nov 15 14:14:19 omega kernel: sdb: Current: sense key: Medium Error
Nov 15 14:14:19 omega kernel: Additional sense: Unrecovered read error
Nov 15 14:14:19 omega kernel: Info fld=0x12832f4d
Nov 15 14:14:19 omega kernel: end_request: I/O error, dev sdb, sector 310587213
Nov 15 14:14:19 omega kernel: raid1: sdb2: rescheduling sector 310394432
Nov 15 14:14:19 omega kernel: raid1: sdb2: rescheduling sector 310394440
Nov 15 14:14:24 omega kernel: raid1: sda2: redirecting sector 310394432 to another mirror
Nov 15 14:14:28 omega kernel: raid1: sda2: redirecting sector 310394440 to another mirror
Nov 15 14:14:28 omega kernel: qemu-dm[6305]: segfault at 0000000000000000 rip 0000000000000000 rsp 0000000041000ca8 error 14
Nov 15 14:14:28 omega kernel: xenbr0: port 4(tap0) entering disabled state
Nov 15 14:14:28 omega kernel: device tap0 left promiscuous mode
Nov 15 14:14:28 omega kernel: audit(1195132468.260:16): dev=tap0 prom=0 old_prom=256 auid=4294967295
Nov 15 14:14:28 omega kernel: xenbr0: port 4(tap0) entering disabled state

The question is: even if the disk /dev/sdb fails, why does the virtual instance die with a segfault? Nothing about the problem is logged in xend.log. The instance was still reported by xm list and xm top, but it used 0 CPU time, it was impossible to connect to it over VNC, and it was unreachable by ping as well. An xm shutdown took some time, but afterwards the domain could be destroyed, and after xm create the instance continued to work as usual (the exact commands are sketched in the P.S. below).

What can I do to make this setup more stable? My theory is that a read timeout occurred during an operation on the failing device, and the instance segfaulted before the RAID subsystem could fetch the data from the disk mirror. I always thought a virtual instance should survive such a failure when it runs from an md device.

Any help or advice is appreciated.

With best regards
Archie
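P.S. For anyone reconstructing the setup: this is roughly how the stack is put together. It is only a sketch; the volume group name (vg0), the logical volume name (w2k3-disk), the LV size, and the domain config path (/etc/xen/w2k3) are placeholders, not necessarily my real names.

# mirrors as in /proc/mdstat above; md2/md3 carry sde partitions as hot spares
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
mdadm --create /dev/md3 --level=1 --raid-devices=2 \
      --spare-devices=1 /dev/sdc2 /dev/sdd2 /dev/sde2

# LVM on top of the md devices
pvcreate /dev/md1 /dev/md2 /dev/md3
vgcreate vg0 /dev/md1 /dev/md2 /dev/md3
lvcreate -L 40G -n w2k3-disk vg0

# the LV is handed to the HVM guest in /etc/xen/w2k3:
#   disk = [ 'phy:/dev/vg0/w2k3-disk,hda,w' ]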
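The state of the affected array can be checked like this (again only a sketch; smartctl needs the smartmontools package installed):

cat /proc/mdstat                    # members should still show [2/2] [UU]
mdadm --detail /dev/md1             # both members should be "active sync"
smartctl -a /dev/sdb                # look for pending/reallocated sectors
grep sdb /var/log/kern.log | tail   # any further medium errors?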
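And these are the recovery commands described above ("w2k3" again stands for the real domain name):

xm list                   # domain still listed, but stuck at 0 CPU time
xm shutdown w2k3          # hung for quite a while
xm destroy w2k3           # this finally removed the dead domain
xm create /etc/xen/w2k3   # afterwards the guest ran normally again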