We have a CentOS 6.2 server with KVM. That server hosts 2 virtual machines, both running CentOS 6.2, too.

Regularly, one or both of the virtual machines go into the "paused" state without apparent reason. On resume, I get messages like the following in /var/log/messages:

Feb 28 21:50:45 achernar fcoemon: Failed to connect to lldpad
Feb 29 08:23:56 achernar kernel: sd 0:0:0:0: [sda] Unhandled error code
Feb 29 08:23:56 achernar kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Feb 29 08:23:56 achernar kernel: sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 06 db 70 78 00 00 38 00
Feb 29 08:23:56 achernar kernel: end_request: I/O error, dev sda, sector 115044472
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252047
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252048
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252049
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252050
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252051
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252052
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252053
Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0
Feb 29 08:23:57 achernar fcoemon: error 111 Connection refused

I could not find any relevant message on the physical host, neither in /var/log/messages nor in /var/log/libvirt.

We do have an almost identical server, with the same hardware and the same software, which does not have this problem.

How could I proceed to better diagnose the cause of the trouble?

Regards,
--
Peter Hopfgartner
web : www.r3-gis.com
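
One way to read the log above more closely is to decode the CDB bytes in the "CDB: Write(10)" line. The following is a minimal sketch, not part of the original post; the function name decode_write10_cdb is hypothetical. It relies only on the standard WRITE(10) layout: opcode 0x2a in byte 0, a big-endian start LBA in bytes 2-5, and a big-endian transfer length in bytes 7-8.

```python
from typing import Tuple


def decode_write10_cdb(cdb_hex: str) -> Tuple[int, int]:
    """Return (start_lba, sector_count) for a SCSI WRITE(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is ignored
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    start_lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    sector_count = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return start_lba, sector_count


# CDB bytes copied from the log line in the original message.
lba, count = decode_write10_cdb("2a 00 06 db 70 78 00 00 38 00")
print(lba, count)  # prints: 115044472 56
```

The start LBA equals the sector reported in the end_request line. Assuming 512-byte sectors and 4 KiB blocks (an assumption, neither is stated in the thread), 56 sectors are seven blocks, which would match the seven "Buffer I/O error" lines for logical blocks 14252047 to 14252053. So the whole excerpt appears to describe a single write request that timed out.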
Hi Peter,

I see from your log messages that the server has lost its connection to the storage device it uses. I suggest examining the connection between the server and the storage. CMIIW.

Rgds,
Wahyu

-----Original Message-----
From: Peter Hopfgartner <peter.hopfgartner at r3-gis.com>
Sender: centos-virt-bounces at centos.org
Date: Wed, 29 Feb 2012 08:53:09
To: Discussion about the virtualization on CentOS <centos-virt at centos.org>
Reply-To: Discussion about the virtualization on CentOS <centos-virt at centos.org>
Subject: [CentOS-virt] Guests pausing suddenly
The problem got slightly better when I upgraded all kernels, on host and guest: the "MTBF" went from 3-4 days to approx. 50. Still, the problem is not solved yet.

A maybe stupid question: if the kernel in the guest sees an I/O error on sda, could this be a real error on the physical disk, even if there is nothing in the physical host's log files, or is this more of a software problem?

As the next step, I'll try to update the physical server's firmware.

Any suggestion on this topic is welcome, even more than before.

Regards,
Peter

On 02/29/2012 08:53 AM, Peter Hopfgartner wrote:
> We have a CentOS 6.2 server with KVM. That server hosts 2 virtual
> machines, both with Centos 6.2, too.
>
> Regularly, one or both of the virtual machines pass to state "pause"
> without apparent reason.
> [...]
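
To compare the guest's errors with whatever the host was doing at the same moment, it may help to reduce the guest's /var/log/messages to a list of timestamps and devices. The following is a rough sketch under stated assumptions: the names summarize_io_errors, LINE and DEVICE are hypothetical, and the regular expressions assume the syslog format shown in the log excerpt from the first message.

```python
import re
from collections import Counter

# Kernel lines that mention an I/O error, e.g.
# "Feb 29 08:23:56 achernar kernel: end_request: I/O error, dev sda, sector 115044472"
LINE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+kernel:\s+(.*I/O error.*)$")
# Device name as it appears in the three message variants seen in the thread.
DEVICE = re.compile(r"I/O error(?:, dev| on device| on)\s+([\w-]+)")


def summarize_io_errors(lines):
    """Count kernel I/O error lines per (timestamp, device)."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # not a kernel I/O error line
        dev = DEVICE.search(match.group(3))
        counts[(match.group(1), dev.group(1) if dev else "?")] += 1
    return counts


# A few lines copied from the log in the original message.
sample = [
    "Feb 28 21:50:45 achernar fcoemon: Failed to connect to lldpad",
    "Feb 29 08:23:56 achernar kernel: sd 0:0:0:0: [sda] Unhandled error code",
    "Feb 29 08:23:56 achernar kernel: end_request: I/O error, dev sda, sector 115044472",
    "Feb 29 08:23:56 achernar kernel: Buffer I/O error on device dm-0, logical block 14252047",
    "Feb 29 08:23:56 achernar kernel: lost page write due to I/O error on dm-0",
]
for (stamp, device), n in sorted(summarize_io_errors(sample).items()):
    print(stamp, device, n)
```

The resulting timestamps can then be looked up in the host's /var/log/messages and the files under /var/log/libvirt. Note that the guest only logs after it has been resumed, so the timestamps mark when the error was reported in the guest, not necessarily when the guest was paused.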