Hello,

I have 3 physical servers with some virtual machines on each.
When I look at dmesg on one of them I get these errors:

*************************************************************************
[34783.559174] hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
[34783.559248] hda: task_in_intr: error=0x04 { AbortedCommand }
[34783.559289] ide: failed opcode was: 0xec
[121232.732355] hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
[121232.732413] hda: task_in_intr: error=0x04 { AbortedCommand }
[121232.732455] ide: failed opcode was: 0xec
[207708.187565] hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
[207708.187623] hda: task_in_intr: error=0x04 { AbortedCommand }
[207708.187664] ide: failed opcode was: 0xec
[294224.164969] hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
[294224.165029] hda: task_in_intr: error=0x04 { AbortedCommand }
[294224.165075] ide: failed opcode was: 0xec
[380705.378232] hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
[380705.378232] hda: task_in_intr: error=0x04 { AbortedCommand }
[380705.378232] ide: failed opcode was: 0xec
[467193.505658] hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
[467193.505717] hda: task_in_intr: error=0x04 { AbortedCommand }
[467193.505758] ide: failed opcode was: 0xec
[553683.657031] hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
[553683.657091] hda: task_in_intr: error=0x04 { AbortedCommand }
[553683.657132] ide: failed opcode was: 0xec
[640176.673218] hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
[640176.673218] hda: task_in_intr: error=0x04 { AbortedCommand }
[640176.673218] ide: failed opcode was: 0xec
[726657.593721] hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
[726657.593721] hda: task_in_intr: error=0x04 { AbortedCommand }
[726657.593721] ide: failed opcode was: 0xec
*************************************************************************

You'll see the full dmesg output in the attached file.
I found some comments about these errors through Google saying that they
mean the disk is dying. But this is a relatively recent server (1 year old)
with 6 disks in RAID 10.

Since I put that server into production, it has crashed 3 times. It responds
to pings but there is no SSH access (neither to the Xen domain nor to the
virtual machines). Some services on the virtual machines continue to
respond, others don't. The only solution is a hard reboot.

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
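
For reference, the values in the log above can be decoded by hand. The sketch below is only an illustration, not a diagnosis: it maps the status register value to the bit names defined by the ATA specification (the same labels the kernel prints) and computes the gap between consecutive error groups from the timestamps in the excerpt. The names `decode_status` and `intervals_in_hours` are made up for this example. Opcode 0xec is the ATA IDENTIFY DEVICE command, and error=0x04 is the "command aborted" bit.

```python
# Status register bits as defined by the ATA specification; the labels
# mirror the ones the kernel prints in the log lines above.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(value):
    """Return the names of the bits set in an ATA status register value."""
    return [name for mask, name in STATUS_BITS if value & mask]


# Timestamps (seconds since boot) of the first line of each error group,
# copied from the dmesg excerpt.
TIMESTAMPS = [
    34783.559174, 121232.732355, 207708.187565, 294224.164969,
    380705.378232, 467193.505658, 553683.657031, 640176.673218,
    726657.593721,
]


def intervals_in_hours(timestamps):
    """Return the gaps between consecutive timestamps, in hours."""
    return [(b - a) / 3600 for a, b in zip(timestamps, timestamps[1:])]


if __name__ == "__main__":
    print(decode_status(0x51))
    for gap in intervals_in_hours(TIMESTAMPS):
        print(f"{gap:.2f} h")
```

Every gap comes out at slightly more than 24 hours. That regularity might mean something on the host issues IDENTIFY DEVICE to hda once a day and the device rejects it, but the excerpt alone does not show what sends the command.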
I forgot to say: I'm running Debian Lenny 64-bit and using the Xen shipped
with the distro.

On 05/25/2010 08:43 AM, Nicolas Michel wrote:
> [...]
On Tue, May 25, 2010 at 1:43 PM, Nicolas Michel <nicolas.michel@lemail.be> wrote:
> I found some comments about these errors through Google saying that they
> mean the disk is dying. But this is a relatively recent server (1 year old)
> with 6 disks in RAID 10.

That doesn't automatically guarantee it will be error-free.

> Since I put that server into production, it has crashed 3 times. It responds
> to pings but there is no SSH access (neither to the Xen domain nor to the
> virtual machines). Some services on the virtual machines continue to
> respond, others don't. The only solution is a hard reboot.

Do the other working machines have a similar config (hardware, OS,
kernel, etc.)? If yes, then it's a hardware problem. No way around it.

There are cases where it's not actually a hardware problem but a kernel
problem (like when using openSUSE 11.2 with an HP Smart Array). In those
cases I'd try a live CD/DVD of another distro first. This does not
seem to be the case with your setup though.

--
Fajar
I know RAID doesn't guarantee there are no errors.

My two other physical machines, which each host a Xen domain controller,
are not the same hardware at all but run the same OS (Debian Lenny 64-bit).
They don't have these errors and have never crashed.

Do you think I should try a more up-to-date kernel?

On 05/25/2010 09:04 AM, Fajar A. Nugraha wrote:
> [...]
On Tue, May 25, 2010 at 4:05 PM, Nicolas Michel <nicolas.michel@lemail.be> wrote:
> I know RAID doesn't guarantee there are no errors.
>
> My two other physical machines, which each host a Xen domain controller,
> are not the same hardware at all but run the same OS (Debian Lenny 64-bit).
> They don't have these errors and have never crashed.
>
> Do you think I should try a more up-to-date kernel?

One thing to confirm first. Is hda the first disk? AFAIK Lenny should
come with kernel 2.6.26, and newer kernels use sda instead of hda.

If it is the first disk, I'd start by picking a live CD/DVD of a
distro with a recent kernel. Ubuntu Lucid would do. Boot it, and do
something like

    dd if=/dev/sda of=/dev/null bs=16M

... which basically reads all the disk contents. See whether it can
complete without errors. If yes, then I'd try to compile a newer kernel
for this server. Possibly 2.6.29 or 2.6.31 (since 2.6.32 needs Xen 4.0
to run correctly). If it shows read errors though, you'd know for sure
that it's a hardware problem.

--
Fajar
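
The read test described above can also be sketched in Python, which makes it easy to see how far the read got before a failure. This is a minimal sketch under a few assumptions: the function name `read_test` is made up for this example, the 16 MiB chunk size simply mirrors `bs=16M`, and like `dd` without extra options it stops at the first read error.

```python
CHUNK_SIZE = 16 * 1024 * 1024  # 16 MiB, mirroring bs=16M in the dd command


def read_test(path, chunk_size=CHUNK_SIZE):
    """Read `path` sequentially from start to end, discarding the data.

    Returns a tuple (bytes_read, error). `error` is None when the end of
    the device or file was reached, otherwise it is the OSError raised by
    open() or by the failing read.
    """
    bytes_read = 0
    try:
        # buffering=0 opens the file unbuffered, so each read() call goes
        # straight to the operating system.
        with open(path, "rb", buffering=0) as dev:
            while True:
                chunk = dev.read(chunk_size)
                if not chunk:  # empty result means end of device/file
                    return bytes_read, None
                bytes_read += len(chunk)
    except OSError as exc:
        return bytes_read, exc
```

Booted from the live system and run as root, `read_test("/dev/sda")` returning an error of `None` corresponds to `dd` completing without errors; a non-`None` error together with `bytes_read` gives the approximate offset at which reading failed.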
Thank you for your help.

I just looked at my exact kernel version: 2.6.26-2-xen-amd64

It is Lenny:

~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 5.0.2 (lenny)
Release:        5.0.2
Codename:       lenny

I'll try your test, but I don't know when because this server is in
production at the moment. Maybe this weekend.

Thank you,

On 05/25/2010 12:26 PM, Fajar A. Nugraha wrote:
> [...]