David Della Vecchia
2011-Oct-18 17:04 UTC
[Xen-users] Severe megasas_raid issues when using Xen dom0 linux kernels
I've tried Debian stable and testing, and CentOS 5 and 6, with Xen 3.1 through 4.1 (about 5 different versions in between). I'm currently running the Xen 4.1.1 release on CentOS 6 with M.A.Young's centos6 xen dom0 kernel. For some reason the RAID array freaks out and the entire virtual device that the hardware RAID array provides switches to read-only mode. I've tried both RAID 0 and RAID 1 (two 1TB SCSI drives). I've had this issue in every Xen install I've tried on this box, no matter what kernel version (as new as 3.0.1 in Debian wheezy) or Xen version (I compiled and installed the unstable branch to test) I use. The server was running stable and fine for about a week this time before this:

[root@gibson ~]# df -h
-bash: /bin/df: Input/output error
[root@gibson ~]# w
-bash: /usr/bin/w: Input/output error
[root@gibson ~]# modinfo megasas_raid
-bash: /sbin/modinfo: Input/output error

Part of /var/log/messages:

Oct 17 13:21:09 gibson kernel: megasas: [ 0]waiting for 1 commands to complete
Oct 17 13:21:10 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:10 gibson kernel: megasas: reset successful
Oct 17 13:21:20 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=0 retries=0
Oct 17 13:21:20 gibson kernel: megasas: [ 0]waiting for 1 commands to complete
Oct 17 13:21:21 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:21 gibson kernel: megasas: reset successful
Oct 17 13:21:21 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=2a retries=0
Oct 17 13:21:21 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:21 gibson kernel: megasas: reset successful
Oct 17 13:21:41 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=0 retries=0
Oct 17 13:21:41 gibson kernel: megasas: [ 0]waiting for 1 commands to complete
Oct 17 13:21:42 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:42 gibson kernel: megasas: reset successful
Oct 17 13:21:42 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=2a retries=0
Oct 17 13:21:42 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:42 gibson kernel: megasas: reset successful
Oct 17 13:22:02 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=0 retries=0
Oct 17 13:22:02 gibson kernel: megasas: [ 0]waiting for 1 commands to complete

[root@gibson ~]# ls -al /bin/
ls: cannot access /bin/ntfs-3g.secaudit: Input/output error
ls: cannot access /bin/ntfstruncate: Input/output error
ls: cannot access /bin/ntfsdump_logfile: Input/output error
ls: cannot access /bin/ntfsls: Input/output error
ls: cannot access /bin/ntfsdecrypt: Input/output error
ls: cannot access /bin/ntfs-3g.usermap: Input/output error
ls: cannot access /bin/ntfsmount: Input/output error
ls: cannot access /bin/ntfsfix: Input/output error
ls: cannot access /bin/ntfscluster: Input/output error
total 8192
dr-xr-xr-x. 2 root root 4096 Oct 15 14:49 .
drwxr-xr-x. 29 root root 4096 Oct 17 12:34 ..
-rwxr-xr-x. 1 root root 123 Nov 10 2010 alsaunmute
-rwxr-xr-x 1 root root 27808 May 30 10:55 arch
lrwxrwxrwx. 1 root root 4 Oct 13 10:36 awk -> gawk
-rwxr-xr-x 1 root root 26264 May 30 10:55 basename
-rwxr-xr-x 1 root root 943248 May 30 11:46 bash
-rwxr-xr-x 1 root root 51344 May 30 10:55 cat
-rwxr-xr-x 1 root root 12200 Jun 25 05:02 cgclassify
-rwxr-xr-x 1 root root 12352 Jun 25 05:02 cgcreate
-rwxr-xr-x 1 root root 11528 Jun 25 05:02 cgdelete
-rwsr-xr-x 1 root root 12136 Jun 25 05:02 cgexec
-rwxr-xr-x 1 root root 15760 Jun 25 05:02 cgget
-rwxr-xr-x 1 root root 13160 Jun 25 05:02 cgset
-rwxr-xr-x 1 root root 55472 May 30 10:55 chgrp
-rwxr-xr-x 1 root root 52472 May 30 10:55 chmod
-rwxr-xr-x 1 root root 57496 May 30 10:55 chown
-rwxr-xr-x 1 root root 122344 May 30 10:55 cp
-rwxr-xr-x 1 root root 136096 Nov 10 2010 cpio
lrwxrwxrwx. 1 root root 4 Oct 13 11:00 csh -> tcsh
-rwxr-xr-x 1 root root 45472 May 30 10:55 cut
-rwxr-xr-x 1 root root 109896 Aug 18 2010 dash
-rwxr-xr-x 1 root root 59552 May 30 10:55 date
-rwxr-xr-x 1 root root 12552 Jun 25 06:47 dbus-cleanup-sockets
-rwxr-xr-x. 1 root root 339048 Jun 25 06:47 dbus-daemon
-rwxr-xr-x 1 root root 18464 Jun 25 06:47 dbus-monitor
-rwxr-xr-x 1 root root 22376 Jun 25 06:47 dbus-send
-rwxr-xr-x 1 root root 10912 Jun 25 06:47 dbus-uuidgen
-rwxr-xr-x 1 root root 54040 May 30 10:55 dd
-rwxr-xr-x 1 root root 70256 May 30 10:55 df
-rwxr-xr-x 1 root root 9896 Jun 25 02:46 dmesg
lrwxrwxrwx. 1 root root 8 Oct 13 10:36 dnsdomainname -> hostname
lrwxrwxrwx. 1 root root 8 Oct 13 10:36 domainname -> hostname
-rwxr-xr-x 1 root root 81120 Nov 11 2010 dumpkeys
-rwxr-xr-x 1 root root 27648 May 30 10:55 echo
-rwxr-xr-x 2 root root 53352 Nov 11 2010 ed
-rwxr-xr-x 1 root root 106528 Aug 25 2010 egrep
-rwxr-xr-x 1 root root 26368 May 30 10:55 env
lrwxrwxrwx. 1 root root 2 Oct 13 10:59 ex -> vi
-rwxr-xr-x 1 root root 24592 May 30 10:55 false
-rwxr-xr-x 1 root root 71328 Aug 25 2010 fgrep
-rwxr-xr-x 1 root root 238640 Nov 11 2010 find
-rwxr-xr-x 1 root root 382456 Nov 11 2010 gawk
-rwxr-xr-x 1 root root 33416 Nov 11 2010 gettext
-rwxr-xr-x 1 root root 110160 Aug 25 2010 grep
lrwxrwxrwx. 1 root root 3 Oct 13 10:36 gtar -> tar
-rwxr-xr-x. 1 root root 61 Nov 11 2010 gunzip
-rwxr-xr-x 1 root root 68544 Nov 11 2010 gzip
-rwxr-xr-x 1 root root 16192 Aug 24 2010 hostname
-rwxr-xr-x 1 root root 14872 Jun 25 00:09 ipcalc
lrwxrwxrwx. 1 root root 20 Oct 13 10:36 iptables-xml -> /sbin/iptables-multi
-rwxr-xr-x 1 root root 11248 Nov 11 2010 kbd_mode
-rwxr-xr-x 1 root root 24648 Aug 22 2010 keyctl
-rwxr-xr-x 1 root root 15128 Jun 25 02:46 kill
-rwxr-xr-x 1 root root 26256 May 30 10:55 link
-rwxr-xr-x 1 root root 49568 May 30 10:55 ln
-rwxr-xr-x 1 root root 112136 Nov 11 2010 loadkeys
-rwxr-xr-x 1 root root 30992 Jun 25 02:46 login
-rwxr-xr-x 1 root root 58368 Sep 12 13:32 lowntfs-3g
-rwxr-xr-x 1 root root 111744 May 30 10:55 ls
-rwxr-xr-x 1 root root 14008 Jun 25 05:02 lscgroup
-rwxr-xr-x 1 root root 12488 Jun 25 05:02 lssubsys
lrwxrwxrwx. 1 root root 5 Oct 13 10:37 mail -> mailx
-rwxr-xr-x 1 root root 390360 Aug 22 2010 mailx
-rwxr-xr-x 1 root root 48544 May 30 10:55 mkdir
-rwxr-xr-x 1 root root 32352 May 30 10:55 mknod
-rwxr-xr-x 1 root root 37352 May 30 10:55 mktemp
-rwxr-xr-x 1 root root 41144 Jun 25 02:46 more
-rwsr-xr-x. 1 root root 74712 Jun 25 02:46 mount
-rwxr-xr-x 1 root root 9800 Aug 24 2010 mountpoint
-rwxr-xr-x 1 root root 111536 May 30 10:55 mv
-rwxr-xr-x 1 root root 177360 Nov 12 2010 nano
-rwxr-xr-x 1 root root 127816 Aug 24 2010 netstat
-rwxr-xr-x 1 root root 28816 May 30 10:55 nice
lrwxrwxrwx. 1 root root 8 Oct 13 10:36 nisdomainname -> hostname
-rwxr-xr-x 1 root root 53576 Sep 12 13:32 ntfs-3g
-rwxr-xr-x 1 root root 11016 Sep 12 13:32 ntfs-3g.probe
-?????????? ? ? ? ? ? ntfs-3g.secaudit
-?????????? ? ? ? ? ? ntfs-3g.usermap
-rwxr-xr-x 1 root root 29896 Sep 12 13:32 ntfscat
-rwxr-xr-x 1 root root 32992 Sep 12 13:32 ntfsck
-?????????? ? ? ? ? ? ntfscluster
-rwxr-xr-x 1 root root 36320 Sep 12 13:32 ntfscmp
-?????????? ? ? ? ? ? ntfsdecrypt
-?????????? ? ? ? ? ? ntfsdump_logfile
-?????????? ? ? ? ? ? ntfsfix
-rwxr-xr-x 1 root root 57240 Sep 12 13:32 ntfsinfo
-?????????? ? ? ? ? ? ntfsls
-rwxr-xr-x 1 root root 30448 Sep 12 13:32 ntfsmftalloc
l?????????? ? ? ? ? ? ntfsmount
-rwxr-xr-x 1 root root 34000 Sep 12 13:32 ntfsmove
-?????????? ? ? ? ? ? ntfstruncate
-rwxr-xr-x 1 root root 42240 Sep 12 13:32 ntfswipe
-rwsr-xr-x 1 root root 41432 Nov 11 2010 ping
-rwsr-xr-x 1 root root 36256 Nov 11 2010 ping6
-rwxr-xr-x 1 root root 35640 Oct 31 2010 plymouth
-rwxr-xr-x 1 root root 86776 Nov 11 2010 ps
-rwxr-xr-x 1 root root 31656 May 30 10:55 pwd
-rwxr-xr-x 1 root root 11528 Jun 25 02:46 raw
-rwxr-xr-x 1 root root 40056 May 30 10:55 readlink
-rwxr-xr-x 2 root root 53352 Nov 11 2010 red
-rwxr-xr-x. 1 root root 576 Apr 16 2008 redhat_lsb_init
-rwxr-xr-x 1 root root 57504 May 30 10:55 rm
-rwxr-xr-x 1 root root 40544 May 30 10:55 rmdir
lrwxrwxrwx. 1 root root 4 Oct 13 10:39 rnano -> nano
-rwxr-xr-x 1 root root 29904 Nov 11 2010 rpm
lrwxrwxrwx. 1 root root 2 Oct 13 10:59 rvi -> vi
lrwxrwxrwx. 1 root root 2 Oct 13 10:59 rview -> vi
-rwxr-xr-x 1 root root 72248 Aug 22 2010 sed
-rwxr-xr-x 1 root root 42312 Nov 11 2010 setfont
-rwxr-xr-x 1 root root 23600 Aug 22 2010 setserial
lrwxrwxrwx. 1 root root 4 Oct 13 10:36 sh -> bash
-rwxr-xr-x 1 root root 27880 May 30 10:55 sleep
-rwxr-xr-x 1 root root 99000 May 30 10:55 sort
-rwxr-xr-x 1 root root 65864 May 30 10:55 stty
-rwsr-xr-x 1 root root 36440 May 30 10:55 su
-rwxr-xr-x 1 root root 25464 May 30 10:55 sync
-rwxr-xr-x 1 root root 384920 Nov 11 2010 tar
-rwxr-xr-x 1 root root 14808 Jun 25 02:46 taskset
-rwxr-xr-x 1 root root 391288 Jun 25 02:05 tcsh
-rwxr-xr-x 1 root root 51952 May 30 10:55 touch
-rwxr-xr-x. 1 root root 11392 Nov 11 2010 tracepath
-rwxr-xr-x. 1 root root 12304 Nov 11 2010 tracepath6
-rwxr-xr-x 1 root root 57384 Nov 11 2010 traceroute
lrwxrwxrwx. 1 root root 10 Oct 13 10:39 traceroute6 -> traceroute
-rwxr-xr-x 1 root root 24592 May 30 10:55 true
-rwsr-xr-x. 1 root root 49280 Jun 25 02:46 umount
-rwxr-xr-x 1 root root 27808 May 30 10:55 uname
-rwxr-xr-x. 1 root root 2555 Nov 11 2010 unicode_start
-rwxr-xr-x. 1 root root 363 Nov 11 2010 unicode_stop
-rwxr-xr-x 1 root root 26264 May 30 10:55 unlink
-rwxr-xr-x 1 root root 10208 Jun 25 00:09 usleep
-rwxr-xr-x 1 root root 771800 Jun 25 04:43 vi
lrwxrwxrwx. 1 root root 2 Oct 13 10:59 view -> vi
lrwxrwxrwx. 1 root root 8 Oct 13 10:36 ypdomainname -> hostname
-rwxr-xr-x. 1 root root 62 Nov 11 2010 zcat

Here is the rough partition information for my main drive:

/boot primary ext3 1gb /dev/sda1
/dev/sda2 extended lvm pv 925gb
vg_gibson lvm-volumegroup 925gb
/ lv_root ext3 36gb
swap lv_swap 2gb

Server Specs:

Dell PowerEdge R710
32GB ECC Unbuffered RAM
2x Intel Xeon Quad Core HT 2.3GHz (16 "cores" total)
2x 1TB WD SCSI Drives in RAID-1

Drive Nitty Gritty:
Product ID: WDC WD1002FBYS-0
Revision: 0C06
Size: 953344MB

Here's some more information about the RAID controller, also obtained from the RAID controller config utility:

Product Name: PERC 6/i
Package: 6.2.0-0013
FW Version: 1.22.02-0612
BIOS Version: 2.04.00
CtrlR Version: 1.02-015B
Boot Block: 1.00.00.01-0011

Application & OS Specs:
CentOS 6 w/2.6.32-131 M.A.Young centos6 xen dom0 kernel

Diagnostic Attempts and Results:

I've done a consistency check on the RAID array and everything comes back as clean and optimal. I've run checks for bad blocks, partition table corruption, MBR corruption, everything I can think of. It all comes back as clean and working fine. Because of these results I have not been able to force my dedicated hosting company to replace any of the hardware. They are upgrading the RAID controller software, as it's about 1 minor version out of date, just to see if that could be the issue. I'll report back if that mysteriously fixes it, but I'm not holding my breath.

I've read somewhere that the 2.6.x kernels have an old version of the megaraid_sas module that will cause problems, but the version included in the M.A.Young centos6 kernel is 5.3, which is far beyond the 4.3 version that article recommends upgrading to, so I'm really at a loss. Besides the version being so new, the problem described in that article (the kernel not finding the drive at all on boot) is not the issue I'm having. It just freaks out randomly (I'm sure it's not really random, it just appears that way), the OS switches to read-only mode, and the only way to reboot is basically to push the button on the front of the box.

Please, if anyone can direct me towards a solution, or at least down a path I have yet to try, I would greatly appreciate it. I'm at my wits' end. I've been fighting this mysterious monster for over a month now and it always seems to strike right before I'm about to go live with my services (the first time it happened was right after I started adding customers to the box).

Thanks in advance,
David

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
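
The switch to read-only described above can be observed from the running system in /proc/mounts, where the fourth field of each line holds the mount options. What follows is a minimal Python sketch, not something from the original post, that lists the mount points currently carrying the "ro" option. The function names are hypothetical, and the device name in the sample comment (/dev/mapper/vg_gibson-lv_root) is only a guess based on the volume group and logical volume names given in the post.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'.

    mounts_text is the content of /proc/mounts, one mount per line:
        device mount_point fstype options dump pass
    e.g. (illustrative only):
        /dev/mapper/vg_gibson-lv_root / ext3 ro,relatime 0 0
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            result.append(mount_point)
    return result


def current_read_only_mounts(path="/proc/mounts"):
    """Read the live mount table and return its read-only mount points."""
    with open(path) as handle:
        return read_only_mounts(handle.read())
```

Some pseudo-filesystems are mounted read-only on purpose, so a hit on / or /boot is the interesting case here. A script like this only helps if it is already running, or cached in memory, when the I/O errors begin, since new binaries can no longer be loaded from the affected disk.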
Steven Timm
2011-Oct-18 17:18 UTC
Re: [Xen-users] Severe megasas_raid issues when using Xen dom0 linux kernels
Have you tried using the MegaRAID monitor to see if you can diagnose a hardware problem with the RAID? There is one you can download and run on the Linux dom0, and there should be a monitor you can get to from the BIOS as well. Those error messages look very much like an actual hardware fault on the RAID array.

I have a lot of megasas RAID controllers under both SL5 and SL6 and have used them as Xen dom0 and KVM VM hosts without problems, with several different versions of Xen.

Steve Timm

-- 
------------------------------------------------------------------
Steven C. Timm, Ph.D (630) 840-8525
timm@fnal.gov http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Group Leader.
Lead of FermiCloud project.
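
The /var/log/messages excerpt posted earlier in this thread shows the driver going through a reset roughly every 20 seconds. To see how long such a loop ran and which commands were being retried, the RESET lines can be tallied with a short script. This is a minimal Python sketch, not something from the thread. It assumes the cmd= field is the SCSI opcode printed in hex (in which case 0x00 would be TEST UNIT READY and 0x2a would be WRITE(10)), and that syslog omits the year, so the caller supplies one. The function names are hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Oct 17 13:21:20 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=2a retries=0
# "reset successful" lines are lowercase and are deliberately not matched.
RESET_RE = re.compile(
    r"(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:.*"
    r"megasas: RESET\s+-?\d+\s+cmd=(?P<cmd>[0-9a-fA-F]+)\s+retries=[0-9a-fA-F]+"
)


def parse_resets(lines, year=2011):
    """Return a list of (timestamp, opcode) for every megasas RESET line."""
    events = []
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            stamp = datetime.strptime(
                f"{year} {match.group('stamp')}", "%Y %b %d %H:%M:%S"
            )
            events.append((stamp, int(match.group("cmd"), 16)))
    return events


def summarize(events):
    """Count resets per opcode and return (counts, seconds from first to last)."""
    counts = {}
    for _, opcode in events:
        counts[opcode] = counts.get(opcode, 0) + 1
    span = (events[-1][0] - events[0][0]).total_seconds() if events else 0.0
    return counts, span
```

Feeding it the lines of a saved log (for example, open("messages").readlines() on a copy taken from a rescue environment) gives a per-opcode count and the duration of the reset loop. If the same few commands keep timing out at a steady interval, that detail is worth including in a report to the hosting company or the list.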
David Della Vecchia
2011-Oct-18 17:29 UTC
Re: [Xen-users] Severe megasas_raid issues when using Xen dom0 linux kernels
Thank you for that suggestion. I will look into running the MegaRAID monitor on the dom0 if I'm ever able to get the box to boot up successfully. It's occurred to me that since I use iSCSI for domU storage, there's really no reason I need RAID on the main box, so I may do away with it altogether.

Thanks,
David
1 root root 2555 Nov 11 2010 unicode_start >> -rwxr-xr-x. 1 root root 363 Nov 11 2010 unicode_stop >> -rwxr-xr-x 1 root root 26264 May 30 10:55 unlink >> -rwxr-xr-x 1 root root 10208 Jun 25 00:09 usleep >> -rwxr-xr-x 1 root root 771800 Jun 25 04:43 vi >> lrwxrwxrwx. 1 root root 2 Oct 13 10:59 view -> vi >> lrwxrwxrwx. 1 root root 8 Oct 13 10:36 ypdomainname -> hostname >> -rwxr-xr-x. 1 root root 62 Nov 11 2010 zcat >> >> Here is the rough partition information for my main drive: >> >> /boot primary ext3 1gb /dev/sda1 >> /dev/sda2 extended lvm pv 925gb >> vg_gibson lvm-volumegroup 925gb >> / lv_root ext3 36gb >> swap lv_swap 2gb >> >> Server Specs: >> >> Dell Poweredge R710 >> 32GB ECC Unbuffered Ram >> 2x Intel Xeon Quad Core HT 2.3Ghz (16 "cores" total) >> 2x 1TB WD SCSI Drives in Raid-1 >> >> Drive Nitty Gritty: >> Product ID: WDC WD1002FBYS-0 >> Revision: 0C06 >> Size: 953344MB >> >> Heres some more information about the raid controller also attained from >> the >> raid controller config utility: >> >> Product Name: PERC 6/i >> Package: 6.2.0-0013 >> FW Version: 1.22.02-0612 >> BIOS Version: 2.04.00 >> CtrlR Version: 1.02-015B >> Boot Block: 1.00.00.01-0011 >> >> Application & OS Specs: >> CentOS 6 w/2.6.32-131 M.A.Young centos6 xen dom0 kernel >> >> Diagnostic Attempts and Results: >> >> I''ve done a consistency check on the raid array and everything comes back >> as >> clean and optimal. I''ve ran bad block checks, partition table corruption, >> mbr corruption, everything i can think of. It all comes back as clean and >> working fine. Because of these results i have not been able to force my >> dedicated hosting company to replace any of the hardware. They are >> upgrading >> the raid controller software as its about 1 minor version out of date just >> to see if that could be the issue, i''ll report back if that mysteriously >> fixes it but i''m not holding my breath. 
>> >> I''ve read somewhere that the 2.6.x kernels have an old version of the >> megaraid_sas module that will cause problems but the version included in >> the >> M.A.Young centos6 kernel is version 5.3 which is far beyond the 4.3 >> version >> that article recommends upgrading to so i''m really at a loss. Besides the >> version being so new the problem described in that article (the kernel not >> finding the drive at all on boot) is not the issue i''m having. It just >> freaks out randomly (i''m sure its not really randomly, just appears that >> way) and the OS swaps to read-only mode and the only way to reboot is >> basically to push the button on the front of the box. >> >> Please, if anyone can direct me towards a solution or at least down a path >> i >> have yet to try i would greatly appreciate it. I''m at my wits end, i''ve >> been >> fighting this mysterious monster for over a month now and it always seems >> to >> strike right before i''m about to go live with my services (first time it >> happened was right after i started adding customers to the box). >> >> Thanks in advance, >> David >> >> > -- > ------------------------------**------------------------------**------ > Steven C. Timm, Ph.D (630) 840-8525 > timm@fnal.gov http://home.fnal.gov/~timm/ > Fermilab Computing Division, Scientific Computing Facilities, > Grid Facilities Department, FermiGrid Services Group, Group Leader. > Lead of FermiCloud project. >_______________________________________________ Xen-users mailing list Xen-users@lists.xensource.com http://lists.xensource.com/xen-users
Steven Timm
2011-Oct-18 17:36 UTC
Re: [Xen-users] Severe megasas_raid issues when using Xen dom0 linux kernels
On Tue, 18 Oct 2011, David Della Vecchia wrote:

> Thank you for that suggestion, I will look into trying to run the megaraid
> monitor on the domU if I'm ever able to get the box to boot up successfully.

Don't run it on the domU; it will crash everything, that domU plus all other domUs. Run it on dom0. And as I said, most MegaRAID controllers have a menu you can get into during the boot sequence, before the operating system boots, to check the health of the array.

Steve

> It's occurred to me that since I use iSCSI for domU storage there's really no
> reason I need RAID on the main box, so I may do away with it altogether.
>
> Thanks,
> David

--
Steven C. Timm, Ph.D (630) 840-8525
timm@fnal.gov http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Group Leader.
Lead of FermiCloud project.
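For anyone comparing notes on the megasas reset messages David posted from /var/log/messages, here is a rough sketch of how the excerpt could be tallied by device and SCSI opcode. It is only an illustration: the regular expression is inferred from the few lines quoted in this thread, the function name `summarize_resets` is made up for the example, and other driver versions may log in a different format. The opcode names come from the standard SCSI command set (0x00 is TEST UNIT READY, 0x2a is WRITE(10)).

```python
import re
from collections import Counter

# Lines copied from the /var/log/messages excerpt in the original post
# (wrapped lines rejoined). Only a subset is included here.
SAMPLE_LOG = """\
Oct 17 13:21:09 gibson kernel: megasas: [ 0]waiting for 1 commands to complete
Oct 17 13:21:10 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:10 gibson kernel: megasas: reset successful
Oct 17 13:21:20 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=0 retries=0
Oct 17 13:21:21 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=2a retries=0
Oct 17 13:21:41 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=0 retries=0
Oct 17 13:21:42 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=2a retries=0
"""

# Pattern inferred from the quoted lines only; treat it as an assumption.
RESET_RE = re.compile(
    r"\[(?P<dev>\w+)\] megasas: RESET \S+ cmd=(?P<op>[0-9a-fA-F]+) retries=\d+"
)

# Standard SCSI opcodes that appear in the excerpt.
OPCODE_NAMES = {0x00: "TEST UNIT READY", 0x2A: "WRITE(10)"}


def summarize_resets(log_text):
    """Count megasas RESET lines per (device, SCSI opcode)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            opcode = int(match.group("op"), 16)  # cmd= is logged in hex
            counts[(match.group("dev"), opcode)] += 1
    return counts


if __name__ == "__main__":
    for (dev, opcode), n in sorted(summarize_resets(SAMPLE_LOG).items()):
        name = OPCODE_NAMES.get(opcode, "unknown")
        print(f"{dev}: opcode 0x{opcode:02x} ({name}) reset {n} time(s)")
```

Resets that hit both a plain TEST UNIT READY and a WRITE(10) suggest the controller stopped answering any command at all rather than rejecting one particular write, though that alone does not distinguish a hardware fault from a firmware or driver problem.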
David Della Vecchia
2011-Oct-18 17:54 UTC
Re: [Xen-users] Severe megasas_raid issues when using Xen dom0 linux kernels
Sorry that was a typo, i meant to say dom0. Also i have been inside the bios menu for the raid controller and have ran all the health checks and everything came back as clean and optimal. They just finished the upgrade of the raid controller firmware and the server booted up fine with no issues this time.. how bizarre. I''ll be monitoring it closely to see if the upgrade really did fix the issue. -DDV On Tue, Oct 18, 2011 at 1:36 PM, Steven Timm <timm@fnal.gov> wrote:> On Tue, 18 Oct 2011, David Della Vecchia wrote: > > Thank you for that suggestion, i will look into trying to run the megaraid >> monitor on the domU if I''m ever able to get the box to boot up >> successfully. >> > > Don''t run it on the domU, it will crash everything, that domU plus > all other domU''s. run on dom0. and as I said, most megaraid controllers > have a menu you can get into during the boot sequence before the > operating system boots, to check the health of the array. > > Steve > > > > >> It''s occurred to me that since i use iscsi for domU storage theres really >> no >> reason i need raid on the main box so i may do away with it all together. >> >> Thanks, >> David >> >> On Tue, Oct 18, 2011 at 1:18 PM, Steven Timm <timm@fnal.gov> wrote: >> >> Have you tried to use the MegaRAID monitor to see if you can >>> diagnose some hardware problem with the RAID? There is one >>> you can download and run on the linux dom0, there should be a monitor >>> you can get to from the BIOS as well.. those error messages look very >>> much like an actual hardware fault on the RAID array. >>> >>> I have a lot of megasas raid both under SL5 and SL6 and have used them >>> as xen dom0 and kvm vm hosts without problems, several different versions >>> of xen. >>> >>> Steve Timm >>> >>> >>> >>> >>> On Tue, 18 Oct 2011, David Della Vecchia wrote: >>> >>> I''ve tried debian stable and testing, centos5 and 6 with xen 3.1-4.1 >>> >>>> (about >>>> 5 different versions in between). 
I''m currently running xen 4.1.1 >>>> release >>>> on >>>> centos6 with M.A.Young''s centos6 xen dom0 kernel. For some reason the >>>> raid >>>> array freaks out and swaps to read-only mode for the entire virtual >>>> device >>>> the hardware raid array provides. I''ve tried both raid 0 and raid1 (2 >>>> 1tb >>>> SCSI drives). I''ve had this issue in every xen install I''ve tried on >>>> this >>>> box, no matter what kernel version (tried as new as 3.0.1 in debian >>>> wheezy) >>>> or xen version (compiled and installed the unstable branch to test) i >>>> use. >>>> The server was running stable and fine for about a week this time before >>>> this: >>>> >>>> >>>> [root@gibson ~]# df -h >>>> -bash: /bin/df: Input/output error >>>> [root@gibson ~]# w >>>> -bash: /usr/bin/w: Input/output error >>>> [root@gibson ~]# modinfo megasas_raid >>>> -bash: /sbin/modinfo: Input/output error >>>> >>>> part of the /var/log/messages: >>>> >>>> Oct 17 13:21:09 gibson kernel: megasas: [ 0]waiting for 1 commands to >>>> complete >>>> Oct 17 13:21:10 gibson kernel: megaraid_sas: no pending cmds after reset >>>> Oct 17 13:21:10 gibson kernel: megasas: reset successful >>>> Oct 17 13:21:20 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 >>>> cmd=0 >>>> retries=0 >>>> Oct 17 13:21:20 gibson kernel: megasas: [ 0]waiting for 1 commands to >>>> complete >>>> Oct 17 13:21:21 gibson kernel: megaraid_sas: no pending cmds after reset >>>> Oct 17 13:21:21 gibson kernel: megasas: reset successful >>>> Oct 17 13:21:21 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 >>>> cmd=2a retries=0 >>>> Oct 17 13:21:21 gibson kernel: megaraid_sas: no pending cmds after reset >>>> Oct 17 13:21:21 gibson kernel: megasas: reset successful >>>> Oct 17 13:21:41 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 >>>> cmd=0 >>>> retries=0 >>>> Oct 17 13:21:41 gibson kernel: megasas: [ 0]waiting for 1 commands to >>>> complete >>>> Oct 17 13:21:42 gibson kernel: megaraid_sas: no pending cmds after 
reset >>>> Oct 17 13:21:42 gibson kernel: megasas: reset successful >>>> Oct 17 13:21:42 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 >>>> cmd=2a retries=0 >>>> Oct 17 13:21:42 gibson kernel: megaraid_sas: no pending cmds after reset >>>> Oct 17 13:21:42 gibson kernel: megasas: reset successful >>>> Oct 17 13:22:02 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 >>>> cmd=0 >>>> retries=0 >>>> Oct 17 13:22:02 gibson kernel: megasas: [ 0]waiting for 1 commands to >>>> complete >>>> >>>> >>>> [root@gibson ~]# ls -al /bin/ >>>> ls: cannot access /bin/ntfs-3g.secaudit: Input/output error >>>> ls: cannot access /bin/ntfstruncate: Input/output error >>>> ls: cannot access /bin/ntfsdump_logfile: Input/output error >>>> ls: cannot access /bin/ntfsls: Input/output error >>>> ls: cannot access /bin/ntfsdecrypt: Input/output error >>>> ls: cannot access /bin/ntfs-3g.usermap: Input/output error >>>> ls: cannot access /bin/ntfsmount: Input/output error >>>> ls: cannot access /bin/ntfsfix: Input/output error >>>> ls: cannot access /bin/ntfscluster: Input/output error >>>> total 8192 >>>> dr-xr-xr-x. 2 root root 4096 Oct 15 14:49 . >>>> drwxr-xr-x. 29 root root 4096 Oct 17 12:34 .. >>>> -rwxr-xr-x. 1 root root 123 Nov 10 2010 alsaunmute >>>> -rwxr-xr-x 1 root root 27808 May 30 10:55 arch >>>> lrwxrwxrwx. 
1 root root 4 Oct 13 10:36 awk -> gawk >>>> -rwxr-xr-x 1 root root 26264 May 30 10:55 basename >>>> -rwxr-xr-x 1 root root 943248 May 30 11:46 bash >>>> -rwxr-xr-x 1 root root 51344 May 30 10:55 cat >>>> -rwxr-xr-x 1 root root 12200 Jun 25 05:02 cgclassify >>>> -rwxr-xr-x 1 root root 12352 Jun 25 05:02 cgcreate >>>> -rwxr-xr-x 1 root root 11528 Jun 25 05:02 cgdelete >>>> -rwsr-xr-x 1 root root 12136 Jun 25 05:02 cgexec >>>> -rwxr-xr-x 1 root root 15760 Jun 25 05:02 cgget >>>> -rwxr-xr-x 1 root root 13160 Jun 25 05:02 cgset >>>> -rwxr-xr-x 1 root root 55472 May 30 10:55 chgrp >>>> -rwxr-xr-x 1 root root 52472 May 30 10:55 chmod >>>> -rwxr-xr-x 1 root root 57496 May 30 10:55 chown >>>> -rwxr-xr-x 1 root root 122344 May 30 10:55 cp >>>> -rwxr-xr-x 1 root root 136096 Nov 10 2010 cpio >>>> lrwxrwxrwx. 1 root root 4 Oct 13 11:00 csh -> tcsh >>>> -rwxr-xr-x 1 root root 45472 May 30 10:55 cut >>>> -rwxr-xr-x 1 root root 109896 Aug 18 2010 dash >>>> -rwxr-xr-x 1 root root 59552 May 30 10:55 date >>>> -rwxr-xr-x 1 root root 12552 Jun 25 06:47 dbus-cleanup-sockets >>>> -rwxr-xr-x. 1 root root 339048 Jun 25 06:47 dbus-daemon >>>> -rwxr-xr-x 1 root root 18464 Jun 25 06:47 dbus-monitor >>>> -rwxr-xr-x 1 root root 22376 Jun 25 06:47 dbus-send >>>> -rwxr-xr-x 1 root root 10912 Jun 25 06:47 dbus-uuidgen >>>> -rwxr-xr-x 1 root root 54040 May 30 10:55 dd >>>> -rwxr-xr-x 1 root root 70256 May 30 10:55 df >>>> -rwxr-xr-x 1 root root 9896 Jun 25 02:46 dmesg >>>> lrwxrwxrwx. 1 root root 8 Oct 13 10:36 dnsdomainname -> hostname >>>> lrwxrwxrwx. 1 root root 8 Oct 13 10:36 domainname -> hostname >>>> -rwxr-xr-x 1 root root 81120 Nov 11 2010 dumpkeys >>>> -rwxr-xr-x 1 root root 27648 May 30 10:55 echo >>>> -rwxr-xr-x 2 root root 53352 Nov 11 2010 ed >>>> -rwxr-xr-x 1 root root 106528 Aug 25 2010 egrep >>>> -rwxr-xr-x 1 root root 26368 May 30 10:55 env >>>> lrwxrwxrwx. 
1 root root 2 Oct 13 10:59 ex -> vi
>>>> -rwxr-xr-x 1 root root 24592 May 30 10:55 false
>>>> -rwxr-xr-x 1 root root 71328 Aug 25 2010 fgrep
>>>> -rwxr-xr-x 1 root root 238640 Nov 11 2010 find
>>>> -rwxr-xr-x 1 root root 382456 Nov 11 2010 gawk
>>>> -rwxr-xr-x 1 root root 33416 Nov 11 2010 gettext
>>>> -rwxr-xr-x 1 root root 110160 Aug 25 2010 grep
>>>> lrwxrwxrwx. 1 root root 3 Oct 13 10:36 gtar -> tar
>>>> -rwxr-xr-x. 1 root root 61 Nov 11 2010 gunzip
>>>> -rwxr-xr-x 1 root root 68544 Nov 11 2010 gzip
>>>> -rwxr-xr-x 1 root root 16192 Aug 24 2010 hostname
>>>> -rwxr-xr-x 1 root root 14872 Jun 25 00:09 ipcalc
>>>> lrwxrwxrwx. 1 root root 20 Oct 13 10:36 iptables-xml -> /sbin/iptables-multi
>>>> -rwxr-xr-x 1 root root 11248 Nov 11 2010 kbd_mode
>>>> -rwxr-xr-x 1 root root 24648 Aug 22 2010 keyctl
>>>> -rwxr-xr-x 1 root root 15128 Jun 25 02:46 kill
>>>> -rwxr-xr-x 1 root root 26256 May 30 10:55 link
>>>> -rwxr-xr-x 1 root root 49568 May 30 10:55 ln
>>>> -rwxr-xr-x 1 root root 112136 Nov 11 2010 loadkeys
>>>> -rwxr-xr-x 1 root root 30992 Jun 25 02:46 login
>>>> -rwxr-xr-x 1 root root 58368 Sep 12 13:32 lowntfs-3g
>>>> -rwxr-xr-x 1 root root 111744 May 30 10:55 ls
>>>> -rwxr-xr-x 1 root root 14008 Jun 25 05:02 lscgroup
>>>> -rwxr-xr-x 1 root root 12488 Jun 25 05:02 lssubsys
>>>> lrwxrwxrwx. 1 root root 5 Oct 13 10:37 mail -> mailx
>>>> -rwxr-xr-x 1 root root 390360 Aug 22 2010 mailx
>>>> -rwxr-xr-x 1 root root 48544 May 30 10:55 mkdir
>>>> -rwxr-xr-x 1 root root 32352 May 30 10:55 mknod
>>>> -rwxr-xr-x 1 root root 37352 May 30 10:55 mktemp
>>>> -rwxr-xr-x 1 root root 41144 Jun 25 02:46 more
>>>> -rwsr-xr-x. 1 root root 74712 Jun 25 02:46 mount
>>>> -rwxr-xr-x 1 root root 9800 Aug 24 2010 mountpoint
>>>> -rwxr-xr-x 1 root root 111536 May 30 10:55 mv
>>>> -rwxr-xr-x 1 root root 177360 Nov 12 2010 nano
>>>> -rwxr-xr-x 1 root root 127816 Aug 24 2010 netstat
>>>> -rwxr-xr-x 1 root root 28816 May 30 10:55 nice
>>>> lrwxrwxrwx. 1 root root 8 Oct 13 10:36 nisdomainname -> hostname
>>>> -rwxr-xr-x 1 root root 53576 Sep 12 13:32 ntfs-3g
>>>> -rwxr-xr-x 1 root root 11016 Sep 12 13:32 ntfs-3g.probe
>>>> -?????????? ? ? ? ? ? ntfs-3g.secaudit
>>>> -?????????? ? ? ? ? ? ntfs-3g.usermap
>>>> -rwxr-xr-x 1 root root 29896 Sep 12 13:32 ntfscat
>>>> -rwxr-xr-x 1 root root 32992 Sep 12 13:32 ntfsck
>>>> -?????????? ? ? ? ? ? ntfscluster
>>>> -rwxr-xr-x 1 root root 36320 Sep 12 13:32 ntfscmp
>>>> -?????????? ? ? ? ? ? ntfsdecrypt
>>>> -?????????? ? ? ? ? ? ntfsdump_logfile
>>>> -?????????? ? ? ? ? ? ntfsfix
>>>> -rwxr-xr-x 1 root root 57240 Sep 12 13:32 ntfsinfo
>>>> -?????????? ? ? ? ? ? ntfsls
>>>> -rwxr-xr-x 1 root root 30448 Sep 12 13:32 ntfsmftalloc
>>>> l?????????? ? ? ? ? ? ntfsmount
>>>> -rwxr-xr-x 1 root root 34000 Sep 12 13:32 ntfsmove
>>>> -?????????? ? ? ? ? ? ntfstruncate
>>>> -rwxr-xr-x 1 root root 42240 Sep 12 13:32 ntfswipe
>>>> -rwsr-xr-x 1 root root 41432 Nov 11 2010 ping
>>>> -rwsr-xr-x 1 root root 36256 Nov 11 2010 ping6
>>>> -rwxr-xr-x 1 root root 35640 Oct 31 2010 plymouth
>>>> -rwxr-xr-x 1 root root 86776 Nov 11 2010 ps
>>>> -rwxr-xr-x 1 root root 31656 May 30 10:55 pwd
>>>> -rwxr-xr-x 1 root root 11528 Jun 25 02:46 raw
>>>> -rwxr-xr-x 1 root root 40056 May 30 10:55 readlink
>>>> -rwxr-xr-x 2 root root 53352 Nov 11 2010 red
>>>> -rwxr-xr-x. 1 root root 576 Apr 16 2008 redhat_lsb_init
>>>> -rwxr-xr-x 1 root root 57504 May 30 10:55 rm
>>>> -rwxr-xr-x 1 root root 40544 May 30 10:55 rmdir
>>>> lrwxrwxrwx. 1 root root 4 Oct 13 10:39 rnano -> nano
>>>> -rwxr-xr-x 1 root root 29904 Nov 11 2010 rpm
>>>> lrwxrwxrwx. 1 root root 2 Oct 13 10:59 rvi -> vi
>>>> lrwxrwxrwx. 1 root root 2 Oct 13 10:59 rview -> vi
>>>> -rwxr-xr-x 1 root root 72248 Aug 22 2010 sed
>>>> -rwxr-xr-x 1 root root 42312 Nov 11 2010 setfont
>>>> -rwxr-xr-x 1 root root 23600 Aug 22 2010 setserial
>>>> lrwxrwxrwx. 1 root root 4 Oct 13 10:36 sh -> bash
>>>> -rwxr-xr-x 1 root root 27880 May 30 10:55 sleep
>>>> -rwxr-xr-x 1 root root 99000 May 30 10:55 sort
>>>> -rwxr-xr-x 1 root root 65864 May 30 10:55 stty
>>>> -rwsr-xr-x 1 root root 36440 May 30 10:55 su
>>>> -rwxr-xr-x 1 root root 25464 May 30 10:55 sync
>>>> -rwxr-xr-x 1 root root 384920 Nov 11 2010 tar
>>>> -rwxr-xr-x 1 root root 14808 Jun 25 02:46 taskset
>>>> -rwxr-xr-x 1 root root 391288 Jun 25 02:05 tcsh
>>>> -rwxr-xr-x 1 root root 51952 May 30 10:55 touch
>>>> -rwxr-xr-x. 1 root root 11392 Nov 11 2010 tracepath
>>>> -rwxr-xr-x. 1 root root 12304 Nov 11 2010 tracepath6
>>>> -rwxr-xr-x 1 root root 57384 Nov 11 2010 traceroute
>>>> lrwxrwxrwx. 1 root root 10 Oct 13 10:39 traceroute6 -> traceroute
>>>> -rwxr-xr-x 1 root root 24592 May 30 10:55 true
>>>> -rwsr-xr-x. 1 root root 49280 Jun 25 02:46 umount
>>>> -rwxr-xr-x 1 root root 27808 May 30 10:55 uname
>>>> -rwxr-xr-x. 1 root root 2555 Nov 11 2010 unicode_start
>>>> -rwxr-xr-x. 1 root root 363 Nov 11 2010 unicode_stop
>>>> -rwxr-xr-x 1 root root 26264 May 30 10:55 unlink
>>>> -rwxr-xr-x 1 root root 10208 Jun 25 00:09 usleep
>>>> -rwxr-xr-x 1 root root 771800 Jun 25 04:43 vi
>>>> lrwxrwxrwx. 1 root root 2 Oct 13 10:59 view -> vi
>>>> lrwxrwxrwx. 1 root root 8 Oct 13 10:36 ypdomainname -> hostname
>>>> -rwxr-xr-x.
1 root root 62 Nov 11 2010 zcat
>>>>
>>>> Here is the rough partition information for my main drive:
>>>>
>>>> /boot primary ext3 1gb /dev/sda1
>>>> /dev/sda2 extended lvm pv 925gb
>>>> vg_gibson lvm-volumegroup 925gb
>>>> / lv_root ext3 36gb
>>>> swap lv_swap 2gb
>>>>
>>>> Server Specs:
>>>>
>>>> Dell Poweredge R710
>>>> 32GB ECC Unbuffered Ram
>>>> 2x Intel Xeon Quad Core HT 2.3Ghz (16 "cores" total)
>>>> 2x 1TB WD SCSI Drives in Raid-1
>>>>
>>>> Drive Nitty Gritty:
>>>> Product ID: WDC WD1002FBYS-0
>>>> Revision: 0C06
>>>> Size: 953344MB
>>>>
>>>> Heres some more information about the raid controller also attained
>>>> from the raid controller config utility:
>>>>
>>>> Product Name: PERC 6/i
>>>> Package: 6.2.0-0013
>>>> FW Version: 1.22.02-0612
>>>> BIOS Version: 2.04.00
>>>> CtrlR Version: 1.02-015B
>>>> Boot Block: 1.00.00.01-0011
>>>>
>>>> Application & OS Specs:
>>>> CentOS 6 w/2.6.32-131 M.A.Young centos6 xen dom0 kernel
>>>>
>>>> Diagnostic Attempts and Results:
>>>>
>>>> I've done a consistency check on the raid array and everything comes
>>>> back as clean and optimal. I've ran bad block checks, partition table
>>>> corruption, mbr corruption, everything i can think of. It all comes
>>>> back as clean and working fine. Because of these results i have not
>>>> been able to force my dedicated hosting company to replace any of the
>>>> hardware. They are upgrading the raid controller software as its about
>>>> 1 minor version out of date just to see if that could be the issue,
>>>> i'll report back if that mysteriously fixes it but i'm not holding my
>>>> breath.
>>>>
>>>> I've read somewhere that the 2.6.x kernels have an old version of the
>>>> megaraid_sas module that will cause problems but the version included
>>>> in the M.A.Young centos6 kernel is version 5.3 which is far beyond the
>>>> 4.3 version that article recommends upgrading to so i'm really at a
>>>> loss. Besides the version being so new the problem described in that
>>>> article (the kernel not finding the drive at all on boot) is not the
>>>> issue i'm having. It just freaks out randomly (i'm sure its not really
>>>> randomly, just appears that way) and the OS swaps to read-only mode
>>>> and the only way to reboot is basically to push the button on the
>>>> front of the box.
>>>>
>>>> Please, if anyone can direct me towards a solution or at least down a
>>>> path i have yet to try i would greatly appreciate it. I'm at my wits
>>>> end, i've been fighting this mysterious monster for over a month now
>>>> and it always seems to strike right before i'm about to go live with
>>>> my services (first time it happened was right after i started adding
>>>> customers to the box).
>>>>
>>>> Thanks in advance,
>>>> David
>>>>
>>>> --
>>> ------------------------------------------------------------------
>>> Steven C. Timm, Ph.D (630) 840-8525
>>> timm@fnal.gov http://home.fnal.gov/~timm/
>>> Fermilab Computing Division, Scientific Computing Facilities,
>>> Grid Facilities Department, FermiGrid Services Group, Group Leader.
>>> Lead of FermiCloud project.
Scott Meyers
2011-Oct-18 18:55 UTC
RE: [Xen-users] Severe megasas_raid issues when using Xen dom0 linux kernels
I have worked for a colocation company in town for the past two years and have dealt with all sorts of issues with Dell hardware RAID controllers, including the PERC 6/i. Once a Dell RAID controller starts behaving badly, you MUST replace it, and NOW, regardless. It is going to die, and you will be offline until that RAID card is replaced. My suggestion to everybody using a Dell server with a hardware RAID controller: buy a spare card as a backup; you never know. That RAID card will die without advance notice.
Adi Kriegisch
2011-Oct-19 07:48 UTC
Re: [Xen-users] Severe megasas_raid issues when using Xen dom0 linux kernels
On Tue, Oct 18, 2011 at 01:04:28PM -0400, David Della Vecchia wrote:
> I've tried debian stable and testing, centos5 and 6 with xen 3.1-4.1
> (about 5 different versions in between).
[SNIP]
> [root@gibson ~]# df -h
> -bash: /bin/df: Input/output error

I had similar issues with an LSI controller once: a firmware upgrade to the latest firmware solved this for me.

-- Adi
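For anyone wanting to check whether a firmware upgrade or card swap actually changes anything, it may help to count how often the driver logs a reset before and after. Below is a rough Python sketch, not anything from the driver or from Xen. It assumes the syslog line format shown in the original post ("... kernel: megasas: reset successful"); the names `RESET_RE`, `count_resets` and `SAMPLE` are made up for illustration.

```python
import re
from collections import Counter

# Assumed line format, taken from the /var/log/messages excerpt in this thread:
#   Oct 17 13:21:10 gibson kernel: megasas: reset successful
RESET_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+kernel:\s+"
    r"megasas: reset successful"
)


def count_resets(lines):
    """Map 'Mon DD HH:MM' to the number of successful resets logged in that minute."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.match(line)
        if match:
            counts[match.group("stamp")] += 1
    return counts


# Sample lines copied from the log excerpt in the original post.
SAMPLE = [
    "Oct 17 13:21:09 gibson kernel: megasas: [ 0]waiting for 1 commands to complete",
    "Oct 17 13:21:10 gibson kernel: megaraid_sas: no pending cmds after reset",
    "Oct 17 13:21:10 gibson kernel: megasas: reset successful",
    "Oct 17 13:21:20 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=0 retries=0",
    "Oct 17 13:21:21 gibson kernel: megasas: reset successful",
    "Oct 17 13:21:42 gibson kernel: megasas: reset successful",
]

if __name__ == "__main__":
    for minute, n in sorted(count_resets(SAMPLE).items()):
        print(minute, n)
```

To run it against a real log, pass an open file instead of `SAMPLE`, e.g. `count_resets(open("/var/log/messages"))`, bearing in mind that once the filesystem has gone read-only the most recent messages may never have reached disk.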