Hello,

Can anyone give me some help regarding this issue?

---------- Forwarded Message ----------
Subject: [Xen-users] VM getting hang
Date: Thursday 22 January 2009 23:33
From: gopikrishnan <gopikrishnan@crucialp.com>
To: xen-users@lists.xensource.com

Hello,

We have a few Xen virtual machines on our main node. The issue is that, at certain times, some VPSes won't come back after a reboot and hang during the boot process. If we reboot a perfectly working VPS on the same node during that time, it goes down as well. We have rebooted those VPSes many times, but they won't come up. After a few hours (maybe 8-10 hrs), all of them come up on their own.

Can anyone shed some light on this? We checked the Xen logs but couldn't find anything in particular. How can we tackle this in future? Any help, please.

Thanks.

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
What OS and what version of Xen?

Are the VMs hardware-virtualized (HVM) or paravirtualized? Windows, Linux or something else?

The SAQ Group
Registered Office: 18 Chapel Street, Petersfield, Hampshire GU32 3DZ
SAQ is the trading name of SEMTEC Limited. Registered in England & Wales
Company Number: 06481952
http://www.saqnet.co.uk AS29219
On Fri, Jan 23, 2009 at 4:46 PM, gopikrishnan <gopikrishnan@crucialp.com> wrote:
> We have a few XEN virtual machines in our main node. The issue goes like, at
> some particular time, some VPS wont come after reboot and it will get hanged
> during the booting process. Suppose, if we try to reboot a perfectly working
> vps in the same node during that time, then that also will get down.

Let me guess. You're using external storage? Probably a non-enterprise one?
If yes, then most likely it's your storage (disks, controller, SAN switches) that's acting up.

If not, then you'd better give more info about your setup. Saying "help, my machine hangs" without any more info won't get you anywhere. Start with:
- what OS/Xen combo you use
- what your hardware setup is like
- what the log files say (/var/log/messages, etc.)
- what iostat, top, and xm top show

Regards,

Fajar
Hi,

Thanks for getting back to me.

We are using CentOS release 5.2 (dom0) and Xen version 3.1.2-92.1.18.el5. All the VPSes (domU) run various Linux distributions.

We are using an LVM setup in this dom0.

xm top
========
xentop - 21:53:55 Xen 3.1.2-92.1.18.el5
36 domains: 1 running, 35 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 8386596k total, 6288484k used, 2098112k free CPUs: 4 @ 2660MHz
===========

Unfortunately, we had to reboot the dom0 today, and at that time we got the following error message:

==========
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
===========

Hope this gives some hint.
On Fri, Jan 23, 2009 at 5:49 PM, gopikrishnan <gopikrishnan@crucialp.com> wrote:
> We are using CentOS release 5.2 (DOM 0) and Xen version 3.1.2-92.1.18.el5.
> All the VPS (DOM U) are various distros of LINUX.
>
> And we are using an LVM setup in this DOM 0.
>
> xm top
> ========
> xentop - 21:53:55 Xen 3.1.2-92.1.18.el5
> 36 domains: 1 running, 35 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
> Mem: 8386596k total, 6288484k used, 2098112k free CPUs: 4 @ 2660MHz
> ===========

What I actually need from xm top is the CPU load of dom0 and of each domU, i.e. which ones use the most CPU. Cutting the output like you did makes it pretty much useless. Never mind though, I think I already found the problem.

> And unfortunately, we need to reboot the DOM 0 today and at that time, we got
> the following error message:
> ==========
> sd 0:0:0:0: rejecting I/O to offline device
> sd 0:0:0:0: rejecting I/O to offline device
> ===========

Seems like my wild guess was right: it WAS a disk problem :D
The solution is simple: fix your disk. I think sd 0:0:0:0 is sda (i.e. the first disk if it's internal, the first LUN if it's external).

If you're using internal disks with a HW RAID controller, I think you can see which disks are broken from the BIOS, or by looking for disks whose LED is on.

Regards

Fajar
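The per-domain CPU figures asked for above could be pulled out of a full xentop listing along these lines. This is only a sketch: it assumes the usual xentop column order (NAME, STATE, CPU(sec), CPU(%) as the first four columns), and the rows in `SAMPLE_ROWS` are invented for illustration, not taken from this thread. The function name `rank_by_cpu` is likewise hypothetical.

```python
# Sketch: rank domains by CPU(%) from xentop-style rows.
# SAMPLE_ROWS is invented for illustration; it is NOT output from this thread.
SAMPLE_ROWS = """\
      NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS
  Domain-0 -----r       5120   12.5    1048576   12.5   no limit       n/a     4
    vps101 --b---        310    0.4     262144    3.1     262144       3.1     1
    vps102 -----r       9800   96.0     262144    3.1     262144       3.1     1
"""


def rank_by_cpu(text):
    """Return (name, cpu_percent) pairs, busiest domain first."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Only the first four columns are used: NAME, STATE, CPU(sec), CPU(%).
        if len(fields) < 4 or fields[0] == "NAME":
            continue
        try:
            rows.append((fields[0], float(fields[3])))
        except ValueError:
            continue  # summary lines and anything else that is not a domain row
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, cpu in rank_by_cpu(SAMPLE_ROWS):
        print(f"{name:<12} {cpu:5.1f}%")
```

Only the leading columns are read, so a value such as "no limit" in a later column does not upset the split.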
Hey,

Thanks. I'll check the HDD status and let you know.
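The kernel message quoted earlier in the thread names the device by its SCSI address, which reads host:channel:target:LUN. A small sketch for counting such messages per address in saved console or log text follows; the function name `count_offline_rejections` is made up for this example. Note that behind a hardware RAID controller the address normally refers to the logical device the controller exports, not to one physical disk, so the count alone does not identify a failed drive.

```python
import re
from collections import Counter

# Matches kernel lines such as "sd 0:0:0:0: rejecting I/O to offline device".
OFFLINE_RE = re.compile(
    r"sd (\d+):(\d+):(\d+):(\d+): rejecting I/O to offline device"
)


def count_offline_rejections(lines):
    """Count rejection messages per (host, channel, target, lun) address."""
    counts = Counter()
    for line in lines:
        match = OFFLINE_RE.search(line)
        if match:
            counts[tuple(int(group) for group in match.groups())] += 1
    return counts


if __name__ == "__main__":
    # The two lines reported in this thread.
    console = [
        "sd 0:0:0:0: rejecting I/O to offline device",
        "sd 0:0:0:0: rejecting I/O to offline device",
    ]
    for (host, channel, target, lun), n in count_offline_rejections(console).items():
        print(f"host={host} channel={channel} target={target} lun={lun}: {n} rejections")
```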
Hey,

I checked the RAID status and got the following information:

=======================================================
[root@ cmdline]# ./arcconf GETCONFIG 1
Controllers found: 1
----------------------------------------------------------------------
Controller information
----------------------------------------------------------------------
   Controller Status : Optimal
   Channel description : SAS/SATA
   Controller Model : Adaptec 2405
   Controller Serial Number : 8B221057C35
   Physical Slot : 7
   Temperature : 56 C/ 132 F (Normal)
   Installed memory : 128 MB
   Copyback : Disabled
   Background consistency check : Disabled
   Automatic Failover : Enabled
   Global task priority : High
   Stayawake period : Disabled
   Spinup limit internal drives : 0
   Spinup limit external drives : 0
   Defunct disk drive count : 0
   Logical devices/Failed/Degraded : 1/0/0
   --------------------------------------------------------
   Controller Version Information
   --------------------------------------------------------
   BIOS : 5.2-0 (15936)
   Firmware : 5.2-0 (15936)
   Driver : 1.1-5 (2453)
   Boot Flash : 5.2-0 (15936)
   --------------------------------------------------------
   Controller Battery Information
   --------------------------------------------------------
   Status : Not Installed

----------------------------------------------------------------------
Logical device information
----------------------------------------------------------------------
Logical device number 0
   Logical device name : DATA
   RAID level : 10
   Status of logical device : Optimal
   Size : 1429504 MB
   Stripe-unit size : 256 KB
   Read-cache mode : Enabled
   Write-cache mode : Disabled (write-through)
   Write-cache setting : Disabled (write-through)
   Partitioned : Unknown
   Protected by Hot-Spare : No
   Bootable : Yes
   Failed stripes : No
   Power settings : Disabled
   --------------------------------------------------------
   Logical device segment information
   --------------------------------------------------------
   Group 0, Segment 0 : Present (0,0) 9QK0GT1X
   Group 0, Segment 1 : Present (0,1) 9QK0B8FD
   Group 1, Segment 0 : Present (0,3) 9QK0HF0D
   Group 1, Segment 1 : Present (0,2) 9QK0HHF4

----------------------------------------------------------------------
Physical Device information
----------------------------------------------------------------------
   Device #0
      Device is a Hard drive
      State : Online
      Supported : Yes
      Transfer Speed : SATA 3.0 Gb/s
      Reported Channel,Device(T:L) : 0,0(0:0)
      Reported Location : Enclosure 0, Slot 0
      Reported ESD(T:L) : 2,0(0:0)
      Vendor :
      Model : ST3750330NS
      Firmware : SN04
      Serial number : 9QK0GT1X
      Size : 715404 MB
      Write Cache : Enabled (write-back)
      FRU : None
      S.M.A.R.T. : No
   Device #1
      Device is a Hard drive
      State : Online
      Supported : Yes
      Transfer Speed : SATA 3.0 Gb/s
      Reported Channel,Device(T:L) : 0,1(1:0)
      Reported Location : Enclosure 0, Slot 1
      Reported ESD(T:L) : 2,0(0:0)
      Vendor :
      Model : ST3750330NS
      Firmware : SN04
      Serial number : 9QK0B8FD
      Size : 715404 MB
      Write Cache : Enabled (write-back)
      FRU : None
      S.M.A.R.T. : No
   Device #2
      Device is a Hard drive
      State : Online
      Supported : Yes
      Transfer Speed : SATA 3.0 Gb/s
      Reported Channel,Device(T:L) : 0,2(2:0)
      Reported Location : Enclosure 0, Slot 2
      Reported ESD(T:L) : 2,0(0:0)
      Vendor :
      Model : ST3750330NS
      Firmware : SN04
      Serial number : 9QK0HHF4
      Size : 715404 MB
      Write Cache : Enabled (write-back)
      FRU : None
      S.M.A.R.T. : No
   Device #3
      Device is a Hard drive
      State : Online
      Supported : Yes
      Transfer Speed : SATA 3.0 Gb/s
      Reported Channel,Device(T:L) : 0,3(3:0)
      Reported Location : Enclosure 0, Slot 3
      Reported ESD(T:L) : 2,0(0:0)
      Vendor :
      Model : ST3750330NS
      Firmware : SN04
      Serial number : 9QK0HF0D
      Size : 715404 MB
      Write Cache : Enabled (write-back)
      FRU : None
      S.M.A.R.T. : No
   Device #4
      Device is an Enclosure services device
      Reported Channel,Device(T:L) : 2,0(0:0)
      Enclosure ID : 0
      Type : SES2
      Vendor : ADAPTEC
      Model : Virtual SGPIO
      Firmware : 0001
      Status of Enclosure services device
         Temperature : Normal

Command completed successfully.
===========================================================

From the above result, it appears that everything is normal. Can you give any suggestions?

Thanks.
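Reading a report of this size by eye is error-prone, so the health-related fields can be checked mechanically. The sketch below assumes the "key : value" layout seen in the report above and treats the values shown there (Optimal, Online, 0, No) as the nominal ones; the function name `health_summary` and the choice of fields are assumptions made for this example, not part of arcconf.

```python
import re

# "key : value" with whitespace on both sides of the separating colon.
KEY_VALUE = re.compile(r"^\s*(.+?)\s+:\s+(.*?)\s*$")

# Health-related lines copied from the report in this thread.
REPORT_EXCERPT = """\
Controller Status : Optimal
Defunct disk drive count : 0
Logical devices/Failed/Degraded : 1/0/0
Status of logical device : Optimal
Failed stripes : No
State : Online
State : Online
State : Online
State : Online
"""

# Nominal values, taken from the report above (an assumption of this sketch).
EXPECTED = {
    "Controller Status": "Optimal",
    "Defunct disk drive count": "0",
    "Status of logical device": "Optimal",
    "Failed stripes": "No",
    "State": "Online",
}


def health_summary(report):
    """Return (number of checked fields, list of fields that are not nominal)."""
    checked = 0
    problems = []
    for line in report.splitlines():
        match = KEY_VALUE.match(line)
        if not match:
            continue
        key, value = match.groups()
        if key in EXPECTED:
            checked += 1
            if value != EXPECTED[key]:
                problems.append(f"{key}: {value}")
    return checked, problems


if __name__ == "__main__":
    checked, problems = health_summary(REPORT_EXCERPT)
    print(f"checked {checked} fields, problems: {problems or 'none'}")
```

A clean result here only says the controller currently reports nothing wrong; it does not explain why the kernel saw the device as offline earlier.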
On Sat, Jan 24, 2009 at 11:29 PM, gopikrishnan <gopikrishnan@crucialp.com> wrote:
> From the above result, it appears like everything is normal. Can you give any
> suggestions?

A "normal" device should not trigger

==========
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
===========

In my setup I have had similar cases several times, caused by three kinds of problems:

(1) The disks were simply busy.
For example, some hosting appliances use a lot of I/O during startup. Putting several hosting domUs on the same dom0 and starting them all at the same time makes startup take a loooooong time.
When this happens:
- "iostat -x 3" on dom0 during the boot process will show that the disk is busy with high throughput
- there are no weird messages in syslog
- all you have to do is wait patiently

(2) Problems on the SAN switches/connections or HW RAID controller.
For example, when your SAN switch is rebooted. This blocks all disk I/O for some time, and in some cases can lead to data corruption.
When this happens:
- "iostat -x 3" on dom0 (at the time the problem occurs) will show that the disk is busy with very low or no throughput
- depending on your setup, you might get "rejecting I/O to offline device" messages (check the CONSOLE to be sure, not just /var/log/messages)
- sometimes the problem seems to "fix itself" without you having to do anything

(3) Broken disks or controller.
Similar to (2), but this can also happen on local storage. Everything seems to work correctly, but accessing certain data takes a loooong time or fails. This one's the hardest to diagnose, but it sometimes has symptoms similar to (2).

From your earlier mail I suspect it was (3). Then again, from "After a few hours (may be 8-10hrs), all these VPS will come up automatically." it can also be (1).
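The distinction between the cases above comes down to two numbers from one iostat sample: how busy the device is (the %util column of "iostat -x") and how much data actually moves. A rough sketch of that decision follows. Both thresholds and the function name `classify` are illustrative guesses, not values from this thread, and the throughput has to be taken from whichever read/write columns the installed sysstat version prints.

```python
def classify(util_percent, throughput_kb_s, busy_util=90.0, low_throughput_kb_s=100.0):
    """Map one iostat sample to the cases described above.

    busy_util and low_throughput_kb_s are illustrative thresholds only;
    tune them for the storage in question.
    """
    if util_percent < busy_util:
        # Not covered by the three cases: the disk does not look saturated.
        return "not saturated"
    if throughput_kb_s <= low_throughput_kb_s:
        # Busy but moving almost nothing: SAN/controller trouble or a broken disk.
        return "busy, little or no throughput: case (2) or (3)"
    # Busy and moving a lot of data: plain I/O contention.
    return "busy, high throughput: case (1)"


if __name__ == "__main__":
    # Hypothetical samples, not measurements from this thread.
    for util, kb_s in [(99.0, 80000.0), (100.0, 0.0), (10.0, 500.0)]:
        print(f"util={util:5.1f}% throughput={kb_s:8.1f} kB/s -> {classify(util, kb_s)}")
```

The function cannot separate (2) from (3); as the message above notes, their symptoms overlap, so console messages are still needed.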
To be sure though, you'll need some diagnostics from when the problem occurred:
- how was disk throughput at that time (check with "iostat -x 3" or similar commands)
- were there any weird messages on the CONSOLE or in /var/log/messages at that time (depending on the problem, it is possible that error messages were not written to /var/log/messages)
- what was the domU load at that time. Do all domUs use 100% CPU?

Note that some diagnostics have to be collected at the time the problem occurs, not AFTER.

Good luck!

Regards,

Fajar
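Since the diagnostics listed above only help if they are taken while the hang is in progress, one option is a small helper that runs a fixed set of commands and timestamps their output. The sketch below is one way to do that under stated assumptions: it needs Python 3.7 or later (which is not the stock Python on CentOS 5), the function name `snapshot` is made up, and the command list in the usage part is only an example.

```python
import subprocess
import sys
import time


def snapshot(commands, out):
    """Run each command once and write its output to `out` with a timestamp."""
    for argv in commands:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=30
            )
            body = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A missing binary or a command stuck on blocked I/O is itself a finding.
            body = f"failed: {exc}\n"
        out.write(f"=== {stamp} {' '.join(argv)}\n{body}")
        out.flush()


if __name__ == "__main__":
    # "iostat -x 3 2" prints two reports three seconds apart and then exits.
    snapshot(
        [
            ["iostat", "-x", "3", "2"],
            ["tail", "-n", "50", "/var/log/messages"],
        ],
        sys.stdout,
    )
```

Writing to standard output rather than a file on the affected disk is deliberate: if the storage is what is failing, a local log file may never be written, which matches the warning above about messages missing from /var/log/messages.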
Hello Fajar,

Thank you so much for those suggestions. I'll let you know my findings when this issue occurs again.

Once again, thanks.

Gopi.