Marco Sinhoreli
2011-Feb-18  17:57 UTC
[Xen-devel] Guest CentOS 5.4 64bit on-top XCP 0.5 issue / HP ProLiant BL460c G6
Hello all,

I'm running a CentOS 5.4 64-bit guest on top of XCP 0.5 (Xen hypervisor 3.4.2), and it fails to finish booting. During the Linux kernel boot, it goes into a loop like the one below:

<code>
INFO: task swapper:1 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
swapper D ffff880004217d88 0 1 0 2 (L-TLB)
 ffff880004217c90 0000000000000246 0000000000000000 ffff8800010f3460
 0000000000000008 ffff88000425c7a0 ffff88001fa51860 0000000000001678
 ffff88000425c988 00000000000520a1
Call Trace:
 [<ffffffff80287879>] __wake_up_common+0x3e/0x68
 [<ffffffff80262fb3>] wait_for_completion+0x7d/0xaa
 [<ffffffff8028906a>] default_wake_function+0x0/0xe
 [<ffffffff80258b30>] pdflush+0x0/0x207
 [<ffffffff8029c577>] kthread_create+0xc1/0x141
 [<ffffffff8029c3f2>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80258b30>] pdflush+0x0/0x207
 [<ffffffff802889dd>] enqueue_task+0x41/0x56
 [<ffffffff80288a48>] __activate_task+0x56/0x6d
 [<ffffffff802490c3>] try_to_wake_up+0x392/0x3a4
 [<ffffffff80264931>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff802c1187>] start_one_pdflush_thread+0x1b/0x2e
 [<ffffffff8065d20a>] pdflush_init+0xa/0x13
 [<ffffffff8064c7eb>] init+0x1f9/0x2fe
 [<ffffffff80260b2c>] child_rip+0xa/0x12
 [<ffffffff8064c5f2>] init+0x0/0x2fe
 [<ffffffff80260b22>] child_rip+0x0/0x12
</code>

This problem occurs only on HP ProLiant BL460c G6 blades [1]. It does not occur on other servers running XenServer or XCP.

[1] http://h10010.www1.hp.com/wwpc/us/en/sm/WF05a/3709945-3709945-3328410-241641-3328419-3884098.html

Cheers,
--
Marco Sinhoreli
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
nadaoneal
2011-Apr-15  18:51 UTC
[Xen-users] Re: Guest CentOS 5.4 64bit on-top XCP 0.5 issue / HP ProLiant BL460c G6
Hi Marco,

I hope you've already solved this issue to your satisfaction, but I thought I'd post just in case.

There's an issue that affects people who:
- Are running HP or Fujitsu servers with a hardware RAID
- Are running CentOS/RHEL Xen domUs on CentOS/RHEL Xen dom0s
- Are using kernels 2.6.18-194.x or greater (and really, who isn't?)

... it's currently affecting me and I think it's the one affecting you. Please see these Red Hat and CentOS bug reports for some background information:
http://bugs.centos.org/view.php?id=4515
https://bugzilla.redhat.com/show_bug.cgi?id=605444

- You should update your firmware if you haven't already, though that will not solve the problem on its own.
- You should ensure that your battery is charging correctly.
- You should switch your scheduler, on both the dom0 and the domU, to noop. You can do this by adding "elevator=noop" to your kernel line in /etc/grub.conf and restarting.

In my case, I also have a blade (G5 instead of G6), and my stack trace harps on fsync issues rather than pdflush issues, but I suspect you're experiencing more or less the same issue. I'm currently on CentOS 5.6 and 2.6.18-238.9.1.el5xen, but I also see this issue on CentOS 5.5 and with kernels in the -194, -233, and earlier -238 ranges. I see it with Xen 3.0.3 (CentOS's version), 3.4.3, and 4.1 (http://www.gitco.de/repo/).

You're experiencing the issue right away, on boot, but even if upgrading the firmware and changing the scheduler fixes the boot issue, I would encourage you to run some tests in the guest domU to make sure you're okay during times of heavy disk access. I've been using dd to write 1GB to disk as a test:

$ dd if=/dev/zero of=./test1024M bs=1024k count=1024 conv=fsync

I found that before upgrading the firmware and changing the scheduler, this would reliably make dmesg explode with "blocked for more than 120 seconds" messages, and the write speed could be as low as 353 kB/s.
Writing anything less than 1GB did not cause issues as reliably.

Since making these changes, I still sometimes see issues with this heavy test, and still sometimes see a single "blocked for 120 seconds" message. The write speed can be as low as 2MB/sec, but is generally closer to 50MB/sec. So I certainly don't have the answer, but these changes have made a very material difference.

--
View this message in context: http://xen.1045712.n5.nabble.com/Guest-CentOS-5-4-64bit-on-top-XCP-0-5-issue-HP-ProLiant-BL460c-G6-tp3391484p4306328.html
Sent from the Xen - User mailing list archive at Nabble.com.
_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
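For anyone checking whether the "elevator=noop" change actually took effect after the restart: the kernel lists the available schedulers for each disk in /sys/block/<device>/queue/scheduler, with the active one in square brackets. The following is a minimal sketch, assuming a POSIX shell and sed. The helper name active_scheduler is made up for illustration and is not part of any existing tool.

```shell
#!/bin/sh
# Hypothetical helper: print the active I/O scheduler from a sysfs
# scheduler line, e.g. "noop anticipatory deadline [cfq]" -> "cfq".
# Prints nothing if the line contains no bracketed entry.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Report the active scheduler for every block device that exposes one.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue          # skip if the glob matched nothing
    dev=${f#/sys/block/}             # strip the leading path
    dev=${dev%%/*}                   # keep only the device name
    echo "$dev: $(active_scheduler "$(cat "$f")")"
done
```

The scheduler can also be switched at runtime for a single device by writing its name to the same file as root, for example `echo noop > /sys/block/sda/queue/scheduler` (sda is only an example device name). That setting does not survive a reboot, unlike the /etc/grub.conf change.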
nadaoneal
2011-Apr-15  20:10 UTC
[Xen-users] Re: Guest CentOS 5.4 64bit on-top XCP 0.5 issue / HP ProLiant BL460c G6
I apologize for double-posting, but I've just "verified" with about 50 trials that capping the RAID max write speed on the dom0 at about 50MB/sec seems to allow the domU to very consistently write 1GB at 46-48MB/sec, without any dmesg errors.

So I would update my recommendations to:
- install any relevant firmware upgrades and ensure there's no battery issue
- on dom0: echo "50000" > /proc/sys/dev/raid/speed_limit_max ... if your domU uses a RAID configuration, you might want to do this on domU as well
- on domU and dom0: change default scheduler to noop

Again, hope this is helpful to someone. It's just a band-aid - the actual fix will come either from the kernel or from a firmware update, or both, eventually.

--
View this message in context: http://xen.1045712.n5.nabble.com/Guest-CentOS-5-4-64bit-on-top-XCP-0-5-issue-HP-ProLiant-BL460c-G6-tp3391484p4306475.html
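The repeated dd trials described in this thread can be scripted roughly as follows. This is only a sketch under assumptions, not the procedure actually used for the 50 trials: the function names write_trials and count_hung_tasks are hypothetical, timing uses whole seconds from `date +%s`, and dd's own output is discarded.

```shell
#!/bin/sh
# Hypothetical helper: count "blocked for more than" warnings in
# kernel-log text read from stdin. Prints 0 when there are none.
count_hung_tasks() {
    grep -c 'blocked for more than' || true
}

# Hypothetical helper: run the dd write test TRIALS times in DIR,
# writing COUNT MiB each time, and print the elapsed seconds per trial.
# Usage: write_trials TRIALS COUNT DIR
write_trials() {
    trials=$1 count=$2 dir=$3
    i=1
    while [ "$i" -le "$trials" ]; do
        start=$(date +%s)
        # Same dd invocation as in the thread, with a variable size.
        dd if=/dev/zero of="$dir/test${count}M" bs=1024k count="$count" conv=fsync 2>/dev/null
        end=$(date +%s)
        echo "trial $i: ${count} MiB in $((end - start)) s"
        i=$((i + 1))
    done
    rm -f "$dir/test${count}M"       # clean up the test file
}
```

With these defined, something like `write_trials 50 1024 .` followed by `dmesg | count_hung_tasks` in the domU would repeat the 1GB test and show how many hung-task warnings are in the kernel log. The current cap on the dom0 can be read back with `cat /proc/sys/dev/raid/speed_limit_max`; the value is in KB/sec, so 50000 corresponds to roughly 50MB/sec.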