Konrad Scherer
2011-Dec-09 16:33 UTC
Observations building a Linux distro testing Xen cluster
Hello all,

First, thank you to everyone involved in the Xen project. This project has enabled so many other cool and useful projects. I have some comments on my experience installing more than 30 distros, as well as a request for guidance in debugging intermittent compiler failures in guest OSes running 2.6.32 kernels.

Some background: I work at Wind River Systems on the Embedded Linux team. Think of a commercial, Gentoo-like embedded Linux system. All of our customers build our product from source. Xen helps us do coverage building of our distribution (which cross-compiles for embedded targets) on all of our supported Linux distributions. The currently supported list includes:

  RedHat 4.8 i386
  RedHat 5.0 through 5.6, 6.0, 6.1 i386 and x86_64
  SLED 10.2 i386, SLED 11.0 i386 and x86_64
  OpenSuSE 11.2 i386 and x86_64
  Fedora 13 i386 and x86_64
  Ubuntu 10.04 i386 and x86_64

I have also done builds on some of our "unsupported" Linux distros: Ubuntu 10.10, 11.04, Fedora 14 and 15, etc. In total: over 30 Linux variants, all running as PV DomU. The good news is that all of them run using the Xen Dom0 2.6.32 and 3.1 kernels in Debian Squeeze and Wheezy!

Here is a list of things I have noticed along the way that may be of interest to other people:

1) The stock RedHat 6.0 kernel (2.6.32-74) deadlocks under heavy I/O. This is fixed in 2.6.32-83. The workaround is to disable the irqbalance service.
   https://bugzilla.redhat.com/show_bug.cgi?id=550724

2) On Debian Wheezy, the Dom0 kernel does not autoload some of the necessary Xen modules, such as xen_gntdev. The solution is to run:

     echo "xen_gntdev" >> /etc/modules

   Could these modules be autoloaded if a kernel detects that it is running as Dom0?

3) Random errors on 2.6.32 i386 kernels. Building a distro is very CPU and I/O intensive. I have noticed some random errors that manifest as "Internal Compiler Errors" or mysterious build errors, and they cluster on 32-bit 2.6.32 DomU kernels (on 2.6.32 and 3.1 Dom0 x86_64 kernels). They are sporadic, but I get 5 to 10 a week.
The Fedora 13 VMs (kernel 2.6.35) do not have these errors. None of our bare-metal builders show these kinds of errors, and I have tried different hardware and Dom0 kernels. Is this a known issue? Since there is no Xen or DomU crash or coredump, what would be the best way I could help a developer track this down?

4) All builds on RedHat 5.0 through 5.6 are 10% faster than on all other distros: RH 4.8, 6.x, Fedora, OpenSuSE, etc. I suspect it may have to do with some RH-specific ext3 patches or Xen PV driver optimizations. I have a bunch of single-VM desktop machines using the 2.6.32 Dom0 that consistently produce this result. Has anyone else seen this? Any ideas what could be causing this?

Thanks

--
Konrad Scherer, Sr. Engineer, Linux Products Group, Wind River
direct 613-963-1342 fax 613-592-2283