The basic problem is that when I reboot a node in my cluster, it comes back up without its static routes. I posted this first to the linux-ha list, but I have since determined that the problem happens when Xen is stopped or started. The original post appears below.

Since then, I have determined that if I add code to my bridge-wrapper script (which basically calls the network-bridge script once for each interface) that manually adds the static routes back in, the problem goes away on startup, but still occurs when I put a node into standby (which takes down all the shared resources including Xen).

I realize that I am running an old version of Xen, but I am mostly wondering if this behavior is intentional or if there is a less kludgy workaround (like a config parameter that can be set that I have missed). Am I the only one in the world who has to use static routes? Unfortunately I am stuck with them because we have a /16 address space that is partially inside and partially outside our security perimeter, which means some subnets are reached through the external interface and some through the internal one; these are defined with static routes.

TIA,
--Greg

=============================================================================

OS: CentOS 5.5
heartbeat: heartbeat-3.0.3-2.3.el5 (latest from clusterlabs)
pacemaker: pacemaker-1.0.9.1-1.15.el5 (latest from clusterlabs)

If it matters, this cluster is primarily used to run Xen virtual machines (xen-3.0.3-105.el5_5.5 kernel-2.6.18-194.11.1.el5xen latest from CentOS)

I have been looking off and on for the source of this problem for quite a while without finding what is causing it. The basic problem is that when I reboot a node in my cluster, it comes back up without its static routes. Adding them back in manually works; they stay until the next reboot. These are defined in /etc/sysconfig/static-routes and are added by the network service at boot time.
I have been able to pretty much rule out the boot process itself as the source of the problem. I added a "netstat -r -n > /tmp/static-routes" command to the rc.local file which is the very last thing run at boot time and the routes are there. I have also tried putting nodes into standby (crm node standby) and back online, and the routes stay there through that. But once I log in after a reboot, the static routes are gone and I have to manually re-add them.

I can probably work around this using a hideous kludge like having the rc.local file run a background job that sleeps for a couple of minutes, then adds the routes, but that doesn't really fix the issue and isn't guaranteed to work reliably (obviously high reliability is important or I wouldn't be using HA in the first place).

Has anyone ever seen this before or have any clue where I can look to troubleshoot this?

Thanks in advance,
--Greg

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
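
A minimal sketch of the "add the routes back" workaround described above. It assumes the entries in /etc/sysconfig/static-routes use the `any net ... netmask ... gw ...` line format; the function name `static_route_cmds` and the addresses in the comment are placeholders, not taken from the post. The function only prints the commands, so the result can be inspected before anything touches the routing table.

```shell
#!/bin/sh
# Sketch: print the "route add" commands for the non-interface-specific
# ("any") entries in a static-routes style file. Nothing is executed here.
# Assumed line format (placeholder values):
#   any net 10.1.0.0 netmask 255.255.0.0 gw 192.168.1.1
static_route_cmds() {
    routes_file=$1
    # A missing file simply means there is nothing to add.
    [ -f "$routes_file" ] || return 0
    # Drop the leading "any" keyword and hand the rest to route(8).
    grep '^any' "$routes_file" | while read -r ignore args; do
        echo "/sbin/route add -$args"
    done
}
```

Piping the output to `sh` (for example `static_route_cmds /etc/sysconfig/static-routes | sh`) at the end of a bridge-wrapper script would apply the routes. Re-adding a route that is already present makes `route` report an error for that entry, which should be harmless in this context.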
On Mon, Aug 23, 2010 at 10:45 PM, Greg Woods <woods@ucar.edu> wrote:
> The basic problem is that when I reboot a node in my cluster, it comes
> back up without its static routes.

... and how did you determine that your setup works without Xen?

> These are defined in /etc/sysconfig/static-routes and are added
> by the network service at boot time.

If they work on a normal kernel (without Xen), then you should file a bug with Red Hat. Generally they'd maintain the default network-bridge script to "just work".

If it's really Xen networking that causes the problem, you could simply discard the default network-bridge script and create your bridges manually using /etc/sysconfig/network-scripts/ifcfg-*

I'm not sure you've configured static routes correctly though. See
http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.5/html/Deployment_Guide/s1-networkscripts-static-routes.html
http://www.akadia.com/services/redhat_static_routes.html

-- 
Fajar
On Mon, 2010-08-23 at 23:32 +0700, Fajar A. Nugraha wrote:

> ... and how did you determine that your setup works without Xen?

I have not definitively done that yet. It would require taking down the entire cluster including all the VMs, then removing the xen resources from the cluster configuration, resulting in an extended down time for all the VMs. I did determine that putting the routes back in the bridge startup script works around the issue.

> I'm not sure you've configured static routes correctly though. See
> http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.5/html/Deployment_Guide/s1-networkscripts-static-routes.html

I will have to give this a try. First I've ever heard of a route-eth0 file. The network service startup script (/etc/rc.d/init.d/network) clearly uses the /etc/sysconfig/static-routes file.

One thing I have determined is that whatever is dropping the routes, it is happening after booting is complete. Since rc.local is the very last boot-time service executed, and "netstat -r -n" shows that the static routes are properly there at that time, it has to be something that occurs after initial boot that is removing them.

The main reason for suspecting xen networking is that there are "ip route delete" commands in some of the scripts. That is the only place on the system I have found anything like this. I have other high availability clusters that do not have Xen where this issue does not occur. I realize this is not "definitive proof" that Xen is at fault. I am not trying to point fingers or file bug reports at this point, I am just trying to troubleshoot.

--Greg
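
One way to narrow down when the routes vanish is to compare the snapshot taken from rc.local with a later one. The sketch below is only an illustration: the function name `routes_removed` and the file paths in the comments are hypothetical, and it assumes both snapshots hold one route per line as saved from the same command (`netstat -r -n` or `ip route show`).

```shell
#!/bin/sh
# Sketch: print the lines that are in the "before" snapshot but missing
# from the "after" snapshot. Lines are compared as whole fixed strings.
#   $1: snapshot saved earlier (e.g. from rc.local)
#   $2: snapshot saved later (e.g. after logging in)
routes_removed() {
    grep -F -x -v -f "$2" "$1"
}

# Hypothetical usage (paths are placeholders):
#   netstat -r -n > /tmp/routes.boot     # at the end of rc.local
#   netstat -r -n > /tmp/routes.later    # once the node is fully up
#   routes_removed /tmp/routes.boot /tmp/routes.later
```

Taking the later snapshot at a few different points (before and after the cluster software starts Xen, for instance) would show which step the routes do not survive. If the installed iproute2 supports it, `ip monitor route` left running in the background is another way to log route changes as they happen.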
On Tue, Aug 24, 2010 at 12:25 AM, Greg Woods <woods@ucar.edu> wrote:
> On Mon, 2010-08-23 at 23:32 +0700, Fajar A. Nugraha wrote:
>
>> ... and how did you determine that your setup works without Xen?
>
> I have not definitively done that yet. It would require taking down the
> entire cluster including all the VMs, then removing the xen resources
> from the cluster configuration, resulting in an extended down time for
> all the VMs. I did determine that putting the routes back in the bridge
> startup script works around the issue.

Why take down everything?

What I meant was whether the static route setup works without Xen. It should be something as simple as:
- installing a normal (non-xen) kernel (if not already installed) on one of the nodes
- rebooting and choosing that kernel

That would at least verify whether the routes stay on after booting or not, and whether some startup script removes them. The only difference (startup-script wise) between the default normal and xen kernel setups is that xen's network-bridge script shouldn't be running.

If it works -> probably xen's network-bridge does something wrong, and you should definitely file a bug with RH.
If it still doesn't work (routes still missing several minutes after boot, or not appearing at all) -> something else is causing problems in your setup.

> The main reason for
> suspecting xen networking is that there are "ip route delete" commands
> in some of the scripts. That is the only place on the system I have
> found anything like this. I have other high availability clusters that
> do not have Xen where this issue does not occur.

Is it part of /etc/xen/scripts/network-bridge? If yes, you can also test disabling the script and creating your own bridge.
Something like
https://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.5/html/Virtualization_Guide/sect-Virtualization-Network_Configuration-Bridged_networking_with_libvirt.html

With this setup, don't forget that the IP address settings will be in the bridge's config file (br0, xenbr0, or whatever bridge name you choose) instead of eth0.

-- 
Fajar
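
A rough sketch of what the manually configured bridge suggested above might look like with the Red Hat style ifcfg files. The bridge name xenbr0, the interface eth0, and all addresses are placeholders chosen for illustration; none of them come from the thread, and the real values would have to match the node's existing network settings.

```shell
# --- /etc/sysconfig/network-scripts/ifcfg-eth0 ---
# Physical interface: enslaved to the bridge, carries no IP address itself.
DEVICE=eth0
ONBOOT=yes
BRIDGE=xenbr0

# --- /etc/sysconfig/network-scripts/ifcfg-xenbr0 ---
# Bridge: the IP settings that used to be on eth0 go here.
# Address and netmask below are placeholders.
DEVICE=xenbr0
TYPE=Bridge
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.0.2.10
NETMASK=255.255.255.0

# --- /etc/sysconfig/network-scripts/route-xenbr0 ---
# Per-interface static routes, one per line (placeholder route):
#   198.51.100.0/24 via 192.0.2.1
```

With a layout like this the static routes would be tied to the bridge device rather than to eth0, so they would need to be listed in the bridge's route file instead of a route-eth0 file.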
> What I meant was whether the static route setup works without Xen. It
> should be something as simple as:
> - installing normal (non-xen) kernel (if not already installed) on one
> of the nodes
> - reboot choosing that kernel

Not so simple, because if I boot a cluster node that doesn't start Xen resources properly, this can cause a stonith death match (where the two nodes keep killing each other, or the working node keeps killing the non-working node). What this means is that I would have to disable the cluster software as well (at least on the testing node), which makes it not quite this simple. Still, it is a test that I could do and I'll find a time when I can do it. It requires rebooting some of the DomUs to move them between nodes, so I can't do this during the work day.

For the present this is non-urgent because adding the routes back at the end of the network-bridge script at least causes the routes to be present after startup. The only problem is that the routes disappear again when a cluster node is taken offline (out of the cluster), but a node is never left that way for a long period of time, so the workaround is acceptable for now.

The static routes, of course, only affect the Dom0 since the DomUs have their own routing tables. Mainly it causes my backups to our mass storage device to fail because the connection comes from the wrong interface (and therefore the wrong IP address). So while I do want to eventually track this down, it's not an urgent priority. Thanks for the pointers though, it should help in troubleshooting.

> Is it part of /etc/xen/scripts/network-bridge?

Yes. There are a lot of "ip route" commands in there, including "ip route del". But it is fairly convoluted: the commands to execute are generated with sed scripts, so it isn't exactly clear what it is doing or trying to do there. I don't fully understand how Linux bridging works either.
--Greg
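
To make the "commands generated by a script" idea more concrete, here is an illustrative sketch of that general technique. It is not the actual network-bridge code: the function name `move_route_cmds` is hypothetical, awk stands in for the sed scripts mentioned above, and the device names in the example are placeholders. It reads `ip route list` style lines on stdin and prints a delete/add pair for every route bound to one device so that it ends up on another; nothing is executed.

```shell
#!/bin/sh
# Sketch: for each route on device $1, print the commands that would
# delete it and re-create it on device $2. Output only; runs nothing.
move_route_cmds() {
    awk -v src="$1" -v dst="$2" '
        {
            # Look for the "dev <src>" pair anywhere in the route line.
            for (i = 1; i < NF; i++) {
                if ($i == "dev" && $(i + 1) == src) {
                    print "ip route del " $0
                    $(i + 1) = dst          # rewrite the device name
                    print "ip route add " $0
                    break
                }
            }
        }'
}

# Hypothetical usage (device names are placeholders):
#   ip route list | move_route_cmds eth0 peth0
```

A script working this way deletes and re-adds every route on the device whenever the bridge is brought up or torn down, so if any re-add fails or is skipped, that route is simply gone afterwards. Whether this is what happens on the nodes discussed here is not established in the thread; reading the commands such a step prints is one way to find out.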