Graham, Simon
2006-Dec-04 19:18 UTC
[Xen-devel] XENBUS: Timeout connecting to device errors
We''ve been noticing a lot of these errors when booting VMs since we moved to 3.0.3 - I''ve traced this to the hotplug scripts in Dom0 taking>10s to run to completion and specifically the vif-bridge script taking >=9s to plug the vif into the s/w bridge on occasion - was wondering ifanyone has any insight into why it might take this long. I added some instrumentation to the scripts to log entry/exit from xen-backend.agent and also lock contention (attached at the end of this) and have the following observations: 1. Currently, the various script invocations are issued in parallel but are serialized by a single global lock -- is it really necessary, for example, to serialize vif and vbd hot plug processing in Dom0? 2. In most cases we''ve seen, this problem happens when the first VM is started after re-installing a box. In the example below, the ''vif online'' processing started at 2:21:53 and did not finish until 2:22:04 3. Clearly a hard coded timeout of 10s is less than perfect -- is there no better way of knowing when the hotplug processing is done? Thanks, Simon <dom0 /var/log/messages:> Dec 4 02:21:53 gromit xen-hotplug: /etc/hotplug/xen-backend.agent: xen-backend[20234]: Start vif: add Dec 4 02:21:53 gromit xen-hotplug: /etc/hotplug/xen-backend.agent: xen-backend[20234]: End vif: add Dec 4 02:21:53 gromit xen-hotplug: /etc/hotplug/xen-backend.agent: xen-backend[20240]: Start vif: online Dec 4 02:21:53 gromit xen-hotplug: /etc/hotplug/xen-backend.agent: xen-backend[20252]: Start vbd: add Dec 4 02:21:53 gromit xen-hotplug: /etc/hotplug/xen-backend.agent: Lock /var/run/xen-hotplug/xenbus_hotplug_global by 20252 - currently owned by 20240: /etc/hotplug/xen-backend.agent Dec 4 02:21:54 gromit lvm[12123]: XenDom wallace1: state changed stopped => paused Dec 4 02:21:54 gromit sn2spine: start RESULT <?xml version="1.0" ?> <result status=''ok'' code=''200''> <guest id="3f879d14-8c70-48af-ae02-88df3afad3cb"><name>wallace1</name><id>3f879 d14-8c70-48af-ae02-88df3afad3cb</id><system>gromit.sn.stratus.com</syste m><state>starting</state><availability>failover</availability><mode>dupl ex</mode><memory>256</memory><cpus>1</cpus><storage><volume device="hda1" mountpoint="/" name="drbd0"/></storage></guest> </result> Dec 4 02:21:54 gromit xen-hotplug: /etc/xen/scripts/vif-bridge: online XENBUS_PATH=backend/vif/1/0 Dec 4 02:21:54 gromit kernel: device vif1.0 entered promiscuous mode Dec 4 02:21:54 gromit xen-hotplug: /etc/xen/scripts/vif-bridge: iptables -A FORWARD -m physdev --physdev-in vif1.0 -j ACCEPT failed. If you are using iptables, this may affect networking for guest domains. Dec 4 02:21:55 gromit kernel: xenbr0: port 3(vif1.0) entering learning state Dec 4 02:21:59 gromit kernel: xenbr0: topology change detected, propagating Dec 4 02:22:03 gromit kernel: xenbr0: port 3(vif1.0) entering forwarding state Dec 4 02:22:04 gromit lvm[12123]: XenDom wallace1: state changed paused => running Dec 4 02:22:04 gromit xen-hotplug: /etc/hotplug/xen-backend.agent: xen-backend[20240]: End vif: online Dec 4 02:22:18 gromit kernel: ip_tables: (C) 2000-2006 Netfilter Core Team Dec 4 02:22:22 gromit xen-hotplug: /etc/xen/scripts/block: add XENBUS_PATH=backend/vbd/1/769 Dec 4 02:22:26 gromit xen-hotplug: /etc/hotplug/xen-backend.agent: xen-backend[20252]: End vbd: add Dec 4 02:22:29 gromit lvm[12123]: XenDom wallace1: state changed running => crashed <guest console>: Dec 4 02:21:54 Linux version 2.6.16.29-xenU (sntriage@anna.sn.stratus.com) (gcc version 3.4.4 20050721 (Red Hat 3.4.4-2)) #1 SMP Mon Dec 4 01:33:25 EST 2006 ... Dec 4 02:22:04 XENBUS: Timeout connecting to device: device/vbd/769 (state 3) Dec 4 02:22:04 Root-NFS: No NFS server available, giving up. Dec 4 02:22:04 VFS: Unable to mount root fs via NFS, trying floppy. Dec 4 02:22:04 VFS: Cannot open root device "hda1" or unknown-block(2,0) Dec 4 02:22:04 Please append a correct "root=" boot option Dec 4 02:22:04 Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(2,0) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ewan Mellor
2006-Dec-04 19:54 UTC
Re: [Xen-devel] XENBUS: Timeout connecting to device errors
On Mon, Dec 04, 2006 at 02:18:37PM -0500, Graham, Simon wrote:> We''ve been noticing a lot of these errors when booting VMs since we > moved to 3.0.3 - I''ve traced this to the hotplug scripts in Dom0 taking > >10s to run to completion and specifically the vif-bridge script taking > >=9s to plug the vif into the s/w bridge on occasion - was wondering if > anyone has any insight into why it might take this long. > > I added some instrumentation to the scripts to log entry/exit from > xen-backend.agent and also lock contention (attached at the end of this) > and have the following observations: > > 1. Currently, the various script invocations are issued in parallel but > are serialized > by a single global lock -- is it really necessary, for example, to > serialize vif > and vbd hot plug processing in Dom0?You need to serialise VBD hotplug if you are going to get the right result when performing the sharing check. If you''re using vif-nat, you need to serialise the modifications to the DHCP configuration file. Other than that, I don''t think that there''s a need to serialise events at startup. On Bugzilla #515, Harry Butterworth notes that there is a race condition in teardown, which is why he introduced the global lock. You could make this cleverer, possibly, so that it doesn''t affect startup times. All that said, I believe that udev is supposed to serialise all events anyway, so unless you''re using hotplug rather than udev, I''d expect you to see no lock contention whatsoever.> 2. In most cases we''ve seen, this problem happens when the first VM is > started after > re-installing a box. In the example below, the ''vif online'' > processing started at > 2:21:53 and did not finish until 2:22:04 > > 3. Clearly a hard coded timeout of 10s is less than perfect -- is there > no better way of knowing > when the hotplug processing is done?We know precisely when hotplugging is done -- the scripts write an entry into the store to tell us so. It''s knowing when they''ve locked up that''s the hard bit. If you''re seeing vif bringup taking 9 seconds, then clearly the 10 second timeout is far too short. There''s no particular reason to keep the timeout short, so feel free to lengthen it, with the obvious consequences. Bear in mind that Xend will time out the whole device bringup phase after 100 seconds. I''d want to root-cause the 9 second bringup as well, as I don''t see why it ought to take that long. Cheers, Ewan. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Graham, Simon
2006-Dec-05 16:05 UTC
RE: [Xen-devel] XENBUS: Timeout connecting to device errors
> > All that said, I believe that udev is supposed to serialise all events > anyway, so unless you''re using hotplug rather than udev, I''d expectyou> to see > no lock contention whatsoever. >We''re using hotplug (& a RHEL4 based Dom0) -- it seems to issue all the hotplug script invocations at the same time.> > 3. Clearly a hard coded timeout of 10s is less than perfect -- is > there > > no better way of knowing > > when the hotplug processing is done? > > We know precisely when hotplugging is done -- the scripts write an > entry into > the store to tell us so. It''s knowing when they''ve locked up that''s > the hard > bit. >Of course - silly me!> If you''re seeing vif bringup taking 9 seconds, then clearly the 10 > second > timeout is far too short. There''s no particular reason to keep the > timeout > short, so feel free to lengthen it, with the obvious consequences. > Bear in > mind that Xend will time out the whole device bringup phase after 100 > seconds. > I''d want to root-cause the 9 second bringup as well, as I don''t seewhy> it > ought to take that long. >So, just for grins, I increased the timeout to 20s since that seemed to be plenty of time based on the log files I''ve seen -- the timeout still happens and now the hotplug scripts seem to take ~20s to run... I''m beginning to think that there is some resource contention between the domain that is starting up and dom0... All suggestions welcome! Simon _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel