Nathan March
2012-Jan-31 21:42 UTC
Crashing / unable to start domUs due to high number of luns?
Hi All,

We've got a Xen setup based around a Dell iSCSI device, with each Xen host having 2 LUNs; we then run multipath on top of that. After adding a couple of new virtual disks the other day, a couple of our online, stable VMs suddenly hard locked up. Attaching to the console gave me nothing; it looked like they had lost their disk devices.

Attempting to restart them on the same dom0 failed with hotplug errors, as did attempting to start them on a few different dom0s. After doing a "multipath -F" to remove unused devices and manually bringing in just the selected LUNs via "multipath diskname", I was able to start them successfully. This initially made me think perhaps I was hitting some sort of udev / multipath / iSCSI device LUN limit (136 LUNs, 8 paths per LUN = 1088 iSCSI connections). Just to be clear, the problem occurred on multiple dom0s at the same time, so it definitely seems iSCSI related.

Now, a day later, I'm debugging this further and I'm again unable to start VMs, even with all extra multipath devices removed. I rebooted one of the dom0s and was able to migrate our production VMs off a broken server, so I've now got an empty dom0 that's unable to start any VMs.

Starting a VM results in the following in xend.log:

[2012-01-31 13:06:16 12353] DEBUG (DevController:144) Waiting for 0.
[2012-01-31 13:06:16 12353] DEBUG (DevController:628) hotplugStatusCallback /local/domain/0/backend/vif/35/0/hotplug-status.
[2012-01-31 13:07:56 12353] ERROR (SrvBase:88) Request wait_for_devices failed.
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/xen/web/SrvBase.py", line 85, in perform
    return op_method(op, req)
  File "/usr/lib64/python2.6/site-packages/xen/xend/server/SrvDomain.py", line 85, in op_wait_for_devices
    return self.dom.waitForDevices()
  File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 1237, in waitForDevices
    self.getDeviceController(devclass).waitForDevices()
  File "/usr/lib64/python2.6/site-packages/xen/xend/server/DevController.py", line 140, in waitForDevices
    return map(self.waitForDevice, self.deviceIDs())
  File "/usr/lib64/python2.6/site-packages/xen/xend/server/DevController.py", line 155, in waitForDevice
    (devid, self.deviceClass))
VmError: Device 0 (vif) could not be connected. Hotplug scripts not working.
[2012-01-31 13:07:56 12353] DEBUG (XendDomainInfo:3071) XendDomainInfo.destroy: domid=35
[2012-01-31 13:07:58 12353] DEBUG (XendDomainInfo:2401) Destroying device model

I tried turning up udev's log level, but that didn't reveal anything. Reading the xenstore for the vif doesn't show anything unusual either:

ukxen1 ~ # xenstore-ls /local/domain/0/backend/vif/35
0 = ""
 bridge = "vlan91"
 domain = "nathanxenuk1"
 handle = "0"
 uuid = "2128d0b7-d50f-c2ad-4243-8a42bb598b81"
 script = "/etc/xen/scripts/vif-bridge"
 state = "1"
 frontend = "/local/domain/35/device/vif/0"
 mac = "00:16:3d:03:00:44"
 online = "1"
 frontend-id = "35"

The bridge device (vlan91) exists, and trying a different bridge makes no difference. Removing the VIF completely results in the same error for the VBD. Adding debugging to the hotplug/network scripts didn't reveal anything; it looks like they aren't even being executed yet. Nothing is logged to xen-hotplug.log.

The only thing I can think of that this may be related to is that Gentoo defaulted to a 10 MB /dev, which we filled up a few months back. We upped the size to 50 MB in the mount options and everything's been completely stable since (~33 days). None of the /dev filesystems on the dom0s is higher than 25% usage. Aside from adding the new LUNs, no changes have been made in the past month.

To test whether removing some devices would solve anything, I tried doing an "iscsiadm -m node --logout" and it promptly hard locked the entire box. After a reboot, I was unable to reproduce the problem on that particular dom0.

I've still got 1 dom0 that's exhibiting the problem, if anyone is able to suggest any further debugging steps?

- Nathan

(XEN) Xen version 4.1.1 (root@) (gcc version 4.3.4 (Gentoo 4.3.4 p1.1, pie-10.1.5) ) Mon Aug 29 16:24:12 PDT 2011

ukxen1 xen # xm info
host                 : ukxen1
release              : 3.0.3
version              : #4 SMP Thu Dec 22 12:44:22 PST 2011
machine              : x86_64
nr_cpus              : 24
nr_nodes             : 2
cores_per_socket     : 6
threads_per_core     : 2
cpu_mhz              : 2261
hw_caps              : bfebfbff:2c100800:00000000:00003f40:029ee3ff:00000000:00000001:00000000
virt_caps            : hvm hvm_directio
total_memory         : 98291
free_memory          : 91908
free_cpus            : 0
xen_major            : 4
xen_minor            : 1
xen_extra            : .1
xen_caps             : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler        : credit
xen_pagesize         : 4096
platform_params      : virt_start=0xffff800000000000
xen_changeset        : unavailable
xen_commandline      : console=vga dom0_mem=1024M dom0_max_vcpus=1 dom0_vcpus_pin=true
cc_compiler          : gcc version 4.3.4 (Gentoo 4.3.4 p1.1, pie-10.1.5)
cc_compile_by        : root
cc_compile_domain    :
cc_compile_date      : Mon Aug 29 16:24:12 PDT 2011
xend_config_format   : 4
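For reference, the flush-and-re-add sequence described above is roughly the following (a sketch only; "myvm-disk" is a placeholder multipath alias, not one of the real map names):

    # flush every multipath map that has no open users
    multipath -F

    # re-create just the map(s) needed for the guest being started
    multipath myvm-disk

    # sanity-check how many iSCSI sessions and multipath maps are present
    iscsiadm -m session | wc -l
    multipath -ll | grep -c dm-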
Konrad Rzeszutek Wilk
2012-Feb-01 01:30 UTC
Re: Crashing / unable to start domUs due to high number of luns?
On Tue, Jan 31, 2012 at 01:42:23PM -0800, Nathan March wrote:
> Hi All,
>
> We've got a Xen setup based around a Dell iSCSI device, with each Xen
> host having 2 LUNs; we then run multipath on top of that.
> [...]
> Starting a VM results in the following in xend.log:
> [...]
> VmError: Device 0 (vif) could not be connected. Hotplug scripts not working.

Was there anything in the kernel log (dmesg) about vifs? What does your /proc/interrupts look like? Can you provide the dmesg that you get during startup? I am mainly looking for:

NR_IRQS:16640 nr_irqs:1536 16

How many guests are you running when this happens?

One theory is that you are running out of dom0 interrupts. Though I *think* that was made dynamic in 3.0..

Though that would explain your iSCSI network going wonky in the guest - was there anything in dmesg when the guest started going bad?

> I tried turning up udev's log level, but that didn't reveal anything.
> Reading the xenstore for the vif doesn't show anything unusual either:
> [...]
> The bridge device (vlan91) exists, and trying a different bridge makes
> no difference. Removing the VIF completely results in the same error for
> the VBD. Adding debugging to the hotplug/network scripts didn't reveal
> anything; it looks like they aren't even being executed yet. Nothing is
> logged to xen-hotplug.log.

OK, so that would imply the kernel hasn't been able to do the right thing. Hmm.

What do you see when this happens with "udevadm monitor --kernel --udev --property"?
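A quick way to collect the information asked for above, all from the dom0 (a sketch; the exact NR_IRQS line and the names of the Xen interrupts vary with kernel version):

    # IRQ pool reported at boot
    dmesg | grep NR_IRQS

    # any vif-related kernel messages
    dmesg | grep -i vif

    # how many interrupt lines dom0 has actually handed out
    grep -c . /proc/interrupts
    grep -c xen /proc/interrupts

    # watch uevents while starting the guest from another terminal
    udevadm monitor --kernel --udev --property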
Nathan March
2012-Feb-01 19:48 UTC
Re: Crashing / unable to start domUs due to high number of luns?
On 1/31/2012 5:30 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 31, 2012 at 01:42:23PM -0800, Nathan March wrote:
>> [...]
>
> Was there anything in the kernel log (dmesg) about vifs? What does your
> /proc/interrupts look like? Can you provide the dmesg that you get
> during startup? I am mainly looking for:
>
> NR_IRQS:16640 nr_irqs:1536 16
>
> How many guests are you running when this happens?
>
> One theory is that you are running out of dom0 interrupts. Though
> I *think* that was made dynamic in 3.0..
>
> Though that would explain your iSCSI network going wonky in the guest -
> was there anything in dmesg when the guest started going bad?

Was running approximately 15 guests, although this persisted after migrating them off. Nothing in dmesg (dom0 dmesg or xm dmesg) that looked abnormal at all, no references to vifs. Aside from the inability to start a VM, I couldn't seem to find any sort of error anywhere.

All the hosts show the same IRQ counts:

[   34.903763] NR_IRQS:4352 nr_irqs:4352 16

Unfortunately I'm not able to reproduce this now, but I've posted several different copies of /proc/interrupts here: http://pastebin.com/n7PWNeaZ

Full xm / kernel dmesg is uploaded here: http://pastebin.com/AtCvFBDS

> OK, so that would imply the kernel hasn't been able to do the right
> thing. Hmm.
>
> What do you see when this happens with
> "udevadm monitor --kernel --udev --property"?

The remaining server I thought was doing this is apparently not (I was probably mistaken), so the 2 that were definitely doing it have been rebooted and I can't reproduce this at the moment. I've been abusing a free server all morning with a loop to spawn/shutdown a VM repeatedly and flush/rescan multipath, to see if I can reproduce this again. No luck so far unfortunately, but I'll keep trying.

--
Nathan March <nathan@gt.net>
Gossamer Threads Inc. http://www.gossamer-threads.com/
Tel: (604) 687-5804  Fax: (604) 687-5806
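The reproduction loop mentioned above is roughly this shape (a sketch; "testvm" and its config path are placeholders and the sleep values are arbitrary):

    #!/bin/bash
    # repeatedly spawn and shut down a throwaway guest while churning
    # multipath, trying to trigger the hotplug failure
    while true; do
        xm create /etc/xen/testvm.cfg
        sleep 60
        xm shutdown -w testvm
        multipath -F                    # flush unused maps
        iscsiadm -m session --rescan    # rescan sessions for LUN changes
        multipath                       # rebuild the maps
        sleep 10
    done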
Nathan March
2012-Feb-17 07:15 UTC
Re: Crashing / unable to start domUs due to high number of luns?
On 1/31/2012 5:30 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 31, 2012 at 01:42:23PM -0800, Nathan March wrote:
>> [...]
>> The bridge device (vlan91) exists, and trying a different bridge makes
>> no difference. Removing the VIF completely results in the same error for
>> the VBD. Adding debugging to the hotplug/network scripts didn't reveal
>> anything; it looks like they aren't even being executed yet. Nothing is
>> logged to xen-hotplug.log.
>
> OK, so that would imply the kernel hasn't been able to do the right
> thing. Hmm.
>
> What do you see when this happens with
> "udevadm monitor --kernel --udev --property"?

I have this happening again on a server, and running udev monitor (udevadm monitor --kernel --udev --property) prints absolutely *nothing*. I've confirmed on a working Xen host that it does actually print a ton of debugging when I restart a VM. This machine, however, prints nothing when trying to spawn. It still returns the same hotplug failure error.

Any suggestions on what I can do to debug? Still nothing in dmesg.
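One way to narrow down where the events stop when udevadm monitor stays silent is to fire a synthetic uevent by hand (a sketch; any existing device's uevent file will do, lo is just a convenient one):

    # terminal 1: listen for both kernel and udev events
    udevadm monitor --kernel --udev --property

    # terminal 2: force a synthetic "change" uevent on an existing device
    echo change > /sys/class/net/lo/uevent

    # a KERNEL record with no UDEV record suggests udevd is not processing
    # events; no KERNEL record at all suggests nothing is leaving the kernel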
Fajar A. Nugraha
2012-Feb-17 09:15 UTC
Re: Crashing / unable to start domUs due to high number of luns?
On Fri, Feb 17, 2012 at 2:15 PM, Nathan March <nathan@gt.net> wrote:
> I have this happening again on a server, and running udev monitor (udevadm
> monitor --kernel --udev --property) prints absolutely *nothing*. I've
> confirmed on a working Xen host that it does actually print a ton of
> debugging when I restart a VM. This machine, however, prints nothing when
> trying to spawn.

This might sound silly, but is udevd running? I've had cases where it suddenly died.

--
Fajar
Nathan March
2012-Feb-17 09:21 UTC
Re: Crashing / unable to start domUs due to high number of luns?
On 2/17/2012 1:15 AM, Fajar A. Nugraha wrote:
> On Fri, Feb 17, 2012 at 2:15 PM, Nathan March <nathan@gt.net> wrote:
>> I have this happening again on a server, and running udev monitor (udevadm
>> monitor --kernel --udev --property) prints absolutely *nothing*. [...]
>
> This might sound silly, but is udevd running? I've had cases where it
> suddenly died.

Sometimes it takes someone to point out the obvious =) It's running, but whether it's actually working or not may be another question:

root      2476  0.0  0.0   6184   488 ?  S<s  Feb14  0:00 /sbin/udevd --daemon
root     16274  0.0  0.0   6180   284 ?  S<   Feb15  0:00  \_ /sbin/udevd --daemon
root     16285  0.0  0.0   6180   164 ?  S<   Feb15  0:00  \_ /sbin/udevd --daemon

*Although* we're currently on udev 141, which appears to be ancient, so that may be the source of the problem. Will try an upgrade.

- Nathan
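The checks and recovery steps being discussed look roughly like this (a sketch; the init script name is the Gentoo/openrc one and may differ on other distributions):

    # confirm the daemon is alive and which version is installed
    ps axf | grep '[u]devd'
    udevadm --version

    # if it looks wedged, restart it and replay any pending events
    /etc/init.d/udev restart
    udevadm trigger
    udevadm settle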
Konrad Rzeszutek Wilk
2012-Apr-16 14:18 UTC
Re: Crashing / unable to start domUs due to high number of luns?
> >> I've still got 1 dom0 that's exhibiting the problem, if anyone is able
> >> to suggest any further debugging steps?
> >>
> >> - Nathan
> >> [...]
> >> xen_commandline        : console=vga dom0_mem=1024M dom0_max_vcpus=1 dom0_vcpus_pin=true

It could be that you are running out of memory. The pvops kernel (or at least the one you are running) has a bug in that it will allocate pages up to the full 98GB. The solution for that is to use dom0_mem=max:1024M, not dom0_mem=1024M. Please try that.
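For reference, the suggested change goes on the Xen line of the dom0's boot loader entry; a sketch of a GRUB legacy stanza, with placeholder kernel/initrd file names and root device:

    title Xen 4.1.1 / Linux 3.0.3
    root (hd0,0)
    kernel /boot/xen.gz console=vga dom0_mem=max:1024M dom0_max_vcpus=1 dom0_vcpus_pin=true
    module /boot/vmlinuz-3.0.3 root=/dev/sda3 ro
    module /boot/initramfs-3.0.3

The dom0_mem=max:1024M form caps how much memory the dom0 kernel sizes itself for, instead of the plain dom0_mem=1024M shown in the xm info output earlier in the thread.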