Hi All,

We have a somewhat serious issue around NLB on Windows 2012 and Xen.
First, let me describe our environment and then I'll let you know what's
wrong.

2 x Debian squeeze boxes running the latest provided AMD64 Xen kernel and
about 100GB of RAM.
These boxes are connected via InfiniBand and DRBD is running over this (IPoIB).
Each VPS runs on a mirrored DRBD device.
Each DRBD device sits on 2 logical volumes: one for data and one for metadata.
The hypervisors exclusively run Windows VMs (Server 2008 R2 and 2012).
The VMs are utilizing the GPLPV drivers (PCI, VBD, Net, etc).
We are using network-bridge.

So here is the trouble. We had somebody trying to set up Windows NLB.
When adding a host it would cause the VM to freeze but also disconnect the
DRBD devices. Everything recovers, but the DRBD devices resync and a bunch
of VMs on the one side (the side with the VM that hangs) get rebooted
by Xen. Here is what we are seeing in messages:

eth0: port 3(nlb2.e0) entering disabled state
eth0: port 3(nlb2.e0) entering disabled state
frontend_changed: backend/vif/65/0: prepare for reconnect
device nlb.e0 entered promiscuous mode
block drbd29: sock was shut down by peer
block drbd29: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
block drbd24: sock was shut down by peer
block drbd24: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
block drbd29: Creating new current UUID
block drbd30: sock was shut down by peer
block drbd30: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
.... and on and on and on with the DRBD disconnecting
block drbd29: md_sync_timer expired! Worker calls drbd_md_sync().
block drbd21: md_sync_timer expired! Worker calls drbd_md_sync().
.... lots of that
block drbd24: Terminating drbd24_asender
block drbd21: asender terminated
block drbd21: Terminating drbd21_asender
....
eth0: port 3(nlb2.e0) entering forwarding state
....
block drbd1: Handshake successful: Agreed network protocol version 91
block drbd1: conn( WFConnection -> WFReportParams )
block drbd38: Handshake successful: Agreed network protocol version 91
block drbd38: conn( WFConnection -> WFReportParams )
block drbd38: Starting asender thread (from drbd38_receiver [16250])
block drbd1: Starting asender thread (from drbd1_receiver [18278])
... Then lots of stuff for the DRBD devices reconnecting and syncing.

This happened three times, each time the user was attempting to add the
second node into NLB. I can reproduce the network adapter dying (it becomes
disabled and is unusable until reboot) in the lab on Server 2012 unless I
follow specific steps, but not the DRBD dying. I can get NLB working, but
I'm mostly concerned about one person's ability to effectively crash 8
other VMs. It looks like whatever is going on is somehow affecting my
DRBD connection. Has anyone seen anything like this before?

Thanks,
Greg
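[For readers following along: a storage layout like the one described above, with a
data LV plus a separate metadata LV per VPS replicated over the IPoIB link, would
look roughly like the drbd.conf sketch below. The resource name, volume group,
device minor and addresses are made-up placeholders, not Greg's actual configuration.]

    # Hypothetical drbd.conf resource -- names, paths and IPs are placeholders.
    resource vps_example {
        protocol C;

        on nodeA {
            device    /dev/drbd23;
            disk      /dev/vg0/vps_example_data;     # LV holding the guest's disk data
            meta-disk /dev/vg0/vps_example_meta[0];  # separate LV for external DRBD metadata
            address   10.99.0.1:7789;                # IPoIB address on nodeA
        }
        on nodeB {
            device    /dev/drbd23;
            disk      /dev/vg0/vps_example_data;
            meta-disk /dev/vg0/vps_example_meta[0];
            address   10.99.0.2:7789;                # IPoIB address on nodeB
        }
    }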
On Thu, Nov 29, 2012 at 01:12:50PM +1300, Greg Zapp wrote:
> Hi All,
>

Hello,

> We have a somewhat serious issue around NLB on Windows 2012 and Xen.
> First, let me describe our environment and then I'll let you know what's
> wrong.
>
> 2 x Debian squeeze boxes running the latest provided AMD64 Xen kernel and
> about 100GB of RAM.

You haven't provided enough information..

- What Xen version are you running?
- What dom0 kernel version are you running?

> These boxes are connected via InfiniBand and DRBD is running over
> this (IPoIB).
> Each VPS runs on a mirrored DRBD device.
> Each DRBD device sits on 2 logical volumes: one for data and one for
> metadata.
> The hypervisors exclusively run Windows VMs (Server 2008 R2 and 2012).
> The VMs are utilizing the GPLPV drivers (PCI, VBD, Net, etc).
> We are using network-bridge.
>
> So here is the trouble. We had somebody trying to set up Windows NLB.
> When adding a host it would cause the VM to freeze but also disconnect the
> DRBD devices. Everything recovers, but the DRBD devices resync and a bunch
> of VMs on the one side (the side with the VM that hangs) get rebooted
> by Xen. Here is what we are seeing in messages:
>
> eth0: port 3(nlb2.e0) entering disabled state
> eth0: port 3(nlb2.e0) entering disabled state
> frontend_changed: backend/vif/65/0: prepare for reconnect
> device nlb.e0 entered promiscuous mode
> block drbd29: sock was shut down by peer
> block drbd29: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe )
> pdsk( UpToDate -> DUnknown )
> block drbd24: sock was shut down by peer
> block drbd24: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe )
> pdsk( UpToDate -> DUnknown )
> block drbd29: Creating new current UUID
> block drbd30: sock was shut down by peer
> block drbd30: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe )
> pdsk( UpToDate -> DUnknown )
> .... and on and on and on with the DRBD disconnecting
> block drbd29: md_sync_timer expired! Worker calls drbd_md_sync().
> block drbd21: md_sync_timer expired! Worker calls drbd_md_sync().
> .... lots of that
> block drbd24: Terminating drbd24_asender
> block drbd21: asender terminated
> block drbd21: Terminating drbd21_asender
> ....
> eth0: port 3(nlb2.e0) entering forwarding state
> ....
> block drbd1: Handshake successful: Agreed network protocol version 91
> block drbd1: conn( WFConnection -> WFReportParams )
> block drbd38: Handshake successful: Agreed network protocol version 91
> block drbd38: conn( WFConnection -> WFReportParams )
> block drbd38: Starting asender thread (from drbd38_receiver [16250])
> block drbd1: Starting asender thread (from drbd1_receiver [18278])
> ... Then lots of stuff for the DRBD devices reconnecting and syncing.
>
> This happened three times, each time the user was attempting to add the
> second node into NLB. I can reproduce the network adapter dying (it becomes
> disabled and is unusable until reboot) in the lab on Server 2012 unless I
> follow specific steps, but not the DRBD dying. I can get NLB working, but
> I'm mostly concerned about one person's ability to effectively crash 8
> other VMs. It looks like whatever is going on is somehow affecting my
> DRBD connection. Has anyone seen anything like this before?
>

Does it happen without GPLPV drivers? Try using plain Intel e1000 emulated
NICs in the Windows VMs.

Any errors in dom0 kernel dmesg? How about in Xen dmesg?

-- Pasi
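[For anyone wanting to try the emulated-NIC suggestion: the vif line of an
xm/xend HVM guest config that asks qemu-dm for an Intel e1000 model looks
roughly like the fragment below. The MAC and bridge name are placeholders,
and note that with GPLPV installed the PV network frontend may still bind
inside the guest unless the PV drivers are deactivated there.]

    # Hypothetical /etc/xen/<guest>.cfg fragment -- mac and bridge are placeholders.
    vif = [ 'type=ioemu, model=e1000, mac=00:16:3e:00:00:01, bridge=eth0' ]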
Hi,

We are running Debian's provided xen-hypervisor-4.0-amd64 (4.0.1-4). The
kernel is 2.6.32-5-xen-amd64 (2.6.32-46) from Debian.

The previously posted log lines were from the dom0's /var/log/messages.
The only thing I'm seeing from xm dmesg is the following:

(XEN) grant_table.c:1717:d0 Bad grant reference

I've also picked up on some more entries from syslog that were not present
in messages. Here is what's present in syslog. Time seems to be sync'd
to the second on both machines:

Nov 28 10:55:03 nodeA kernel: [1239467.400293] eth0: port 11(nlb2.e0) entering disabled state
Nov 28 10:55:03 nodeA kernel: [1239467.400516] eth0: port 11(nlb2.e0) entering disabled state
Nov 28 10:55:04 nodeA kernel: [1239467.731442] frontend_changed: backend/vif/73/0: prepare for reconnect
Nov 28 10:55:04 nodeA logger: /etc/xen/scripts/vif-bridge: offline XENBUS_PATH=backend/vif/73/0
Nov 28 10:55:05 nodeA logger: /etc/xen/scripts/vif-bridge: brctl delif eth0 nlb2.e0 failed
Nov 28 10:55:05 nodeA logger: /etc/xen/scripts/vif-bridge: ifconfig nlb2.e0 down failed
Nov 28 10:55:06 nodeA logger: /etc/xen/scripts/vif-bridge: Successful vif-bridge offline for nlb2.e0, bridge eth0.
Nov 28 10:55:06 nodeA logger: /etc/xen/scripts/vif-bridge: online XENBUS_PATH=backend/vif/73/0
Nov 28 10:55:08 nodeA kernel: [1239471.758583] device nlb2.e0 entered promiscuous mode
Nov 28 10:55:10 nodeA kernel: [1239473.795967] block drbd23: sock was shut down by peer
Nov 28 10:55:27 nodeA kernel: [1239473.795973] block drbd23: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Nov 28 10:55:27 nodeA kernel: [1239473.795980] block drbd23: short read expecting header on sock: r=0
Nov 28 10:55:27 nodeA kernel: [1239474.009951] block drbd31: sock was shut down by peer

Nov 28 10:55:09 nodeB kernel: [1239622.217505] block drbd23: PingAck did not arrive in time.
Nov 28 10:55:09 nodeB kernel: [1239622.217542] block drbd23: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Nov 28 10:55:09 nodeB kernel: [1239622.217551] block drbd23: asender terminated
Nov 28 10:55:09 nodeB kernel: [1239622.217554] block drbd23: Terminating drbd23_asender
Nov 28 10:55:09 nodeB kernel: [1239622.217795] block drbd23: short read expecting header on sock: r=-512
Nov 28 10:55:09 nodeB kernel: [1239622.217887] block drbd23: Creating new current UUID
Nov 28 10:55:09 nodeB kernel: [1239622.218118] block drbd23: Connection closed
Nov 28 10:55:09 nodeB kernel: [1239622.218125] block drbd23: conn( NetworkFailure -> Unconnected )
Nov 28 10:55:09 nodeB kernel: [1239622.218135] block drbd23: receiver terminated
Nov 28 10:55:09 nodeB kernel: [1239622.218137] block drbd23: Restarting drbd23_receiver
Nov 28 10:55:09 nodeB kernel: [1239622.218140] block drbd23: receiver (re)started
Nov 28 10:55:09 nodeB kernel: [1239622.218143] block drbd23: conn( Unconnected -> WFConnection )
Nov 28 10:55:09 nodeB kernel: [1239622.353589] block drbd30: PingAck did not arrive in time.
Nov 28 10:55:09 nodeB kernel: [1239622.353627] block drbd30: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Nov 28 10:55:09 nodeB kernel: [1239622.353637] block drbd30: asender terminated
Nov 28 10:55:09 nodeB kernel: [1239622.353639] block drbd30: Terminating drbd30_asender
Nov 28 10:55:09 nodeB kernel: [1239622.353668] block drbd30: short read expecting header on sock: r=-512
Nov 28 10:55:09 nodeB kernel: [1239622.353754] block drbd30: Creating new current UUID
Nov 28 10:55:09 nodeB kernel: [1239622.388101] block drbd30: Connection closed
Nov 28 10:55:09 nodeB kernel: [1239622.388107] block drbd30: conn( NetworkFailure -> Unconnected )
Nov 28 10:55:09 nodeB kernel: [1239622.388111] block drbd30: receiver terminated
Nov 28 10:55:09 nodeB kernel: [1239622.388113] block drbd30: Restarting drbd30_receiver
Nov 28 10:55:09 nodeB kernel: [1239622.388116] block drbd30: receiver (re)started
Nov 28 10:55:09 nodeB kernel: [1239622.388119] block drbd30: conn( Unconnected -> WFConnection )

I've also looked at the qemu, xend-hotplug, and xend logs and do not see
any telling errors. In xend.log I just see lines pertaining to VMs being
rebooted.

As for GPLPV: I haven't been able to reproduce the "network crashing" and
rebooting in the lab and probably won't be able to until I can get a more
robust, production-like environment set up. Unfortunately I can't risk more
customer downtime by attempting to set up NLB without the GPLPV drivers in
production. If I can manage to reproduce this in staging I will of course
try it without the GPLPV drivers.

-Greg
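[Since the clocks on both nodes are reported to be in sync, one quick way to line
up the event window from both hosts is to pull the same minute out of each
syslog and interleave the lines. Hostnames, paths and the time window below are
just examples matching the excerpt above, not anything from Greg's setup.]

    # Grab the same minute from both nodes and interleave by timestamp.
    for host in nodeA nodeB; do
        ssh root@$host "grep 'Nov 28 10:55' /var/log/syslog"
    done | sort -k 1,3 > /tmp/nlb-event-window.log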
On Fri, Nov 30, 2012 at 11:11:11AM +1300, Greg Zapp wrote:
> Hi,
>
> We are running Debian's provided xen-hypervisor-4.0-amd64 (4.0.1-4). The
> kernel is 2.6.32-5-xen-amd64 (2.6.32-46) from Debian.
>

Also you didn't tell which version of GPLPV you're using.

> The previously posted log lines were from the dom0's /var/log/messages.
> The only thing I'm seeing from xm dmesg is the following:
> (XEN) grant_table.c:1717:d0 Bad grant reference
>

That might be related. Does the log entry about the bad grant reference show
up at the same time as the problem?

-- Pasi
Looks like I have GPLPV 0.10.0.135 packaged in (according to the
changelog). I'm using the Univention-signed drivers.

As for that error, you may be on to something. I'm not sure when those
errors came up as there are no timestamps in the output from 'xm dmesg'.
However, I only see those errors on nodeA, which is the node with the VMs
that crashed.
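[One way to work around the lack of timestamps in 'xm dmesg' is to snapshot its
output periodically, e.g. from cron, so that a new "Bad grant reference" line can
at least be bracketed to a wall-clock window the next time it happens. A minimal
sketch; the output path and retention are arbitrary choices, not from the thread.]

    #!/bin/sh
    # Run from cron every minute; each snapshot is named with the wall-clock time,
    # so diffing consecutive snapshots brackets new hypervisor messages in time.
    OUT=/var/log/xm-dmesg
    mkdir -p "$OUT"
    xm dmesg > "$OUT/snapshot.$(date +%Y%m%d-%H%M%S)"
    # Keep only the newest 1440 snapshots (roughly one day at one per minute).
    ls -1t "$OUT" | tail -n +1441 | while read f; do rm -f "$OUT/$f"; done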
Looks like these are the same errors as referenced here:
http://old-list-archives.xen.org/archives/html/xen-devel/2011-04/msg00453.html.
That thread indicates the trouble could be triggered from the netback code.
If so, I'm curious if some combination of the number of HVM guests running
and the re-connecting of the network adapter could be triggering the issue?

On Fri, Nov 30, 2012 at 1:29 PM, Greg Zapp <greg.zapp@gmail.com> wrote:

> Looks like I have GPLPV 0.10.0.135 packaged in (according to the
> changelog). I'm using the Univention-signed drivers.
>
> As for that error, you may be on to something. I'm not sure when those
> errors came up as there are no timestamps in the output from 'xm dmesg'.
> However, I only see those errors on nodeA, which is the node with the VMs
> that crashed.
>
On Fri, Nov 30, 2012 at 11:11:11AM +1300, Greg Zapp wrote:
> Hi,
>
> We are running Debian's provided xen-hypervisor-4.0-amd64 (4.0.1-4). The
> kernel is 2.6.32-5-xen-amd64 (2.6.32-46) from Debian.
>
> The previously posted log lines were from the dom0's /var/log/messages.
> The only thing I'm seeing from xm dmesg is the following:
> (XEN) grant_table.c:1717:d0 Bad grant reference
>
> I've also picked up on some more entries from syslog that were not present
> in messages. Here is what's present in syslog. Time seems to be sync'd
> to the second on both machines:
> Nov 28 10:55:03 nodeA kernel: [1239467.400293] eth0: port 11(nlb2.e0)
> entering disabled state
> Nov 28 10:55:03 nodeA kernel: [1239467.400516] eth0: port 11(nlb2.e0)
> entering disabled state

How's your networking set up?

I hope the Windows NLB VMs aren't using the same bridge/VLAN as DRBD is using?

-- Pasi

> Nov 28 10:55:04 nodeA kernel: [1239467.731442] frontend_changed:
> backend/vif/73/0: prepare for reconnect
> Nov 28 10:55:04 nodeA logger: /etc/xen/scripts/vif-bridge: offline
> XENBUS_PATH=backend/vif/73/0
> Nov 28 10:55:05 nodeA logger: /etc/xen/scripts/vif-bridge: brctl delif
> eth0 nlb2.e0 failed
> Nov 28 10:55:05 nodeA logger: /etc/xen/scripts/vif-bridge: ifconfig
> nlb2.e0 down failed
> Nov 28 10:55:06 nodeA logger: /etc/xen/scripts/vif-bridge: Successful
> vif-bridge offline for nlb2.e0, bridge eth0.
> Nov 28 10:55:06 nodeA logger: /etc/xen/scripts/vif-bridge: online
> XENBUS_PATH=backend/vif/73/0
> Nov 28 10:55:08 nodeA kernel: [1239471.758583] device nlb2.e0 entered
> promiscuous mode
> Nov 28 10:55:10 nodeA kernel: [1239473.795967] block drbd23: sock was shut
> down by peer
> Nov 28 10:55:27 nodeA kernel: [1239473.795973] block drbd23: peer( Primary
> -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
> Nov 28 10:55:27 nodeA kernel: [1239473.795980] block drbd23: short read
> expecting header on sock: r=0
> Nov 28 10:55:27 nodeA kernel: [1239474.009951] block drbd31: sock was shut
> down by peer
>
> Nov 28 10:55:09 nodeB kernel: [1239622.217505] block drbd23: PingAck did
> not arrive in time.
> Nov 28 10:55:09 nodeB kernel: [1239622.217542] block drbd23: peer(
> Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate
> -> DUnknown )
> Nov 28 10:55:09 nodeB kernel: [1239622.217551] block drbd23: asender
> terminated
> Nov 28 10:55:09 nodeB kernel: [1239622.217554] block drbd23: Terminating
> drbd23_asender
> Nov 28 10:55:09 nodeB kernel: [1239622.217795] block drbd23: short read
> expecting header on sock: r=-512
> Nov 28 10:55:09 nodeB kernel: [1239622.217887] block drbd23: Creating new
> current UUID
> Nov 28 10:55:09 nodeB kernel: [1239622.218118] block drbd23: Connection
> closed
> Nov 28 10:55:09 nodeB kernel: [1239622.218125] block drbd23: conn(
> NetworkFailure -> Unconnected )
> Nov 28 10:55:09 nodeB kernel: [1239622.218135] block drbd23: receiver
> terminated
> Nov 28 10:55:09 nodeB kernel: [1239622.218137] block drbd23: Restarting
> drbd23_receiver
> Nov 28 10:55:09 nodeB kernel: [1239622.218140] block drbd23: receiver
> (re)started
> Nov 28 10:55:09 nodeB kernel: [1239622.218143] block drbd23: conn(
> Unconnected -> WFConnection )
> Nov 28 10:55:09 nodeB kernel: [1239622.353589] block drbd30: PingAck did
> not arrive in time.
> Nov 28 10:55:09 nodeB kernel: [1239622.353627] block drbd30: peer(
> Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate
> -> DUnknown )
> Nov 28 10:55:09 nodeB kernel: [1239622.353637] block drbd30: asender
> terminated
> Nov 28 10:55:09 nodeB kernel: [1239622.353639] block drbd30: Terminating
> drbd30_asender
> Nov 28 10:55:09 nodeB kernel: [1239622.353668] block drbd30: short read
> expecting header on sock: r=-512
> Nov 28 10:55:09 nodeB kernel: [1239622.353754] block drbd30: Creating new
> current UUID
> Nov 28 10:55:09 nodeB kernel: [1239622.388101] block drbd30: Connection
> closed
> Nov 28 10:55:09 nodeB kernel: [1239622.388107] block drbd30: conn(
> NetworkFailure -> Unconnected )
> Nov 28 10:55:09 nodeB kernel: [1239622.388111] block drbd30: receiver
> terminated
> Nov 28 10:55:09 nodeB kernel: [1239622.388113] block drbd30: Restarting
> drbd30_receiver
> Nov 28 10:55:09 nodeB kernel: [1239622.388116] block drbd30: receiver
> (re)started
> Nov 28 10:55:09 nodeB kernel: [1239622.388119] block drbd30: conn(
> Unconnected -> WFConnection )
>
> I've also looked at the qemu, xend-hotplug, and xend logs and do not see
> any telling errors. In xend.log I just see lines pertaining to VMs being
> rebooted.
>
> As for GPLPV: I haven't been able to reproduce the "network crashing" and
> rebooting in the lab and probably won't be able to until I can get a more
> robust, production-like environment set up. Unfortunately I can't risk more
> customer downtime by attempting to set up NLB without the GPLPV drivers in
> production. If I can manage to reproduce this in staging I will of course
> try it without the GPLPV drivers.
>
> -Greg
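[A quick way to answer the bridge/VLAN question above from the dom0 side is to
compare the bridge that carries the guest vifs with the interface that carries
the DRBD replication addresses. The interface name 'ib0' and the drbd.d path
below are assumptions, not details confirmed in the thread.]

    # Which bridge holds the guest vifs? (network-bridge typically renames it to eth0)
    brctl show

    # Which interface carries the DRBD replication addresses from the config?
    ip addr show ib0                                        # ib0 = assumed IPoIB interface
    grep -A2 'address' /etc/drbd.conf /etc/drbd.d/*.res 2>/dev/null

[If the addresses in the DRBD config live on an interface that is also enslaved
to, or bridged with, the guests' bridge, the NLB-induced flooding and port flaps
could plausibly starve the replication link.]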
On Fri, Nov 30, 2012 at 01:36:55PM +0200, Pasi Kärkkäinen wrote:
> On Fri, Nov 30, 2012 at 11:11:11AM +1300, Greg Zapp wrote:
> > Hi,
> >
> > We are running Debian's provided xen-hypervisor-4.0-amd64 (4.0.1-4). The
> > kernel is 2.6.32-5-xen-amd64 (2.6.32-46) from Debian.
> >
> > The previously posted log lines were from the dom0's /var/log/messages.
> > The only thing I'm seeing from xm dmesg is the following:
> > (XEN) grant_table.c:1717:d0 Bad grant reference
> >
> > I've also picked up on some more entries from syslog that were not present
> > in messages. Here is what's present in syslog. Time seems to be sync'd
> > to the second on both machines:
> > Nov 28 10:55:03 nodeA kernel: [1239467.400293] eth0: port 11(nlb2.e0)
> > entering disabled state
> > Nov 28 10:55:03 nodeA kernel: [1239467.400516] eth0: port 11(nlb2.e0)
> > entering disabled state
>
> How's your networking set up?
>
> I hope the Windows NLB VMs aren't using the same bridge/VLAN as DRBD is using?
>

Also, in what mode are you using NLB?

-- Pasi
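[Context for the NLB-mode question: in unicast mode NLB gives all members a shared
cluster MAC and deliberately defeats switch/bridge MAC learning, so a Linux bridge
ends up flooding that traffic to every port; in multicast or IGMP mode the members
keep their own MACs. Roughly, the bridge's learned-address table shows which
situation you are in; 'eth0' below is the bridge name from Greg's logs.]

    # In unicast NLB the cluster MAC (02:bf:...) never appears as a learned source
    # address, so the bridge floods those frames to every port, including the vifs.
    brctl showmacs eth0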
On Fri, Nov 30, 2012 at 02:52:32PM +1300, Greg Zapp wrote:
> Looks like these are the same errors as referenced here:
> http://old-list-archives.xen.org/archives/html/xen-devel/2011-04/msg00453.html.
> That thread indicates the trouble could be triggered from the netback
> code. If so, I'm curious if some combination of the number of HVM guests
> running and the re-connecting of the network adapter could be triggering
> the issue?
>
> On Fri, Nov 30, 2012 at 1:29 PM, Greg Zapp <greg.zapp@gmail.com> wrote:
>
> > Looks like I have GPLPV 0.10.0.135 packaged in (according to the
> > changelog). I'm using the Univention-signed drivers.
>

Hmm.. based on the changelog (http://xenbits.xen.org/ext/win-pvdrivers)
0.10.0.135 is a very old version, from 2009. There have been tons of
changes/fixes after that.

-- Pasi