PCextreme B.V. - Wido den Hollander
2009-May-20 10:52 UTC
[Xen-devel] Network drop to domU (netfront: rx->offset: 0, size: 4294967295)
Hello,

I am the administrator of a fairly big Xen environment and I have run into a bug.

At random points some domUs lose their network connection for about 1 to 2 minutes, and the following appears in their kernel log:

[548994.957487] printk: 56 messages suppressed.
[548994.957508] netfront: rx->offset: 0, size: 4294967295
[548994.957511] netfront: rx->offset: 0, size: 4294967295

The dom0 specs:
- 2x Intel(R) Xeon(R) CPU E5420
- 64GB DDR2 FB-DIMM
- 2x Intel 80003ES2LAN
- SuperMicro X7DB8 mainboard
- Areca ARC-1680ix-16 RAID Controller

This is an Ubuntu 8.04.2 system with Xen 3.2.1-rc1-pre installed from the Ubuntu repositories. The kernel is a customized kernel (2.6.24-24-xen) based on the Ubuntu source; NR_DYNIRQS has been raised from 256 to 1024 to support more domUs. At the moment this server is hosting about 110 domUs.

In my "xm dmesg" I get the following message:

(XEN) grant_table.c:1262:d0 Bad flags (0) or dom (0). (expected dom 0)

This message is reported about 1000 times in a few days.

I have two of these machines running. They are identical in both software and hardware; the only difference is that one server hosts 110 domUs and the other hosts about 20 domUs. This behaviour is only seen on the machine hosting the 110 domUs.

At first I thought this had something to do with my Intel NIC, but at the moment the domU becomes unavailable the dom0 is still reachable, so it seems to go wrong somewhere inside netfront. (That is what Google told me.)

One of the tests I did was disabling TSO, RX and TX with ethtool in both the dom0 and the domU, but this had no effect; the messages keep coming.

To me this issue seems related to the large number of domUs running on this system, especially since the other, identical machine is not affected.

I took the kernel source and looked at where the netfront message is printed, and it seems to be some kind of memory allocation issue? I have found some old messages with patches, but those did not apply to my current source.

Since this is a running production system I can schedule a reboot for a new kernel, but this takes some time.

-
Kind regards,

Wido den Hollander
Head of System Administration / CSO
Phone Support Netherlands: 0900 9633 (45 cpm)
Phone Support Belgium: 0900 70312 (45 cpm)
Direct phone: (+31) (0)20 50 60 104
Fax: +31 (0)20 50 60 111
E-mail: support@pcextreme.nl
Website: http://www.pcextreme.nl
Knowledge base: http://support.pcextreme.nl/
Network status: http://nmc.pcextreme.nl

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Keir Fraser
2009-May-20 12:43 UTC
Re: [Xen-devel] Network drop to domU (netfront: rx->offset: 0, size: 4294967295)
On 20/05/2009 03:52, "PCextreme B.V. - Wido den Hollander" <wido@pcextreme.nl> wrote:

> I took the kernel source and started looking where the netfront messages
> was being printed and it seemed some kind of memory allocation issue? I
> have found some old messages with patches but those did not apply to my
> current source.

The issue is that dom0 is trying to copy data to a domU-provided buffer, but the buffer details appear bogus. Netback returns a -1 error to netfront, which then prints out its error message.

It's not immediately obvious whether the bug is in netfront or netback. If it's an issue that crops up with many guests then I suppose it's more likely a netback issue, which is a pain.

Do domUs lose their connection at the same time, or is it random individual domUs losing their connection from time to time?

-- Keir
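The size of 4294967295 in the netfront message is what a -1 looks like when it is read as an unsigned 32-bit number. A minimal Python sketch of that arithmetic follows. The helper name is made up for illustration, and it assumes the value is printed as an unsigned 32-bit integer, which the log suggests but the thread does not state.

```python
def as_unsigned_32(value: int) -> int:
    """Reinterpret a signed integer as an unsigned 32-bit value."""
    return value & 0xFFFFFFFF


# The -1 error that netback hands back to netfront, per the explanation above.
error_status = -1
printed_size = as_unsigned_32(error_status)
print(printed_size)  # prints 4294967295
```

Read this way, the "size" in the log is an error code rather than a real packet length.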
PCextreme B.V. - Wido den Hollander
2009-May-20 12:50 UTC
Re: [Xen-devel] Network drop to domU (netfront: rx->offset: 0, size: 4294967295)
Hi Keir,

> The issue is that dom0 is trying to copy data to a domU-provided buffer, but
> the buffer details appear bogus. Netback returns a -1 error to netfront,
> which then prints out its error message. It's not immediately obvious
> whether the bug is in netfront or netback. If it's an issue that crops up
> with many guests then I suppose it's more likely a netback issue, which is a
> pain. Do domUs lose their connection at the same time, or is it random
> individual domUs losing their connection from time to time?

I am running some tests at the moment to verify whether this is a random issue or a "global" issue. So far I think it is random and limited to some individual domUs, but I will keep running tests.

> If it's an issue that crops up with many guests then I suppose it's
> more likely a netback issue, which is a pain.

I already assumed this would be a pain. Is there any way to determine whether it is a netback issue? Adding some verbose messages to the kernel?

-
Kind regards,

Wido den Hollander
Mickael Marchand
2009-May-20 12:55 UTC
Re: [Xen-devel] Network drop to domU (netfront: rx->offset: 0, size: 4294967295)
On Wed, May 20, 2009 at 05:43:59AM -0700, Keir Fraser wrote:

> The issue is that dom0 is trying to copy data to a domU-provided buffer, but
> the buffer details appear bogus. Netback returns a -1 error to netfront,
> which then prints out its error message. It's not immediately obvious
> whether the bug is in netfront or netback. If it's an issue that crops up
> with many guests then I suppose it's more likely a netback issue, which is a
> pain. Do domUs lose their connection at the same time, or is it random
> individual domUs losing their connection from time to time?

For the record, I see the same error messages in domUs when live-migrating them between two servers. It does not always occur, only sometimes. I have far fewer domUs (10-15) on the same kind of servers.

Cheers,
Mik
Keir Fraser
2009-May-20 13:01 UTC
Re: [Xen-devel] Network drop to domU (netfront: rx->offset: 0, size: 4294967295)
On 20/05/2009 05:50, "PCextreme B.V. - Wido den Hollander" <wido@pcextreme.nl> wrote:

>> If it's an issue that crops up with many guests then I suppose it's
>> more likely a netback issue, which is a pain.
>
> I already assumed this would be a pain, any way to determine if it is a
> netback issue? Adding some verbose messages to the kernel?

The visible effects of the bug start in dom0, when it presents a buffer reference (aka a 'grant reference') to Xen as provided to it by domU. Xen notes that the grant reference is bogus (the xm dmesg output shows the flag field of the grant is zero, which means it's currently unused).

Now, does that mean domU forgot to initialise the buffer grant, or got out of sync somehow, or is it dom0 that has got out of sync? It's rather hard to tell. But dom0 is more likely to be affected by scaling to large numbers of domains than a domU is. The logic in domU netfront doesn't change, whereas dom0 netback has the actual multiplexing job. Hence dom0 is more likely to be the culprit.

If you define DEBUG at the very top of dom0's drivers/xen/netback/netback.c you will get more debug output from the dom0 kernel when things go wrong. Unfortunately it may not be much extra help, but extra tracing could be added I suppose (the pain being of course that each such change will require a dom0 reboot or a netback module reload, which itself may require domains to be restarted).

-- Keir
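To see how often Xen rejects a grant and whether the flags field is always zero, the "xm dmesg" output can be tallied with a small script. The sketch below is only an illustration: the pattern is modelled on the single log line quoted in this thread, the function name is hypothetical, and it assumes the flags value is printed in hexadecimal.

```python
import re

# Pattern modelled on the one log line quoted in this thread; other Xen
# versions may word the message differently.
BAD_GRANT = re.compile(
    r"\(XEN\) grant_table\.c:\d+:d(?P<caller>\d+) "
    r"Bad flags \((?P<flags>[0-9a-fA-F]+)\) or dom \((?P<dom>\d+)\)\. "
    r"\(expected dom (?P<expected>\d+)\)"
)


def summarize_bad_grants(log_text):
    """Return (total bad-grant messages, how many had a flags field of zero)."""
    total = 0
    unused = 0
    for line in log_text.splitlines():
        match = BAD_GRANT.search(line)
        if match is None:
            continue
        total += 1
        # Assumption: the flags value is printed in hexadecimal.
        if int(match.group("flags"), 16) == 0:
            unused += 1
    return total, unused


sample = "(XEN) grant_table.c:1262:d0 Bad flags (0) or dom (0). (expected dom 0)\n"
print(summarize_bad_grants(sample * 3))  # prints (3, 3)
```

Saving the output of "xm dmesg" to a file and passing its contents to summarize_bad_grants would give the same two counts for a real log.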
Keir Fraser
2009-May-20 13:10 UTC
Re: [Xen-devel] Network drop to domU (netfront: rx->offset: 0, size: 4294967295)
On 20/05/2009 05:55, "Mickael Marchand" <mikmak@freenux.org> wrote:

> for the record I have the same error messages in domU when migrating
> (live) domUs between 2 servers, it does not always occurs but sometimes.
> I have much much less domUs on the same kind of servers (10-15 domUs)

The error message is quite non-specific. The messages you see during live migration are quite probably benign, whereas seeing this message in 'steady state' is most definitely a bug.

-- Keir
PCextreme B.V. - Wido den Hollander
2009-May-20 13:21 UTC
Re: [Xen-devel] Network drop to domU (netfront: rx->offset: 0, size: 4294967295)
Hi,

Thank you for your explanation. I'm on a short holiday starting tonight, so I won't be able to build a new kernel until next week.

My tests are running: they are simple HTTP checks at an interval of 2 seconds against two domUs on the same dom0. I'll keep this running for a few days and get back with some results.

-
Kind regards,

Wido den Hollander
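Whether the drops hit guests at the same moment can be read off the probe results once they are turned into outage windows. The sketch below does not perform the HTTP checks themselves; it only post-processes (timestamp, ok) samples, and the function names and sample data are hypothetical.

```python
def outage_windows(samples):
    """Turn (timestamp, ok) probe results into (start, end) outage windows.

    start is the first failed probe of a run; end is the first successful
    probe after it, or the last failed probe if the run never recovers.
    """
    windows = []
    start = None
    last_failed = None
    for timestamp, ok in samples:
        if not ok:
            if start is None:
                start = timestamp
            last_failed = timestamp
        elif start is not None:
            windows.append((start, timestamp))
            start = None
    if start is not None:
        windows.append((start, last_failed))
    return windows


def overlapping(windows_a, windows_b):
    """Return pairs of outage windows from two guests that overlap in time."""
    return [
        (a, b)
        for a in windows_a
        for b in windows_b
        if a[0] <= b[1] and b[0] <= a[1]
    ]


# Hypothetical probe results in seconds, one sample every 2 seconds.
guest_a = [(0, True), (2, False), (4, False), (6, True)]
guest_b = [(0, True), (2, True), (4, False), (6, False), (8, True)]
print(overlapping(outage_windows(guest_a), outage_windows(guest_b)))
# prints [((2, 6), (4, 8))]
```

An empty result over several days would point to individual guests dropping out; overlapping windows would point to something shared, such as dom0.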
PCextreme B.V. - Wido den Hollander
2009-May-25 10:02 UTC
Re: [Xen-devel] Network drop to domU (netfront: rx->offset: 0, size: 4294967295)
Hi,

My tests kept running this weekend and it seems the problem affects ALL the domUs at the same moment. For example, on May 24 at 13:54 and at 16:45 the traffic to both domUs I have been monitoring dropped for about 30 seconds.

ifconfig on the dom0 shows a lot of TX drops (14663 at the moment) out of a total of 11047260 packets. That is about 0.1% packet loss over the uptime of the domU (12 days at the moment).

The error messages have not changed so far.

-
Kind regards,

Wido den Hollander
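The packet loss figure can be checked directly from the two ifconfig counters. This is a minimal sketch that assumes the 11047260 total is the right denominator; the message does not say whether the dropped packets are included in that total.

```python
tx_dropped = 14663        # TX drops reported by ifconfig on dom0
total_packets = 11047260  # total packets reported alongside them

# Assumption: loss is taken as drops divided by the reported total.
loss_percent = 100 * tx_dropped / total_packets
print(f"{loss_percent:.2f}%")  # prints 0.13%
```

That is in line with the "about 0.1%" quoted above.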