Maybe related or maybe not, but it was the same VM getting all the scheduling time in my previous post. (SMP Celeron box with 512M of RAM, no himem enabled.)

At the time, four VMs were all compiling, with dom0 copying a Linux source tree from one place to another with rsync. Everything was copacetic until I started the big rsync in dom0, at which point, within a minute or so, vm2 bombed. No messages on the dom0 console or in the VM other than the "Segmentation Fault" in the VM during compilation.

However, the Xen console (Xen compiled with debug=y) spits out:

(XEN) (file=x86_32/emulate.c, line=228) Bailing: not a -ve offset into 4GB segment.

at the time of the segmentation fault.

(And there are lots of these, pretty much any time there is heavy I/O on the machine, all with the same values:)

(XEN) (file=traps.c, line=466) GPF (0004): fc5277a8 -> fc52a294

Any further activity inside vm2 results in more segmentation faults and more "Bailing" messages. The other VMs and dom0 seem to be OK.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"We all enter this world in the    | Support Electronic Freedom
same way: naked; screaming; soaked |        http://www.eff.org/
in blood. But if you live your     |  http://www.anti-dmca.org/
life right, that kind of thing     |---------------------------
doesn't have to stop there." -- Dana Gould

-------------------------------------------------------
This SF.Net email is sponsored by BEA Weblogic Workshop
FREE Java Enterprise J2EE developer tools!
Get your free copy of BEA WebLogic Workshop 8.1 today.
http://ads.osdn.com/?ad_id=4721&alloc_id=10040&op=click
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xen-devel
In reply to: Derek Glidden, Mon 19/07/2004 3:22 PM, Subject: [Xen-devel] segfault in VM

That sounds like the same sort of errors I'm getting, which appeared to be filesystem corruption. First the corruption starts, then everything you do causes a segfault, although I've only seen funny things happen in dom0.

In the limited testing I've done it looks like dom0 by itself is stable, but crashes start occurring once I start up other domains and work dom0 hard (other domains running under light load). I'm running this script in dom0:

    #!/bin/sh
    while [ 1 = 1 ]
    do
      diff file3 file4 && echo okay
    done

where file3 and file4 are files of around 300MB each, and the VM has 128MB of memory with no swap. This ensures the files can't stay cached between passes, so there's lots of I/O.

It crashes most readily when I'm running a few other domains and then start running dom0 out of memory, but nothing conclusive yet.

I'll let this test keep running for another hour or so (otherwise idle, no other domains running), then start my running-out-of-memory program.

I wonder if it is coincidence that we both have SMP boxes... each of the domains only sees 1 CPU, so I wouldn't have thought that would be a problem unless there's a race in Xen itself.

James
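The diff loop in this message only says "okay" or stops; it does not record what a bad read looked like. A minimal sketch of a variant that re-reads one file and compares checksums is below. The function names `checksum_of` and `verify_loop` are made up for illustration, and it assumes only the POSIX `cksum` and `awk` utilities are available in the domain.

```shell
#!/bin/sh
# Sketch only: re-read a file several times and flag any pass whose
# checksum differs from the first read. Helper names are hypothetical.

checksum_of() {
    # With input on stdin, cksum prints "<crc> <size>"; join them as crc:size.
    cksum < "$1" | awk '{ print $1 ":" $2 }'
}

verify_loop() {
    # $1 = file to re-read, $2 = number of passes
    file=$1
    passes=$2
    expected=$(checksum_of "$file")   # baseline taken from the first read
    i=1
    while [ "$i" -le "$passes" ]; do
        actual=$(checksum_of "$file")
        if [ "$actual" != "$expected" ]; then
            echo "pass $i: MISMATCH expected=$expected actual=$actual"
            return 1
        fi
        i=$((i + 1))
    done
    echo "okay: $passes passes"
    return 0
}
```

As with the original script, this only exercises the disk if the file is larger than the memory given to the domain; otherwise later passes are served from cache. The baseline is itself just a first read, so a mismatch says that two reads disagreed, not which of them was correct.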
Clearly there's some fairly random memory corruption going on, which then causes segfaults (if the corruption hits code pages) and filesystem corruption (if the corruption hits buffer-cache pages).

The "Bailing: not a -ve offset" and "GPF (0004):" messages are almost certainly just symptoms of executing a corrupted block of code, i.e. the bug had already triggered some time earlier and probably corrupted a page of glibc or the kernel.

It would be interesting to see whether or not this is SMP-related. It's also interesting that someone said they couldn't reproduce corruption when using 2.6.7 for the non-privileged guest OSes.

 -- Keir
Keir Fraser wrote:
> It would be interesting to see whether or not this is SMP-related.
> It's also interesting that someone said they couldn't reproduce
> corruption when using 2.6.7 for the non-privileged guest OSes.

I'm seeing this corruption on a single-CPU machine, with a single 2.4 guest running but idle. I only ran one 2.6.7 guest, and I didn't give it any work, but it didn't take any load in the 2.4 guest to provoke problems.

The machine uses devicemapper, so I'm going to move some partitions around and see if I still get corruption without it. I can also build Xen with debug=y and try that, once I've got the disk sorted.

Chris.
> I'm seeing this corruption on a single CPU machine, with a single 2.4
> guest running but idle. I only ran one 2.6.7 guest, and I didn't give it
> any work, but it didn't take any load in the 2.4 guest to provoke problems.

Do you mean a single 2.4 or 2.6 guest in addition to your 2.4 DOM0?

 -- Keir
Keir Fraser wrote:
> Do you mean a single 2.4 or 2.6 guest in addition to your 2.4 DOM0?

Yes, that's right. With just the 2.4 domain0 on its own, everything seems fine.

Chris.
On Mon, Jul 19, 2004 at 10:01:54AM +0100, Chris Andrews wrote:
> Keir Fraser wrote:
> > Do you mean a single 2.4 or 2.6 guest in addition to your 2.4 DOM0?
>
> Yes, that's right. With just the 2.4 domain0 on its own, everything
> seems fine.

OK, using an image of dom1 from Chris I have been able to semi-reliably cause all sorts of corruption, including Oopses in dom0 and other domains.

This is on 2 different machines, one of which is thought to be at least semi-reliable.

I first noticed it when doing a bk pull and having BitKeeper decide that my tree was rather corrupt (in a dom0), but with other domains running.

Running:

    while (:) do tar cpf - . | gzip -3vc | cat >/dev/null; done

http://www.yuri.org.uk/~murble/boom/ has some output, with domain0 repeatedly trying to access beyond the end of the device, but this could be caused by random memory corruption.

Running that whilst trying to build something in dom0 seems a fairly reliable way of triggering it.

Also, my dom0 only had 48MB or so of RAM but plenty of swap.

Before the crash I noticed user programs that allocated lots of memory in dom0 randomly segfaulting, including apt-get update and apt-get build-dep.

Again, the oopsen I get were generally rather random, although I have noticed another possible XenoLinux bug. When you boot with panic=30 it takes ages for dom0 to reboot, far longer than 30 seconds, even though after the panic it says it is rebooting in 30 seconds.
It strikes me that many people have started seeing this bug all of a sudden, so it has probably been introduced in the last week.

Perhaps it is worth someone backing off to an older repository version and seeing whether they can reproduce the problems? If we can 'binary chop' the changesets to isolate the bad one, it would be a much easier bug to fix. ;-) Sounds like it would be a fairly tedious process though...

The first person to complain, I think, was Jody Belka, who was using the changeset with comment 'Fairly major fixes to the network frontend driver...' (2004-07-13 18:24:48). Perhaps backing off to a day before that would be a sensible place to start?

 -- Keir
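For anyone attempting the 'binary chop' suggested here, a rough sketch of the search itself follows. Everything in it is an assumption made for illustration: the changesets are listed one per line, oldest first, in a plain text file; the first entry is known good and the last is known bad; and `is_bad` is a hypothetical hook that you would replace with your own "check out, build, boot, run the reproducer" step. No BitKeeper commands are shown because the thread does not give them.

```shell
#!/bin/sh
# Sketch only: binary chop over an ordered list of changeset ids.

is_bad() {
    # Hypothetical placeholder. Replace with a real build-and-test step that
    # returns 0 when the corruption reproduces on changeset "$1".
    echo "is_bad: no test defined for $1" >&2
    return 1
}

first_bad() {
    # $1 = file with one changeset id per line, oldest first.
    # Assumes line 1 is known good and the last line is known bad.
    list=$1
    lo=1
    hi=$(wc -l < "$list")
    hi=$((hi + 0))            # normalise any padding in the wc output
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        cset=$(sed -n "${mid}p" "$list")
        if is_bad "$cset"; then
            hi=$mid           # the bad change is at mid or earlier
        else
            lo=$mid           # the bad change is after mid
        fi
    done
    sed -n "${hi}p" "$list"   # first changeset that tested bad
}
```

With N changesets this needs roughly log2(N) build-and-test rounds. Because the corruption described in this thread is intermittent, a "good" verdict is only as trustworthy as the time the reproducer was left running, so a short test can send the search the wrong way.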
On Jul 19, 2004, at 1:50 AM, James Harper wrote:
> where file3 and file4 are around 300mb files, and the vm has 128mb of
> memory with no swap. This ensures that none of the file is cached so
> there's lots of I/O.
>
> When i've seen it crash most readily has been when i'm running a few
> other domains and then start running dom0 out of memory, but nothing
> conclusive yet.
>
> I'll let this test keep running for another hour (otherwise idle, no
> other domains running) or so then start my running-out-of-memory
> program.

Similarly, I can reproduce it reasonably reliably if I wait until all the VMs are busy with either I/O or high CPU utilization, and then start dom0 doing lots of I/O through an rsync or something along those lines. If I let the system run for a little while to "prime" it, so far I think I can pretty much crash it whenever I want.

> I wonder if it is coincidence that we both have smp boxes... each of
> the domains only sees 1 cpu so I wouldn't have thought that would be a
> problem unless there's a race in xen itself.

I have another, single-CPU, box that I can play with. I'll try to get to building and deploying Xen on it tonight and see if it makes any difference.
On Jul 19, 2004, at 3:27 AM, Keir Fraser wrote:
> It would be interesting to see whether or not this is SMP-related.
> It's also interesting that someone said they couldn't reproduce
> corruption when using 2.6.7 for the non-privileged guest OSes.

I'll be building Xen on a non-SMP box I also have at home tonight, with any luck.

I'll also be running memtest on the SMP box that's been seeing the corruption when I get home, probably followed by CTCS. It was stable for a week or so under reasonably heavy load before I installed Xen on it, but you never know...

If it passes all the testing, I'll build a 2.6.7 guest kernel and give that a try.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I think that's what they mean by  |
"nickels a day can feed a child."  |       http://www.eff.org/
I thought, "How can food be so     | http://www.anti-dmca.org/
cheap over there?" It's not, they  |--------------------------
just eat the nickels." -- Peter Nguyen
On Jul 19, 2004, at 4:28 AM, Chris Andrews wrote:
> I'm seeing this corruption on a single CPU machine, with a single 2.4
> guest running but idle. I only ran one 2.6.7 guest, and I didn't give
> it any work, but it didn't take any load in the 2.4 guest to provoke
> problems.

I haven't really tried very hard to avoid loading the VMs or the dom0 OS yet. I've got too much I want to make them do. :)

> The machine uses devicemapper, so I'm going to move some partitions
> around and see if I still get corruption without it. I can also build
> Xen with debug=y and try that, once I've got the disk sorted.

This box uses dm as well... Clue or coincidence? Probably coincidence... I've had the segfaults in different VMs and dom0, so I doubt it's related to any specific LV or disk sector or anything.
On Jul 19, 2004, at 9:22 AM, Keir Fraser wrote:

> The first person to complain I think was Jody Belka, who was using
> the changeset with comment 'Fairly major fixes to the network frontend
> driver...' (2004-07-13 18:24:48). Perhaps backing off to a day before
> that would be a sensible place to start?

I'm either going to blow your theory out of the water or help a lot because my first "real" build of all the Xen tools & kernel & linux kernels where I actually booted into a dom0 kernel from Xen was from a checkout on either the 12th or 13th. Prior to that I was working out getting everything built under gentoo and not actually running it.

And that's what I've been using until I checked out and rebuilt everything fresh this sunday afternoon and still have the problem as of last night.

Although a VM will segfault while dom0 seems to panic, it's probably the same root problem.
On 19 Jul 2004, at 19:58, Derek Glidden wrote:

> On Jul 19, 2004, at 4:28 AM, Chris Andrews wrote:
>
>> The machine uses devicemapper, so I'm going to move some partitions
>> around and see if I still get corruption without it. I can also build
>> Xen with debug=y and try that, once I've got the disk sorted.
>
> This box uses dm as well... Clue or coincidence? Probably
> coincidence... I've had the segfaults in different VMs and dom0, so I
> doubt it's related to any specific LV or disk sector or anything.

I've moved stuff around my machine's disk so I don't need dm and recompiled without it, and I've seen the same crashes with the guest fs on a loop device, and with the guest fs on an ordinary disk partition, so I guess it's not specific to dm.

Chris.
On Jul 19, 2004, at 2:56 PM, Derek Glidden wrote:

> I'll also be running memtest on the SMP box that's been seeing the
> corruption when I get home as well, probably followed by CTCS. It was
> stable for a week or so under reasonably heavy load before I installed
> Xen on it, but you never know...

FWIW - memtest ran for a couple of hours with no trouble.

I've booted Xen with "nosmp" and will do the same things I've been doing to it to make it break and see what happens.
I'm pretty sure i've seen it earlier than that, but couldn't be certain. Initially I more or less expected instabilities and so wasn't really taking much notice. so I guess my comments above are of absolutely no help at all. :)

i'll be trying a bk pull and build today (under normal linux - 2 cpus and max memory = faster builds) then verify that i can still make it crash, then try nosmp, although i've seen a few posts about single cpu crashes.

james

From: Derek Glidden
Sent: Tue 20/07/2004 5:06 AM
To: xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

On Jul 19, 2004, at 9:22 AM, Keir Fraser wrote:

> The first person to complain I think was Jody Belka, who was using
> the changeset with comment 'Fairly major fixes to the network frontend
> driver...' (2004-07-13 18:24:48). Perhaps backing off to a day before
> that would be a sensible place to start?

I'm either going to blow your theory out of the water or help a lot because my first "real" build of all the Xen tools & kernel & linux kernels where I actually booted into a dom0 kernel from Xen was from a checkout on either the 12th or 13th. Prior to that I was working out getting everything built under gentoo and not actually running it.

And that's what I've been using until I checked out and rebuilt everything fresh this sunday afternoon and still have the problem as of last night.

Although a VM will segfault while dom0 seems to panic, it's probably the same root problem.
i'm not using dm and see lots of crashes.

From: Chris Andrews
Sent: Tue 20/07/2004 5:34 AM
To: Derek Glidden
Cc: xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

On 19 Jul 2004, at 19:58, Derek Glidden wrote:

> On Jul 19, 2004, at 4:28 AM, Chris Andrews wrote:
>
>> The machine uses devicemapper, so I'm going to move some partitions
>> around and see if I still get corruption without it. I can also build
>> Xen with debug=y and try that, once I've got the disk sorted.
>
> This box uses dm as well... Clue or coincidence? Probably
> coincidence... I've had the segfaults in different VMs and dom0, so I
> doubt it's related to any specific LV or disk sector or anything.

I've moved stuff around my machine's disk so I don't need dm and recompiled without it, and I've seen the same crashes with the guest fs on a loop device, and with the guest fs on an ordinary disk partition, so I guess it's not specific to dm.

Chris.
On Jul 19, 2004, at 7:06 PM, Derek Glidden wrote:

> I've booted Xen with "nosmp" and will do the same things I've been
> doing to it to make it break and see what happens.

hmm. Running this same box, same Xen kernel, same linux kernel, but with "nosmp", just creating a domain gives me about two dozen of these:

(XEN) (file=x86_32/emulate.c, line=228) Bailing: not a -ve offset into 4GB segment.
(XEN) (file=x86_32/emulate.c, line=235) !!!! DISALLOWING UNSAFE ACCESS !!!!

But so far, no crashes.
bk pull only showed 2 patches, neither of which affected kernels so I didn't bother recompiling.

I have seen an error (shown by my diff script 'compare' or by xend doing silly things like crashing), by simply starting another domain and pinging it with something like:

ping -s 1400 -i 0.001 192.168.200.200

(ping -f might do it but I think it goes a bit fast)

That occurred once after about 5 minutes, but then not again for the 10 or so minutes I left it running.

running it out of memory with this code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    char *buf;
    int mem = 0;
    int size = 1;
    char rnd;

    rnd = rand() & 255;
    while (1) {
        buf = (char *)malloc(size*1024*1024);
        if (buf != NULL) {
            /* only touch the block if the allocation succeeded */
            memset(buf, rnd, size*1024*1024);
            mem += size;
            printf("%d\n", mem);
        }
    }
}

causes a crash far more quickly. I guess it's possible that those are two different errors though...

James

From: James Harper
Sent: Tue 20/07/2004 10:01 AM
To: Derek Glidden; xen-devel@lists.sourceforge.net
Subject: RE: [Xen-devel] segfault in VM

I'm pretty sure i've seen it earlier than that, but couldn't be certain. Initially I more or less expected instabilities and so wasn't really taking much notice. so I guess my comments above are of absolutely no help at all. :)

i'll be trying a bk pull and build today (under normal linux - 2 cpus and max memory = faster builds) then verify that i can still make it crash, then try nosmp, although i've seen a few posts about single cpu crashes.

james
This could be harmless, or indicate memory corruption.

 -- Keir

> On Jul 19, 2004, at 7:06 PM, Derek Glidden wrote:
>
> > I've booted Xen with "nosmp" and will do the same things I've been
> > doing to it to make it break and see what happens.
>
> hmm. Running this same box, same Xen kernel, same linux kernel, but
> with "nosmp", just creating a domain gives me about two dozen of these:
>
> (XEN) (file=x86_32/emulate.c, line=228) Bailing: not a -ve offset into 4GB segment.
> (XEN) (file=x86_32/emulate.c, line=235) !!!! DISALLOWING UNSAFE ACCESS !!!!
>
> But so far, no crashes.
I've just checked in a few networking fixes that should make things rather more robust in low-memory conditions. I suspect there are still some bugs lurking somewhere, but hopefully this has thinned out the bugs somewhat.

 -- Keir
I still get corruption with these latest patches. In this case I had started 2 domains and was pinging them both fairly hard, I didn't get as far as running it out of memory.

hth

James

From: Keir Fraser
Sent: Tue 20/07/2004 5:59 PM
To: James Harper
Cc: xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

I've just checked in a few networking fixes that should make things rather more robust in low-memory conditions. I suspect there are still some bugs lurking somewhere, but hopefully this has thinned out the bugs somewhat.

 -- Keir
I've checked in some more fixes that might entirely solve the problems that everyone has been seeing.

Unfortunately xen.bkbits.net is down and I'm about to leave for Canada. :-( Hopefully it will be possible to push to bkbits in a few hours...

The Changesets that will hopefully fix everything are:

 1.1116 04/07/20 11:32:39 kaf24@scramble.cl.cam.ac.uk +2 -0
   More backend driver fixes and robustifying.

 1.1115 04/07/20 11:14:24 kaf24@scramble.cl.cam.ac.uk +0 -0
   Merge scramble.cl.cam.ac.uk:/auto/groups/xeno/BK/xeno.bk
   into scramble.cl.cam.ac.uk:/local/scratch/kaf24/xeno

So keep an eye out for these when you pull --- we're very interested to hear of further bugs in builds /with/ these changesets. :-)

 -- Keir
On Tue, Jul 20, 2004 at 11:52:39AM +0100, Keir Fraser wrote:

> I've checked in some more fixes that might entirely solve the problems
> that everyone has been seeing.
>
> Unfortunately xen.bkbits.net is down and I'm about to leave for
> Canada. :-( Hopefully it will be possible to push to bkbits in a few
> hours...
>
> The Changesets that will hopefully fix everything are:
>  1.1116 04/07/20 11:32:39 kaf24@scramble.cl.cam.ac.uk +2 -0
>    More backend driver fixes and robustifying.
>
>  1.1115 04/07/20 11:14:24 kaf24@scramble.cl.cam.ac.uk +0 -0
>    Merge scramble.cl.cam.ac.uk:/auto/groups/xeno/BK/xeno.bk
>    into scramble.cl.cam.ac.uk:/local/scratch/kaf24/xeno
>
> So keep an eye out for these when you pull --- we're very interested
> to hear of further bugs in builds /with/ these changesets. :-)

I've now pushed these changesets to the xen.bkbits repository.

christian
On Jul 19, 2004, at 9:01 PM, Derek Glidden wrote:

> (XEN) (file=x86_32/emulate.c, line=228) Bailing: not a -ve offset into 4GB segment.
> (XEN) (file=x86_32/emulate.c, line=235) !!!! DISALLOWING UNSAFE ACCESS !!!!

After pounding on that box pretty much all evening with "nosmp", I wasn't able to make it crash, either in dom0 or a VM, like I had been able to do in SMP mode.

I had some weirdness in dom0 when I woke up and checked on it this morning - a compile had failed that shouldn't have, but there were no log messages either from Xen or dom0, so I'm not really sure what that was.

Tonight I'll pull the latest changes and rebuild everything, reboot it without "nosmp" (make it SMP again) and see what happens.
Derek Glidden wrote:

> On Jul 19, 2004, at 9:01 PM, Derek Glidden wrote:
>
>> (XEN) (file=x86_32/emulate.c, line=228) Bailing: not a -ve offset into 4GB segment.
>> (XEN) (file=x86_32/emulate.c, line=235) !!!! DISALLOWING UNSAFE ACCESS !!!!
>
> After pounding on that box pretty much all evening with "nosmp", I
> wasn't able to make it crash, either in dom0 or a VM, like I had been
> able to do in SMP mode.

Which revision of the code were you running there? I'd like to give it a go...

> I had some weirdness in dom0 when I woke up and checked on it this
> morning - a compile had failed that shouldn't have, but there were no
> log messages either from Xen or dom0, so I'm not really sure what that was.
>
> Tonight I'll pull the latest changes and rebuild everything, reboot it
> without "nosmp" (make it SMP again) and see what happens.

I've been trying various old revisions as far back as 1.1068[*] (so far), and I can't find one that doesn't blow up. My test is to run James' 'compare' script in domain0 on two large identical files of randomness, and compile various things continuously in a 2.4.26 domain1. It usually takes only a few minutes to start showing differences, and if I leave it I'll get segfaults in domain0, then (with at least one revision) a panic in domain0 and reboot.

Just now I tried the latest code (post Keir's 1.1116/1.1117 csets) and I'm seeing much the same results.

Hardware is a Dell 1650, single CPU, 1G RAM, aacraid controller. I've got rid of the devicemapper stuff I was running before, and domain1's root is on an ordinary disk partition.

Chris.

[*] is this a suitably precise way of specifying revision? 1.1068 is based on the list from:
http://xen.bkbits.net:8080/xeno-unstable.bk/ChangeSet@-2w?nav=index.html
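The 'compare' test used in this thread only reports that two supposedly identical files differ. A minimal C sketch, assuming it would also help to know the byte offset at which they first diverge (the name first_difference and its return codes are made up for illustration and are not part of any script posted here):

```c
#include <stdio.h>

/*
 * Hypothetical helper, not taken from the thread.
 * Returns the byte offset of the first difference between two files,
 * -1 if the files are identical, or -2 if either file cannot be opened.
 */
long first_difference(const char *path_a, const char *path_b)
{
    FILE *a = fopen(path_a, "rb");
    FILE *b = fopen(path_b, "rb");
    long offset = 0;
    long result = -1;

    if (a == NULL || b == NULL) {
        if (a != NULL)
            fclose(a);
        if (b != NULL)
            fclose(b);
        return -2;
    }

    for (;;) {
        int ca = fgetc(a);
        int cb = fgetc(b);

        if (ca != cb) {      /* differing bytes, or one file ended early */
            result = offset;
            break;
        }
        if (ca == EOF)       /* both files ended at the same point */
            break;
        offset++;
    }

    fclose(a);
    fclose(b);
    return result;
}
```

As with the diff loop, the files would need to be larger than the memory available to the domain, so that each pass really reads from disk rather than from the buffer cache.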
I downloaded these (from a tgz that Keir had given me a link to as bk was down - I assume it's identical to his latest fixes) and started my tests running and went to bed, but it looks like I got errors within a very short time.

The tests I was running were my 'compare' script and pinging the two domains I had running with

ping -q -i 0.01 -s 1400 <ip address>

Lots of oopses in the logs, most are probably as a result of the corruption and not indicative of the cause. They look similar to Jody's dump so I won't bother sending them unless someone thinks they might be useful.

btw, can the install be modified to give us a System.map-2.4.26-xen[0U] in /boot? ksymoops would be much happier.

James

From: Keir Fraser
Sent: Tue 20/07/2004 8:52 PM
To: Keir Fraser
Cc: James Harper; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

I've checked in some more fixes that might entirely solve the problems that everyone has been seeing.

Unfortunately xen.bkbits.net is down and I'm about to leave for Canada. :-( Hopefully it will be possible to push to bkbits in a few hours...

The Changesets that will hopefully fix everything are:

 1.1116 04/07/20 11:32:39 kaf24@scramble.cl.cam.ac.uk +2 -0
   More backend driver fixes and robustifying.

 1.1115 04/07/20 11:14:24 kaf24@scramble.cl.cam.ac.uk +0 -0
   Merge scramble.cl.cam.ac.uk:/auto/groups/xeno/BK/xeno.bk
   into scramble.cl.cam.ac.uk:/local/scratch/kaf24/xeno

So keep an eye out for these when you pull --- we're very interested to hear of further bugs in builds /with/ these changesets. :-)

 -- Keir
On Wed, Jul 21, 2004 at 11:14:48AM +1000, James Harper wrote:

> btw, can the install be modified to give us a System.map-2.4.26-xen[0U]
> in /boot? ksymoops would be much happier.

done, the install target will now install the System.map along with the kernel and config file.

christian
Could someone try to isolate this to either the network backend driver or the blkdev backend driver?

The best way to do this is to disable the frontend drivers so that they never try to connect to the backend driver...

To disable networking:
 Edit arch/xen/drivers/netif/frontend/main.c. Change netif_init() to always 'return 0;'.

To disable block devices:
 Edit arch/xen/drivers/blkif/frontend/main.c. Change xlblk_init() to always 'return 0;'.

Oh yes -- the 2.4 sparse tree no longer contains the net frontend driver - you'll find the build tree symlinks to linux-2.6.7-xen-sparse/drivers/xen/net/network.c. So you might want to edit that instead...

Obviously, if you disable blkdevs you'll need to boot off a ramdisk or via a networked mount. :-)

 Cheers,
 Keir
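The 'return 0;' change suggested above can be sketched in isolation. This is only an illustration of the idea: connect_to_backend(), frontend_init_enabled() and frontend_init_disabled() are hypothetical stand-ins, not the real netif_init() or xlblk_init() from the Xen tree, whose bodies are not shown in this thread.

```c
#include <stdio.h>

/* Hypothetical counter so the effect of the early return is observable. */
int backend_connect_attempts = 0;

/* Placeholder for whatever setup work a real frontend init would do. */
void connect_to_backend(void)
{
    backend_connect_attempts++;
}

/* Assumed shape of an unmodified init routine: do the setup, report success. */
int frontend_init_enabled(void)
{
    connect_to_backend();
    return 0;
}

/* Patched shape: report success straight away, so the setup is never reached. */
int frontend_init_disabled(void)
{
    return 0;
}
```

Returning 0 rather than an error code presumably keeps the rest of the boot sequence happy while the driver stays inert, which is what makes this a cheap way to take one driver out of the picture.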
i'll try this out tomorrow morning (too late tonight - need sleep!)

From: Keir Fraser
Sent: Wed 21/07/2004 11:30 PM
To: James Harper
Cc: Keir Fraser; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

Could someone try to isolate this to either the network backend driver or the blkdev backend driver?

The best way to do this is to disable the frontend drivers so that they never try to connect to the backend driver...

To disable networking:
 Edit arch/xen/drivers/netif/frontend/main.c. Change netif_init() to always 'return 0;'.

To disable block devices:
 Edit arch/xen/drivers/blkif/frontend/main.c. Change xlblk_init() to always 'return 0;'.

Oh yes -- the 2.4 sparse tree no longer contains the net frontend driver - you'll find the build tree symlinks to linux-2.6.7-xen-sparse/drivers/xen/net/network.c. So you might want to edit that instead...

Obviously, if you disable blkdevs you'll need to boot off a ramdisk or via a networked mount. :-)

 Cheers,
 Keir
That would be extremely helpful!

If it turns out to be the net backend (probably most likely, although I guess it may not be a backend problem at all, which would be harder to debug), then we can isolate it to the receive or transmit path as follows:

To disable the receive path for guest OSes:
 Edit netif_be_start_xmit in arch/xen/drivers/netif/backend/main.c to always 'goto drop;'.

To disable the transmit path for guest OSes:
 Edit net_tx_action in arch/xen/drivers/netif/backend/main.c. After the call to netif_schedule_work(), add:
    make_tx_response(netif, txreq.id, NETIF_RSP_OKAY);
    netif_put(netif);
    continue;

With one half of the network path disabled, to load up the remaining direction you'll need to flood ping from an external machine to the guest OS (when you disable the guest's transmit path) or flood ping out from the guest (when you disable its rx path). I guess in both cases you'll need a broadcast ping (yuk!) since ARP won't work (needs both tx and rx).

 -- Keir

> I'll try this out tomorrow morning (too late tonight - need sleep!)
FWIW: after coming home last night from work, dom0 crashed right away as soon as I logged in. I rebooted, repaired, and checked out and rebuilt everything and so far, so good. It hasn't generated those same "Bailing" messages when I create a domain at least.

If I can keep everything up and running tonight, I'll start hammering on them using the compare/ping thing and see what I can make break.
On Jul 21, 2004, at 9:30 AM, Keir Fraser wrote:

> Could someone try to isolate this to either the network backend driver
> or the blkdev backend driver?
>
> The best way to do this is to disable the frontend drivers so that
> they never try to connect to the backend driver...

I'll give this a go as well, but is this for dom0 or domU kernels?

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I think that's what they mean by   |
"nickels a day can feed a child."   | http://www.eff.org/
I thought, "How can food be so      | http://www.anti-dmca.org/
cheap over there?" It's not, they   |--------------------------
just eat the nickels." -- Peter Nguyen
> On Jul 21, 2004, at 9:30 AM, Keir Fraser wrote:
>
> > Could someone try to isolate this to either the network backend driver
> > or the blkdev backend driver?
> >
> > The best way to do this is to disable the frontend drivers so that
> > they never try to connect to the backend driver...
>
> I'll give this a go as well, but is this for dom0 or domU kernels?

It's modifying the frontend drivers in the domU kernel so that the data paths in the dom0 backend drivers do not get executed.

i.e., it's the domU kernel that needs recompiling.

 -- Keir
I'm building this now, and am just thinking about how to test this... I was using a ping as my test mechanism. I guess I'll do lots of block device copies. I guess this lends weight to your thoughts that it probably is a net problem and not a block problem.

Instead of changing the source code to disable the net stuff, would it work if I just specified 'nics=0' or is some part of the net subsystem still activated? I'll test this too anyway.

In order to test disabling send or receive, this might be a bit trickier than you first make out. Send-only should be easy enough, just start another domain and then ping it (a manual arp table entry should alleviate the need to broadcast). Receive-only will be trickier. How do you get a domain to send to it? This problem of course assumes that corruption is not limited to the domain... if it is limited to the domain then you should be able to have a send/receive domain and ignore crashes in there, just focus on the crashes in the receive-only domain.

I'm almost confused, but am about to start testing - firstly with no network.

James
> I'm building this now, and am just thinking about how to test this... I was using a ping as my test mechanism. I guess I'll do lots of block device copies. I guess this lends weight to your thoughts that it probably is a net problem and not a block problem.
>
> Instead of changing the source code to disable the net stuff, would it work if I just specified 'nics=0' or is some part of the net subsystem still activated? I'll test this too anyway.

I think the source will need to be changed. In any case, it's a trivial change and then we can be certain that no device channel is being set up.

> In order to test disabling send or receive, this might be a bit trickier than you first make out. Send-only should be easy enough, just start another domain and then ping it (a manual arp table entry should alleviate the need to broadcast). Receive-only will be trickier. How do you get a domain to send to it? This problem of course assumes that corruption is not limited to the domain... if it is limited to the domain then you should be able to have a send/receive domain and ignore crashes in there, just focus on the crashes in the receive-only domain.

That's the reason for the broadcast ping. Unfortunately I'm not sure how useful that will turn out to be -- e.g., we may just end up hosing DOM0.

> I'm almost confused, but am about to start testing - firstly with no network.

Stage 1 (isolating blkdev and network) shouldn't be too hard. Basically we're ensuring the data paths in the backend drivers do not get executed -- they will only ever execute if there is a device channel set up to a frontend in another guest, so disabling the frontend drivers ensures this.

 -- Keir
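For anyone following along, the manual-ARP idea and the broadcast fallback discussed above could look roughly like the sketch below. It is only an illustration, not something posted in this thread: the addresses are placeholders, `run`, `load_guest_rx` and `load_broadcast` are hypothetical helper names, and it assumes a net-tools `arp` and an iputils-style `ping` (flood mode normally needs root). With DRY_RUN set, the commands are printed instead of executed.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and any addresses passed in
# are the caller's own; nothing here comes from the thread itself.

# run CMD...: execute the command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# load_guest_rx GUEST_IP GUEST_MAC: pin a static ARP entry so the sender
# never needs an ARP reply from the guest, then flood ping it quietly
# with 1400-byte payloads (the size used in the earlier ping test).
load_guest_rx() {
    run arp -s "$1" "$2"
    run ping -f -q -s 1400 "$1"
}

# load_broadcast BCAST_ADDR: the broadcast fallback, no ARP entry needed.
load_broadcast() {
    run ping -b -f -q -s 1400 "$1"
}
```

Running it first as `DRY_RUN=1` shows exactly what would be sent before anything touches the ARP table; the static entry only helps in the direction where the sender's own transmit path still works.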
On Jul 21, 2004, at 9:54 PM, Keir Fraser wrote:

> It's modifying the frontend drivers in the domU kernel so that the
> data paths in the dom0 backend drivers do not get executed.
>
> i.e., it's the domU kernel that needs recompiling.

Got it. domU kernel recompiled and now running large amounts of block i/o while dom0 gets pinged and also does large amounts of block i/o.
As a first test I have just disabled networking via nics=0 in the config, and am running this script in dom1:

 #!/bin/sh
 while [ 1 = 1 ]
 do
   dd if=/dev/sda1 of=/dev/null bs=1024 count=128K &
   dd if=/dev/sda1 of=/dev/null bs=1024 skip=256K count=256K
 done

It tells me 'ioctl 801c6d02 not supported by XL blkif' but that doesn't seem to matter. Anyway, there are no crashes so far so I'm thinking at this stage that the block interface stuff is probably fine and I should now concentrate on the network. Disabling the block stuff will be a huge hassle at this stage so I'll have to let it go for the moment.

I think I need a crash course in how all this hangs together before I can understand what I'm testing... My understanding is as follows:

packets sent to dom0.vif1.0 appear at dom1.eth0.
packets sent to dom1.eth0 appear at dom0.vif1.0.

and that's about it. Are they symmetrical? Is the transmit code for dom0.vif1.0 the same as the transmit code for dom1.eth0? Ditto for receive?

James

From: Keir Fraser
Sent: Thu 22/07/2004 12:03 PM
To: James Harper
Cc: Keir Fraser; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

> I think the source will need to be changed. In any case, it's a trivial change and then we can be certain that no device channel is being set up.
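A possible variant of the dd loop above, offered only as a rough sketch: it waits for the background dd before starting the next pass, so readers cannot pile up if the device is slow, and it stops at the first dd failure instead of looping forever. The function names and the size parameters are hypothetical; sizes are counted in 1 KiB blocks, and the numbers in `stress_reads` mirror the 128K/256K figures from the script.

```shell
#!/bin/sh
# read_region DEV SKIP COUNT: read COUNT 1 KiB blocks from DEV, starting
# SKIP blocks in, and throw the data away.
read_region() {
    dd if="$1" of=/dev/null bs=1024 skip="$2" count="$3" 2>/dev/null
}

# one_pass DEV FIRST_COUNT SECOND_SKIP SECOND_COUNT: run two reads at once
# and succeed only if both dd processes exit cleanly.
one_pass() {
    read_region "$1" 0 "$2" &
    bg_pid=$!
    read_region "$1" "$3" "$4"
    fg_status=$?
    wait "$bg_pid"
    bg_status=$?
    [ "$fg_status" -eq 0 ] && [ "$bg_status" -eq 0 ]
}

# stress_reads DEV: repeat passes until one fails.
# 131072 = 128K blocks, 262144 = 256K blocks, as in the original loop.
stress_reads() {
    passes=0
    while one_pass "$1" 131072 262144 262144; do
        passes=$((passes + 1))
    done
    echo "dd failed after $passes clean passes" >&2
    return 1
}
```

In the guest this would be started with something like `stress_reads /dev/sda1`. Note that a dd exit status only catches I/O errors the kernel reports; silent corruption of the kind seen earlier in the thread would still need a compare-style check.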
>=20 > James--_6A1C7D2E-1D2E-47A8-818D-57D5389770AA_ Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable <HTML><HEAD></HEAD> <BODY> <DIV id=3DidOWAReplyText8898 dir=3Dltr> <DIV dir=3Dltr><FONT face=3DArial color=3D#000000 size=3D2>i''m building this now, and am</FONT><FONT face=3DArial size=3D2> just thinking about how to test this... I was using a ping as my test mechanism. I guess i''ll do lots of block device copies. I guess this lends weight to your thoughts that it probably is a net problem and not a block problem.</FONT></DIV> <DIV dir=3Dltr><FONT face=3DArial size=3D2></FONT> </DIV> <DIV dir=3Dltr><FONT face=3DArial size=3D2>Instead of changing the source code to disable the net stuff, would it work if I just specified ''nics=3D0'' or is some part of the net subsystem still activated? </FONT><FONT face=3DArial size=3D2>I''ll test this too anyway.</FONT></DIV> <DIV dir=3Dltr><FONT face=3DArial size=3D2></FONT> </DIV> <DIV dir=3Dltr><FONT face=3DArial size=3D2>In order to test disabling send or receive, this might be a bit trickier than you first make out. Send-only should be easy enough, just start another domain and then ping it (a manual arp table entry should alleviate the need to broadcast). Receive-only will be tricker. How do you get a domain to send to it? This problem of course assumes that corruption is not limited to the domain... 
if it is limited to the domain then you should be able to have a send/receive domain and ignore crashes in there, just focus on the crashes in the receive-only domain.</FONT></DIV> <DIV dir=3Dltr><FONT face=3DArial size=3D2></FONT> </DIV> <DIV dir=3Dltr><FONT face=3DArial size=3D2>i''m almost confused, but am about to start testing - firstly with no network.</FONT></DIV> <DIV dir=3Dltr><FONT face=3DArial size=3D2></FONT> </DIV> <DIV dir=3Dltr><FONT face=3DArial size=3D2>James</FONT></DIV></DIV> <DIV dir=3Dltr> <HR tabIndex=3D-1> <FONT face=3DTahoma size=3D2><B>From:</B> Keir Fraser<BR><B>Sent:</B> Wed 21/07/2004 11:30 PM<BR><B>To:</B> James Harper<BR><B>Cc:</B> Keir Fraser; xen-devel@lists.sourceforge.net<BR><B>Subject:</B> Re: [Xen-devel] segfault in VM<BR></FONT><BR></DIV> <DIV><PRE style=3D"WORD-WRAP: break-word">Could someone try to isolate this to either the network backend driver or the blkdev backend driver? The best way to do this is to disable the frontend drivers so that they never try to coinnect to the backend driver... To disable networking: Edit arch/xen/drivers/netif/frontend/main.c. Change netif_init() to always ''return 0;''. To disable block devices: Edit arch/xen/drivers/blkif/frontend/main.c. Change xlblk_init() to always ''return 0;''. Oh yes -- the 2.4 sparse tree no longer contains the net frontend driver - you''ll find the build tree symlinks to linux-2.6.7-xen-sparse/drivers/xen/net/network.c. So you might want to edit that instead... Obviously, if you disable blkdevs you''ll need to boot off a ramdisk or via a networked mount. :-) Cheers, Keir > I downloaded these (from a tgz that Keir had given me a link to as bk was down - I assume it''s identical to his latest fixes) and started my tests running and went to bed, but it looks like I got errors within a very short time. 
> The tests I was running were my ''compare'' script and pinging the two domains I had running with > ping -q -i 0.01 -s 1400 <ip address> >=20 > Lots of oopses in the logs, most are probably as a result of the corruption and not indicative of the cause. They look similar to Jody''s dump so I won''t bother sending them unless someone thinks they might be useful. >=20 > btw, can the install be modified to give us a System.map-2.4.26-xen[0U] in /boot? ksymoops would be much happier. >=20 > James </PRE></DIV></BODY></HTML> --_6A1C7D2E-1D2E-47A8-818D-57D5389770AA_--
> As a first test I have just disabled networking via nics=0 in the config, and am running this script in dom1:
> #!/bin/sh
> while [ 1 = 1 ]
> do
>   dd if=/dev/sda1 of=/dev/null bs=1024 count=128K &
>   dd if=/dev/sda1 of=/dev/null bs=1024 skip=256K count=256K
> done
>
> it tells me 'ioctl 801c6d02 not supported by XL blkif' but that doesn't seem to matter. Anyway, there are no crashes so far so I'm thinking at this stage that the block interface stuff is probably fine and I should now concentrate on the network. Disabling the block stuff will be a huge hassle at this stage so I'll have to let it go for the moment.

It does seem more likely that the network backend driver is to blame -- it's considerably more complicated than the blkdev driver.

> I think I need a crash course in how all this hangs together before I can understand what I'm testing... My understanding is as follows:
>
> packets sent to dom0.vif1.0 appear at dom1.eth0.
> packets sent to dom1.eth0 appear at dom0.vif1.0.

Yes, it's basically a point-to-point link. The transmit side on each interface is directly linked to the receive side on the other.

> and that's about it. Are they symmetrical? Is the transmit code for dom0.vif1.0 the same as the transmit code for dom1.eth0? Ditto for receive?

No.
 dom1.eth0 is implemented by the frontend driver arch/xen/drivers/netif/frontend/main.c
 dom0.vif* is implemented by arch/xen/drivers/netif/backend/main.c

So they look symmetric to users, but the implementation is not symmetric.

 -- Keir
Okay, I have made the following change in dom0:

To disable the transmit path for guest OSes:
 Edit net_tx_action in arch/xen/drivers/netif/backend/main.c. After the call to netif_schedule_work(), add:
   make_tx_response(netif, txreq.id, NETIF_RSP_OKAY);
   netif_put(netif);
   continue;

I compiled and rebooted with the new kernel, booted dom1, removed vif1.0 from the bridge, gave it its own IP address, added a static ARP entry and pinged away. I could see the packet counters for dom0 and dom1 climbing rapidly, indicating that dom0 was sending packets and dom1 was receiving them, but that a packet sent by dom1 was unable to reach dom0 again.

I got the same sort of crashes after about 10 minutes.

I'm now testing the other half.

James

From: Keir Fraser
Sent: Thu 22/07/2004 12:56 PM
To: James Harper
Cc: Keir Fraser; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

> As a first test I have just disabled networking via nics=0 in the config, and am running this script in dom1:
> #!/bin/sh
> while [ 1 = 1 ]
> do
>   dd if=/dev/sda1 of=/dev/null bs=1024 count=128K &
>   dd if=/dev/sda1 of=/dev/null bs=1024 skip=256K count=256K
> done
>
> it tells me 'ioctl 801c6d02 not supported by XL blkif' but that doesn't seem to matter. Anyway, there are no crashes so far so I'm thinking at this stage that the block interface stuff is probably fine and I should now concentrate on the network. Disabling the block stuff will be a huge hassle at this stage so I'll have to let it go for the moment.

It does seem more likely that the network backend driver is to blame -- it's considerably more complicated than the blkdev driver.

> I think I need a crash course in how all this hangs together before I can understand what I'm testing... My understanding is as follows:
>
> packets sent to dom0.vif1.0 appear at dom1.eth0.
> packets sent to dom1.eth0 appear at dom0.vif1.0.

Yes, it's basically a point-to-point link. The transmit side on each interface is directly linked to the receive side on the other.

> and that's about it. Are they symmetrical? Is the transmit code for dom0.vif1.0 the same as the transmit code for dom1.eth0? Ditto for receive?

No.
 dom1.eth0 is implemented by the frontend driver arch/xen/drivers/netif/frontend/main.c
 dom0.vif* is implemented by arch/xen/drivers/netif/backend/main.c

So they look symmetric to users, but the implementation is not symmetric.

 -- Keir
At this stage, it looks like disabling the receive path for the guest OS (i.e. making netif_be_start_xmit always 'goto drop') means that I can ping from the guest OS all I like with no crashes. I hope that's the right way around to do it...

I'm just looking at that procedure. How is the ring actually managed - what do all the _prod and _cons variables actually represent? And how is synchronisation handled between the domains? I notice there is no spinlock in there; is this done by the calling function?

James

From: Keir Fraser
Sent: Thu 22/07/2004 12:17 AM
To: James Harper
Cc: Keir Fraser; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

That would be extremely helpful! If it turns out to be the net backend (probably most likely, although I guess it may not be a backend problem at all, which would be harder to debug), then we can isolate it to the receive or transmit path as follows:

To disable the receive path for guest OSes:
 Edit netif_be_start_xmit in arch/xen/drivers/netif/backend/main.c to always 'goto drop;'.

To disable the transmit path for guest OSes:
 Edit net_tx_action in arch/xen/drivers/netif/backend/main.c. After the call to netif_schedule_work(), add:
   make_tx_response(netif, txreq.id, NETIF_RSP_OKAY);
   netif_put(netif);
   continue;

With one half of the network path disabled, to load up the remaining direction you'll need to flood ping from an external machine to the guest OS (when you disable the guest's transmit path) or flood ping out from the guest (when you disable its rx path). I guess in both cases you'll need a broadcast ping (yuk!) since ARP won't work (needs both tx and rx).

 -- Keir

> I'll try this out tomorrow morning (too late tonight - need sleep!)
On Jul 21, 2004, at 9:30 AM, Keir Fraser wrote:
>
> To disable networking:
>  Edit arch/xen/drivers/netif/frontend/main.c. Change netif_init() to
>  always 'return 0;'.

Changing netif_init() so that all the returns are "return 0;" doesn't seem to do much: the VMs still get network access, everything looks and acts normal, and there's still corruption after a few minutes of stress testing and network traffic. (Although it does seem to be network related. It ran for a while with no network traffic and no corruption, and within a minute or two of starting the pings, it started to flake out.)

Changing netif_init() so that it immediately does "return 0" runs for a good long time with no corruption, unless you try to send data to one of the vifs, which makes dom0 blow up real good. Running it for a while with just block I/O and ping traffic to dom0 didn't result in any obvious corruption while running, but I did get these messages when I rebooted:

(XEN) (file=/opt/src/xeno/xeno-unstable.bk/xen/include/asm/mm.h, line=215) Unexpected type (saw c0000000 != exp e0000000) for pfn 000032db
(XEN) DOM0: (file=memory.c, line=249) Bad page type for pfn 000032db (d0000005)
(XEN) (file=traps.c, line=466) GPF (0004): fc5277c8 -> fc52a094
Kernel panic: Failed to execute MMU updates
(XEN) Domain 0 shutdown: rebooting machine!

which I've only seen on a reboot when there has been corruption.

Disabling the receive path seems to still let packets through and shows signs of corruption, even with very little network traffic. I'm not sure if that's because I have everything doing NAT instead of bridging, although that doesn't really make sense since it's still the same interface and the code looks like it should simply drop the packets...

It's getting late so I'll have to work on disabling the transmit path and working out how to go about testing the blockdev backend tomorrow. I'll let it run for a while without even up'ing the vifs on the dom0 side, which should preclude any network traffic at all getting to the VMs, and see if there's any corruption going on when running overnight or longer.

Can anyone else corrupt their systems with no network traffic?

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I think that's what they mean by  |
"nickels a day can feed a child."  | http://www.eff.org/
I thought, "How can food be so     | http://www.anti-dmca.org/
cheap over there?" It's not, they  |--------------------------
just eat the nickels."             -- Peter Nguyen
> At this stage, it looks like disabling the receive path for the
> guest os eg netif_be_start_xmit 'goto drop' means that I can ping
> from the guest OS all i like with no crashes. I hope that's the
> right way around to do it...

Yep, an unconditional 'goto drop;' at the start of netif_be_start_xmit will prevent the guest from ever receiving packets. How did you send packets from the guest -- did you poke an ARP entry, or send broadcast packets?

Anyway - it currently sounds like the bug resides in the most complex half of the most complex driver. Who'd've thought it? ;-)

> I'm just looking at that procedure,
> how is the ring actually managed - what do all the _prod and _cons
> variables actually represent? And how is synchronisation handled
> between the domains? i notice there is no spinlock in there, is this
> done by the calling function?

Synchronisation between backend and frontend is lock-free -- for each ring one guy is producer and the other is consumer, so they each update a disjoint set of ring indexes.

Within the backend, there is implicit per-interface locking on netif_be_start_xmit, so we'll never reenter for the same interface. Then when we batch stuff up for a tasklet we're still okay, because tasklets are guaranteed non-reentrant also.

 -- Keir
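The lock-free scheme Keir describes (one side is the only producer, the other the only consumer, and each writes only its own index) can be illustrated with a small C sketch. This is not the Xen netif ring: `struct ring`, `ring_put`, `ring_get` and `RING_SIZE` are hypothetical names chosen for illustration, and a real ring shared between domains would also need memory barriers and event notification, which are omitted here.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u  /* hypothetical size; must be a power of two */

struct ring {
    uint32_t prod;            /* advanced only by the producer */
    uint32_t cons;            /* advanced only by the consumer */
    int      slot[RING_SIZE];
};

/* Producer side: returns 1 if the value was queued, 0 if the ring is full. */
static int ring_put(struct ring *r, int v)
{
    assert(r->prod - r->cons <= RING_SIZE);  /* indexes are free-running */
    if (r->prod - r->cons == RING_SIZE)
        return 0;
    r->slot[r->prod & (RING_SIZE - 1)] = v;
    /* a real shared-memory ring needs a write barrier before this store */
    r->prod++;
    return 1;
}

/* Consumer side: returns 1 and stores a value in *out, or 0 if empty. */
static int ring_get(struct ring *r, int *out)
{
    assert(r->prod - r->cons <= RING_SIZE);
    if (r->cons == r->prod)
        return 0;
    *out = r->slot[r->cons & (RING_SIZE - 1)];
    r->cons++;
    return 1;
}
```

Because `ring_put` never writes `cons` and `ring_get` never writes `prod`, the two sides have no shared variable that both modify, which is the property that lets such a design avoid a spinlock. Under that reading, a `_prod` variable counts entries produced so far and a `_cons` variable counts entries consumed so far.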
> Okay, I have made the following change in dom0:
>
> To disable the transmit path for guest OSes:
>  Edit net_tx_action in arch/xen/drivers/netif/backend/main.c. After the
>  call to netif_schedule_work(), add:
>    make_tx_response(netif, txreq.id, NETIF_RSP_OKAY);
>    netif_put(netif);
>    continue;
>
> compiled and rebooted with the new kernel. booted dom1, removed vif1.0 from the bridge, gave it its own ip address, added a static arp entry and pinged away. I could see the packet counters for dom0 and dom1 climbing rapidly indicating that dom0 was sending packets, dom1 was receiving packets, but that a packet sent by dom1 was unable to reach dom0 again.
>
> I got the same sort of crashes after about 10 minutes.

If you do a test with DPRINTK enabled in linux-2.4.26-xen-sparse/arch/xen/drivers/netif/backend/common.h and with debugging enabled in Xen ('debug=y make') then you may get some useful debugging out of the machine when it all goes horribly wrong. For example, perhaps something is failing apparently spuriously... one example would be that a page reassignment (from dom0 to the other guest) is failing for some weird reason.

If we can get some debugging out when things first go wrong, that would be very useful indeed.

 Thanks,
 Keir
I am trying this now. Within a few seconds of starting the flood ping, dom1 rebooted. There were no messages in the logs to give any hint as to why, though. I tried again and didn't get anything useful either once I started getting noticeable corruption.

Just on the subject of page reassignment, I'm trying to figure out what the code is doing.

In netif_be_start_xmit, there is a check to make sure that the packet is entirely on one page. What happens if the packet is too big for one page, or if there is other data on the same page? (It's all black magic to me at the moment!)

James

From: Keir Fraser
Sent: Thu 22/07/2004 9:54 PM
To: James Harper
Cc: Keir Fraser; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

> Okay, I have made the following change in dom0:
>
> To disable the transmit path for guest OSes:
>  Edit net_tx_action in arch/xen/drivers/netif/backend/main.c. After the
>  call to netif_schedule_work(), add:
>    make_tx_response(netif, txreq.id, NETIF_RSP_OKAY);
>    netif_put(netif);
>    continue;
>
> compiled and rebooted with the new kernel. booted dom1, removed vif1.0 from the bridge, gave it its own ip address, added a static arp entry and pinged away. I could see the packet counters for dom0 and dom1 climbing rapidly indicating that dom0 was sending packets, dom1 was receiving packets, but that a packet sent by dom1 was unable to reach dom0 again.
>
> I got the same sort of crashes after about 10 minutes.

If you do a test with DPRINTK enabled in linux-2.4.26-xen-sparse/arch/xen/drivers/netif/backend/common.h and with debugging enabled in Xen ('debug=y make') then you may get some useful debugging out of the machine when it all goes horribly wrong. For example, perhaps something is failing apparently spuriously... one example would be that a page reassignment (from dom0 to the other guest) is failing for some weird reason.

If we can get some debugging out when things first go wrong, that would be very useful indeed.

 Thanks,
 Keir
> I am trying this now. Within a few seconds of starting the flood ping,
> dom1 rebooted. no messages in the logs to give any hint as to why
> though. Trying again and I didn't get anything useful either once I
> started getting noticeable corruption.

Hmmm.... I guess maybe there's a race somewhere, rather than the problem being a broken error-handling path. Which is a shame, as it's bound to be harder to track down. :-(

> just on the subject of page reassignment, I'm trying to figure out
> what the code is doing.
>
> in netif_be_start_xmit, there is a check to make sure that the packet
> is entirely on 1 page. What happens if the packet is too big for one
> page, or if there is other data on the same page? (it's all black
> magic to me at the moment!)

Unless you're using jumbo Ethernet frames (which you're almost certainly not) then the packet will certainly fit in a page. We also check that the packet buffer is at least half a page in size -- since the slab allocator allocates in powers-of-two, that means the packet buffer must actually be a full aligned page in size.

If our checks are insufficient and a few packets that are sharing their data page are getting through, for example, then we would be pretty screwed! This might be another area to explore -- whether there are a few skbuffs coming through now and then that are of a layout that we mishandle.

 -- Keir
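The half-page argument above can be checked with a small model of a power-of-two allocator. The sketch below is an assumption-laden illustration, not the kernel's slab code: `size_class`, `owns_full_page` and `PAGE_SIZE_BYTES` are hypothetical names, the page size is assumed to be 4096 bytes, and alignment is not modelled. Note that in this model the conclusion holds for requests strictly larger than half a page; a request of exactly half a page would land in the half-page class, so the real check presumably relies on the allocation being larger than the measured buffer, a detail the thread does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u  /* assumed page size (x86) */

/* Round a request up to the next power of two, as a simple model of
 * an allocator that only hands out power-of-two sized buffers. */
static size_t size_class(size_t n)
{
    size_t c = 1;

    assert(n <= SIZE_MAX / 2);  /* keeps the shift below from overflowing */
    while (c < n)
        c <<= 1;
    return c;
}

/* True when, under this model, an n-byte request gets a buffer that is
 * exactly one page in size and therefore cannot share the page with
 * any other allocation. */
static int owns_full_page(size_t n)
{
    return n <= PAGE_SIZE_BYTES && size_class(n) == PAGE_SIZE_BYTES;
}
```

For instance, a 1500-byte request maps to the 2048-byte class and could share its page with another buffer, while anything from 2049 to 4096 bytes maps to the 4096-byte class.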
On Jul 22, 2004, at 8:53 AM, James Harper wrote:
> I am trying this now. Within a few seconds of starting the flood ping,
> dom1 rebooted. no messages in the logs to give any hint as to why
> though. Trying again and I didn't get anything useful either once I
> started getting noticeable corruption.

Just to corroborate, I've been able to pretty reliably induce corruption and I have my Xen kernel compiled with "debug=y". Xen will pretty much continuously spit out "GPF (0004)" messages, but I've only ever seen it output "Bailing" a couple of times on a corruption. Most of the time there's nothing when the corruption starts.
On Jul 22, 2004, at 7:22 AM, Keir Fraser wrote:
>
> Anyway - currently sounds like the bug resides in the most complex
> half of the most complex driver. Who'd've thought it? ;-)

At this point this data is surely redundant but...

When I went to sleep last night I let my box run dom0 and four VMs doing md5sum checks on a couple of large files, hammering the heck out of the block I/O drivers and CPU but with all the ifaces/vifs on the machine down. When I woke up, all compares had been correct for the six hours or so it ran. I re-upped the ifaces and started to ping dom0 and the VMs, and within a minute of the pings starting dom0 started to report incorrect md5sums.
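The repeated-checksum test described above can be sketched as a small C helper that re-reads a file and counts how often the result differs from the first pass. This is only an illustration of the idea: it uses an FNV-1a hash in place of md5sum, and `checksum_file` and `count_mismatches` are hypothetical names. As noted elsewhere in the thread, the file needs to be larger than the memory available for caching, otherwise the re-reads are served from the page cache and exercise very little I/O.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* FNV-1a 64-bit hash of a file's contents (a stand-in for md5sum).
 * Returns 0 on success and stores the hash in *out, or -1 on I/O error. */
static int checksum_file(const char *path, uint64_t *out)
{
    unsigned char buf[4096];
    uint64_t h = 14695981039346656037ULL;  /* FNV-1a 64-bit offset basis */
    size_t n;
    FILE *f;

    assert(path != NULL && out != NULL);
    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        for (size_t i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 1099511628211ULL;         /* FNV-1a 64-bit prime */
        }
    }
    if (ferror(f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    *out = h;
    return 0;
}

/* Re-read the file `rounds` more times and return how many of those
 * passes produced a hash different from the first one, or -1 on error. */
static int count_mismatches(const char *path, int rounds)
{
    uint64_t first, again;
    int bad = 0;

    if (checksum_file(path, &first) != 0)
        return -1;
    for (int i = 0; i < rounds; i++) {
        if (checksum_file(path, &again) != 0)
            return -1;
        if (again != first)
            bad++;
    }
    return bad;
}
```

On a healthy system `count_mismatches` should return 0 however long it runs; a nonzero count while the file itself is unchanged points at corruption somewhere between the disk and the reader.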
It's useful to have the extra data points -- it adds to our confidence that it's the network driver that is somehow at fault here.

Quite how to proceed in narrowing down the problem is unclear. One approach is to perturb the backend driver's data path (e.g., always copying packets into a known-safe page-sized buffer, as a check that our current copy-avoidance checks are not at fault; and replacing the current high-performance but convoluted code for batching hypercalls with something slower but easier to grok). The latter is useful because if the bug goes away then we have a smaller chunk of code to look at; if the bug remains then we end up with a less complex data path that is easier to instrument and bughunt.

If anyone is interested in pursuing this bug independently, the functions most under suspicion are netif_be_start_xmit and net_rx_action, both in linux/arch/xen/drivers/netif/backend/main.c. These two form the data path for packets getting sent to guest OSes.

 -- Keir

> On Jul 22, 2004, at 7:22 AM, Keir Fraser wrote:
> >
> > Anyway - currently sounds like the bug resides in the most complex
> > half of the most complex driver. Who'd've thought it? ;-)
>
> At this point this data is surely redundant but...
>
> When I went to sleep last night I let my box run dom0 and four VMs
> doing md5sum checks on a couple of large files, hammering the heck out
> of the block i/o drivers and CPU but with all the ifaces/vifs on the
> machine down. When I woke up, all compares had been correct for the
> six hours or so it ran. I re-upped the ifaces and started to ping dom0
> and the VMs and within a minute of the pings starting dom0 started to
> report incorrect md5sums.
I just made a change so that the skbuff is always copied in netif_be_start_xmit, but it still crashes, which most likely means that bit is fine or at least isn't the only code containing bugs.

As another test I also put the 'goto done;' after the 'if ( skb_shared(skb) || skb_cloned(skb) || ...' block (still blocking the receive, but doing it later) and there were no crashes, so I'm comfortable that we've exhausted netif_be_start_xmit as a source of bugs.

So I guess that leaves net_rx_action. I'm unsure on one thing though: the pages that get passed from dom0 to domU, how and where (if at all) do they get recycled back to dom0? Is it possible that domU could still write to a page that dom0 thought it had free to use for something else? If so, where would that be?

Keir: have you been able to reproduce these errors at all?

James

From: Keir Fraser
Sent: Fri 23/07/2004 3:48 AM
To: Derek Glidden
Cc: xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

It's useful to have the extra data points -- it adds to our confidence that it's the network driver that is somehow at fault here.

Quite how to proceed in narrowing down the problem is unclear. One approach is to perturb the backend driver's data path (e.g., always copying packets into a known-safe page-sized buffer, as a check that our current copy-avoidance checks are not at fault; and replacing the current high-performance but convoluted code for batching hypercalls with something slower but easier to grok). The latter is useful because if the bug goes away then we have a smaller chunk of code to look at; if the bug remains then we end up with a less complex data path that is easier to instrument and bughunt.

If anyone is interested in pursuing this bug independently, the functions most under suspicion are netif_be_start_xmit and net_rx_action, both in linux/arch/xen/drivers/netif/backend/main.c. These two form the data path for packets getting sent to guest OSes.

 -- Keir
Yeah, it turns out I can reproduce this bug trivially by md5summing a file just slightly bigger than dom0's memory allocation, while floodpinging dom1.

I'm trying out a few things right now, so hopefully I'll be able to report progress on this evil bug r.s.n. :-)

 -- Keir
That's comforting. I was starting to think of looking for gcc bugs and the like. Even so, it might be useful to collect the gcc versions of anyone who has either seen the bug or tried to reproduce it and couldn't. Mine reports itself as "gcc (GCC) 3.3.4 (Debian 1:3.3.4-2)" with "gcc --version".

James

From: Keir Fraser
Sent: Fri 23/07/2004 11:11 AM
To: James Harper
Cc: Keir Fraser; Derek Glidden; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

Yeah, it turns out I can reproduce this bug trivially by md5summing a file just slightly bigger than dom0's memory allocation, while floodpinging dom1.

I'm trying out a few things right now, so hopefully I'll be able to report progress on this evil bug r.s.n. :-)

 -- Keir
Okay, so I found that the problem is due to overly-aggressive merging
of block requests in the IDE driver. The code assumes that if buffers
are adjacent in virtual or physical address space then they can be
merged --- this isn't always the case over Xen since those physical
addresses may map to different real machine pages.

I've checked in a fix that I think is safe for IDE --- in the
occasional instances that a merged scatter-gather list is invalid, we
should now cause IDE to fall back to a super-safe mode (basically
PIO). On my system this happens so rarely that performance
shouldn't be affected.

If this also turns out to be a problem for SCSI then we may need to do
some more work --- our safety check will still trigger and we will
still fail the scatter-gather list, but it doesn't look as though
many SCSI drivers pick up the error return code and do anything
sane. This is a bug in those drivers, but that is small comfort to us
in our aim to work with the full range of Linux SCSI drivers.

What we need now is some more checking, particularly with SCSI block
devices, to see whether there are any more bugs to shake out.

 -- Keir

> Yeah, it turns out I can reproduce this bug trivially by md5summing a
> file just slightly bigger than dom0's memory allocation, while
> floodpinging dom1.
>
> I'm trying out a few things right now, so hopefully I'll be able to
> report progress on this evil bug r.s.n. :-)
>
>  -- Keir
>
> > I just made a change so that the skbuf is always copied in
> > netif_be_start_xmit but it still crashes, which means most likely
> > that bit is fine or at least isn't the only code containing bugs.
> >
> > As another test I also put the 'goto done;' after the
> > 'if ( skb_shared(skb) || skb_cloned(skb) || ...' block, (still block
> > the receive but do it later) and there were no crashes, so I'm
> > comfortable that we've exhausted netif_be_start_xmit as a source
> > for bugs.
> >
> > So I guess that leaves net_rx_action. I'm unsure on one thing
> > though, the pages that get passed from dom0 to domU, how/where/do
> > they get recycled back to dom0? Is it possible that domU could still
> > write to a page that dom0 thought it had free to use for something
> > else? If so, where would that be?
> >
> > Keir: have you been able to reproduce these errors at all?
> >
> > James
> >
> > From: Keir Fraser
> > Sent: Fri 23/07/2004 3:48 AM
> > To: Derek Glidden
> > Cc: xen-devel@lists.sourceforge.net
> > Subject: Re: [Xen-devel] segfault in VM
> >
> > It's useful to have the extra data points -- it adds to our confidence
> > that it's the network driver that is somehow at fault here.
> >
> > Quite how to proceed in narrowing down the problem is
> > unclear. One approach is to perturb the backend driver's data path
> > (e.g., always copying packets into a known-safe page-sized buffer, as
> > a check that our current copy-avoidance checks are not at fault; and
> > replacing the current high-performance but convoluted code for
> > batching hypercalls with something slower but easier to grok). The
> > latter is useful because if the bug goes away then we have a smaller
> > chunk of code to look at; if the bug remains then we end up with a
> > less complex data path that is easier to instrument and bughunt.
> >
> > If anyone is interested in pursuing this bug independently, the
> > functions most under suspicion are netif_be_start_xmit and
> > net_rx_action, both in linux/arch/xen/drivers/netif/backend/main.c.
> > These two form the data path for packets getting sent to guest OSes.
> >
> >  -- Keir
> >
> > > On Jul 22, 2004, at 7:22 AM, Keir Fraser wrote:
> > >
> > > > Anyway - currently sounds like the bug resides in the most complex
> > > > half of the most complex driver. Who'd've thought it? ;-)
> > >
> > > At this point this data is surely redundant but...
> > >
> > > When I went to sleep last night I let my box run dom0 and four VMs
> > > doing md5sum checks on a couple of large files, hammering the heck out
> > > of the block i/o drivers and CPU but with all the ifaces/vifs on the
> > > machine down. When I woke up, all compares had been correct for the
> > > six hours or so it ran. I re-upped the ifaces and started to ping dom0
> > > and the VMs and within a minute of the pings starting dom0 started to
> > > report incorrect md5sums.
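To make the merging problem above concrete, here is a toy model in C. It is only a sketch under stated assumptions, not the fix that was checked in: the p2m_table structure, the function names and the 4 KiB page size are all made up for illustration. It shows why "adjacent in the guest's physical address space" and "adjacent in machine memory" are different tests once a per-guest frame mapping sits in between.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12                      /* assumed: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical pseudo-physical -> machine frame table for one guest. */
struct p2m_table {
    const uint32_t *mfn;    /* mfn[pfn] = machine frame backing guest frame pfn */
    size_t          nr_frames;
};

/* Translate a guest "physical" address into a machine address. */
static uint64_t phys_to_machine(const struct p2m_table *t, uint64_t phys)
{
    uint64_t pfn = phys >> PAGE_SHIFT;
    assert(pfn < t->nr_frames);
    return ((uint64_t)t->mfn[pfn] << PAGE_SHIFT) | (phys & (PAGE_SIZE - 1));
}

/* The naive test: buffer b starts where buffer a ends, in the guest's view. */
static int phys_adjacent(uint64_t a_phys, size_t a_len, uint64_t b_phys)
{
    return a_phys + a_len == b_phys;
}

/* The safe test: the same comparison, made after translating both sides.
 * It only examines the boundary between a and b; each buffer on its own
 * is assumed to be valid already. */
static int machine_adjacent(const struct p2m_table *t,
                            uint64_t a_phys, size_t a_len, uint64_t b_phys)
{
    assert(a_len > 0);
    uint64_t a_last = phys_to_machine(t, a_phys + a_len - 1);
    return a_last + 1 == phys_to_machine(t, b_phys);
}
```

With guest frames 0 and 1 backed by machine frames 7 and 3, two page-sized buffers at guest addresses 0x0000 and 0x1000 pass the naive test but fail the machine-level one; two buffers inside the same page pass both.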
On Jul 23, 2004, at 12:01 PM, Keir Fraser wrote:

> Okay, so I found that the problem is due to overly-aggressive merging
> of block requests in the IDE driver. The code assumes that if buffers
> are adjacent in virtual or physical address space then they can be
> merged --- this isn't always the case over Xen since those physical
> addresses may map to different real machine pages.

And there was much rejoicing!

Thanks Keir for working so hard on digging this problem out and getting a fix in. Other than the doms not dying after a halt, for which you said you checked in a fix, and the occasional strange unbalanced dom scheduling, which I understand is being worked on, the -unstable branch has worked very well for me so far. (Well, outside of the random crashes... :)

I'll do a pull tonight when I get home and rebuild everything and start hammering on it some more.

> I've checked in a fix that I think is safe for IDE --- in the
> occasional instances that a merged scatter-gather list is invalid, we
> should now cause IDE to fall back to a super-safe mode (basically
> PIO). On my system this happens so occasionally that performance
> shouldn't be affected.

Does it revert back to "normal" behaviour for subsequent operations? i.e. is the "basically PIO" mode just for the operation that fails?

> What we need now is some more checking, particularly with SCSI block
> devices, to see whether there are any more bugs to shake out.

Would it help at all for me to set up a box as ide-scsi, or is it strictly the data path inside the individual SCSI drivers that could cause problems?
> > I've checked in a fix that I think is safe for IDE --- in the
> > occasional instances that a merged scatter-gather list is invalid, we
> > should now cause IDE to fall back to a super-safe mode (basically
> > PIO). On my system this happens so occasionally that performance
> > shouldn't be affected.
>
> Does it revert back to "normal" behaviour for consequent operations?
> i.e. is the "basically PIO" mode just for the operation that fails?

That is correct -- in practice very very few requests should end up using PIO.

> > What we need now is some more checking, particularly with SCSI block
> > devices, to see whether there are any more bugs to shake out.
>
> Would it help at all for me to set up a box as ide-scsi, or is it
> strictly the data path inside the individual SCSI drivers that could
> cause problems?

Stress-testing in as many environments and setups as possible is very welcome!

I've also contacted the linux-kernel mailing list to find out whether anyone there has a better fix, or would be amenable to some Xen-friendly patches being sent their way. ;-)

 -- Keir
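The "only the failing operation falls back" behaviour confirmed above can be pictured with a small C sketch. This is an illustration under assumptions, not the IDE driver's code: blk_request, sg_valid, choose_mode and submit_all are hypothetical names, and sg_valid simply stands in for the outcome of the scatter-gather safety check.

```c
#include <assert.h>
#include <stddef.h>

enum xfer_mode { XFER_DMA, XFER_PIO };

/* Hypothetical request: sg_valid stands in for the result of the
 * scatter-gather safety check on this request's merged buffer list. */
struct blk_request {
    int sg_valid;
};

/* Choose the transfer mode for a single request. No state is carried
 * over, so one bad list cannot pin later requests to PIO. */
static enum xfer_mode choose_mode(const struct blk_request *req)
{
    return req->sg_valid ? XFER_DMA : XFER_PIO;
}

/* Walk a batch, recording the mode used per request; returns how many
 * requests had to fall back to PIO. */
static size_t submit_all(const struct blk_request *reqs, size_t n,
                         enum xfer_mode *modes_out)
{
    size_t pio = 0;
    assert(reqs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        modes_out[i] = choose_mode(&reqs[i]);
        if (modes_out[i] == XFER_PIO)
            pio++;
    }
    return pio;
}
```

The point of the design is that the decision is made per request: a batch of three requests where only the middle one fails the check ends up as DMA, PIO, DMA.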
On 23 Jul 2004, at 17:01, Keir Fraser wrote:

> What we need now is some more checking, particularly with SCSI block
> devices, to see whether there are any more bugs to shake out.

I've given this change a go on my PE1650 (aacraid driver). Unfortunately this seems to be one of the SCSI drivers that doesn't correctly handle the error condition.

Running my usual test ('compare' in dom0, compiles in other domains), I don't see any differences in the compares, but after a few minutes, I get the following on the console, and everything is stuck waiting for disk.

    aacraid: cmd len 00000000 cmd underflow 00010000
    aacraid: Host adapter reset request. SCSI hang ?

The latter message repeats every few seconds. I rebooted the box with the Xen console after a few lines. I'm going to try the aacraid driver from 2.4.27-rc3, which I believe has had some attention recently.

Chris.
My system doesn't have any IDE devices, it's SCSI only. The SCSI driver is aic7xxx, and I'm still having crashes even with the latest checkout. I noticed in the logs for the first time some SCSI errors in amongst all the others, but given the nature of the crash I don't know if that means anything.

Is this the same problem that we thought was in the network code? I could not readily induce the crash without creating lots of network traffic.

James

> From: Keir Fraser
> Sent: Sat 24/07/2004 2:01 AM
> To: Keir Fraser
> Cc: James Harper; Derek Glidden; xen-devel@lists.sourceforge.net
> Subject: Re: [Xen-devel] segfault in VM - FIXED!
>
> Okay, so I found that the problem is due to overly-aggressive merging
> of block requests in the IDE driver. [...]
On 24 Jul 2004, at 13:47, I wrote:

> I'm just testing a patch which disables merging in the scsi layer when
> it believes it has contiguous requests in different pages. I think
> this is more pessimistic than it needs to be, as the pages may after
> all be contiguous, but it does allow some merging to happen and so
> far seems to be stable.

I've given this a bit more testing, and it seems to be working fine - the machine is now running a dom0 kernel built while running the patch.

As for performance, it's 'not bad' -- I've just done a bonnie++ run, and some compiles. Based on sticking printks in and watching the console, it's allowing merges much more often than not, but still I suspect not as much as it could. Probably it should use something with more arch-knowledge than page_to_phys().

patch for linux-2.4.26-xen0:
http://munky.nodnol.org/~chris/xen_scsi_merge.diff

bonnie++ stats:
http://munky.nodnol.org/~chris/munkyII_stats.txt

Chris.
I'm building this now.

The way I see it, currently it must be incorrect for only a very very small number of cases or the system would crash and burn almost instantly. So in theory, unless these cases are undetectable, or the cost of detecting them is high for some reason, the performance difference should be almost unnoticeable.

I assume the patch would only affect dom0, so it shouldn't matter whether domU is patched or not. Is there a way of installing a patch so that it's picked up by 'make world'?

I'll follow up with results shortly.

James

> From: Chris Andrews
> Sent: Sun 25/07/2004 1:54 AM
> To: xen-devel@lists.sourceforge.net
> Subject: Re: [Xen-devel] segfault in VM - FIXED!
>
> I've given this a bit more testing, and it seems to be working fine -
> the machine is now running a dom0 kernel built while running the patch.
> [...]
So far so good. It's been running for a while now with no errors, much longer than it would have survived previously.

James

> From: James Harper
> Sent: Sun 25/07/2004 7:27 PM
> To: Chris Andrews; xen-devel@lists.sourceforge.net
> Subject: RE: [Xen-devel] segfault in VM - FIXED!
>
> I'm building this now. [...]
On 25 Jul 2004, at 12:24, James Harper wrote:

> so far so good. It's been running for a while now with no errors. much
> longer than it would have survived previously.

It's broken for me - I suspect it's that although it checks that requests to be merged begin in the same page, it doesn't also check they end in that same page. I'm testing a version now that tries to do that.

Chris.
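The suspected gap (checking where the requests begin but not where they end) can be shown with a short C sketch. This is a guess at the shape of the problem, not the contents of the patch under test: the function names and the 4 KiB page size are assumptions, and addresses are plain integers rather than anything the SCSI layer actually passes around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumed: 4 KiB pages */

/* Page number containing a given address. */
static uint64_t page_of(uint64_t addr)
{
    return addr >> PAGE_SHIFT;
}

/* Incomplete check: b follows a directly and both *begin* in the same
 * page, but b is free to run on into the next page. */
static int merge_ok_start_only(uint64_t a, size_t a_len, uint64_t b)
{
    return a + a_len == b && page_of(a) == page_of(b);
}

/* Stricter check: b follows a directly and the last byte of b is still
 * in the page where a begins, so the merged buffer never leaves it. */
static int merge_ok_whole(uint64_t a, size_t a_len, uint64_t b, size_t b_len)
{
    assert(a_len > 0 && b_len > 0);
    return a + a_len == b && page_of(a) == page_of(b + b_len - 1);
}
```

For a buffer at 0x0800 of length 0x400 followed by one at 0x0C00 of length 0x800, the start-only check allows the merge even though the second buffer ends at 0x13FF in the next page; the stricter check refuses it.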
I was running my diff script all night which itself reported no errors, but this morning I have the following in dom0's kern.log:

Jul 25 21:53:58 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 25 23:02:49 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 25 23:31:25 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 01:07:55 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 01:38:59 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 02:35:21 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 02:47:33 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 04:55:37 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 06:32:56 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 06:59:22 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 08:00:19 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 08:24:50 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2

and in dom2:

Jul 25 21:53:58 mail2 kernel: bad buffer on RX ring!(-1)
Jul 25 23:02:49 mail2 kernel: bad buffer on RX ring!(-1)
Jul 25 23:31:25 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 01:07:55 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 01:38:59 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 02:35:21 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 02:47:33 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 04:55:37 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 06:32:56 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 06:59:22 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 08:00:19 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 08:24:50 mail2 kernel: bad buffer on RX ring!(-1)

so something funny is going on. I started my diff and ping scripts at about 21:20. At least the above error is detected though.

James

> From: Chris Andrews
> Sent: Mon 26/07/2004 1:08 AM
> To: James Harper
> Cc: xen-devel@lists.sourceforge.net
> Subject: Re: [Xen-devel] segfault in VM - FIXED!
>
> It's broken for me - I suspect it's that although it checks that
> requests to be merged begin in the same page, it doesn't also check
> they end in that same page. [...]
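For a rough sense of how often the failure in the log above occurs, the gaps between the twelve timestamps can be added up. The C sketch below is only a convenience for that arithmetic; parse_hms, total_span and mean_gap are made-up helper names, and it assumes the stamps are in order with less than 24 hours between neighbours.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Parse "HH:MM:SS" into seconds since midnight; returns -1 on bad input. */
static long parse_hms(const char *s)
{
    int h, m, sec;
    if (sscanf(s, "%2d:%2d:%2d", &h, &m, &sec) != 3)
        return -1;
    if (h < 0 || h > 23 || m < 0 || m > 59 || sec < 0 || sec > 59)
        return -1;
    return h * 3600L + m * 60L + sec;
}

/* Seconds from the first to the last stamp, summing the individual gaps.
 * A negative gap is taken to mean the clock passed midnight once. */
static long total_span(const char *const *stamps, size_t n)
{
    long total = 0;
    for (size_t i = 1; i < n; i++) {
        long prev = parse_hms(stamps[i - 1]);
        long cur = parse_hms(stamps[i]);
        assert(prev >= 0 && cur >= 0);
        long gap = cur - prev;
        if (gap < 0)
            gap += 24L * 3600L;
        total += gap;
    }
    return total;
}

/* Mean gap in whole seconds (integer division). */
static long mean_gap(const char *const *stamps, size_t n)
{
    assert(n >= 2);
    return total_span(stamps, n) / (long)(n - 1);
}
```

Fed the twelve times from the log (21:53:58 through 08:24:50 the next morning), the span is 37852 seconds over eleven gaps, a mean of about 3441 seconds, so roughly one failure an hour under this load.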
> On 23 Jul 2004, at 17:01, Keir Fraser wrote:
>
> > What we need now is some more checking, particularly with SCSI block
> > devices, to see whether there are any more bugs to shake out.
>
> I've given this change a go on my PE1650 (aacraid driver).
> Unfortunately this seems to be one of the SCSI drivers that doesn't
> correctly handle the error condition.
>
> Running my usual test ('compare' in dom0, compiles in other domains), I
> don't see any differences in the compares, but after a few minutes, I
> get the following on the console, and everything is stuck waiting for
> disk.
>
> aacraid: cmd len 00000000 cmd underflow 00010000
> aacraid: Host adapter reset request. SCSI hang ?
>
> The latter message repeats every few seconds. I rebooted the box with
> the Xen console after a few lines. I'm going to try the aacraid driver
> from 2.4.27-rc3, which I believe has had some attention recently.

Looks like the SCSI-merge code will have to be modified. I'll do IDE-merge code at the same time, so that we just merge less aggressively rather than falling back to PIO transfers.

I'll leave the check in pci_map_sg(), but it shouldn't ever trigger after I patch the IDE and SCSI merge routines, so I'll add a warning message if an invalid scatter-gather list is detected.

 -- Keir
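A last-line check of the kind described above, one that should never fire once the merge routines are fixed but warns if it does, might look roughly like this in C. It is a sketch only and not the code in pci_map_sg(): sg_entry, segment_ok and check_sg_list are hypothetical names, the frame table is a plain array, and the warning goes to stderr rather than the kernel log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12   /* assumed: 4 KiB pages */

/* Hypothetical scatter-gather entry: guest "physical" start and length. */
struct sg_entry {
    uint64_t phys;
    size_t   len;
};

/* A segment is usable only if every guest page it spans is backed by
 * consecutive machine frames. p2m[pfn] is the machine frame for pfn. */
static int segment_ok(const uint32_t *p2m, size_t nr_frames,
                      const struct sg_entry *sg)
{
    assert(sg->len > 0);
    uint64_t first = sg->phys >> PAGE_SHIFT;
    uint64_t last = (sg->phys + sg->len - 1) >> PAGE_SHIFT;
    assert(last < nr_frames);
    for (uint64_t pfn = first; pfn < last; pfn++)
        if (p2m[pfn] + 1 != p2m[pfn + 1])
            return 0;
    return 1;
}

/* Returns the index of the first bad segment, or -1 if the list is fine.
 * A bad segment is reported, since it means a merge routine slipped up. */
static int check_sg_list(const uint32_t *p2m, size_t nr_frames,
                         const struct sg_entry *sg, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!segment_ok(p2m, nr_frames, &sg[i])) {
            fprintf(stderr,
                    "warning: sg[%zu] spans non-contiguous machine pages\n", i);
            return (int)i;
        }
    }
    return -1;
}
```

With guest frames 0, 1, 2 backed by machine frames 7, 8, 3, a segment straddling frames 0 and 1 passes, while one straddling frames 1 and 2 is flagged.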
Looks like this is a very occasional failure, judging from the timestamps between messages. If you make a debug build of Xen then we'll get some info as to why the page transfer is failing.

 -- Keir

> I was running my diff script all night which itself reported no errors,
> but this morning I have the following in dom0's kern.log:
>
> Jul 25 21:53:58 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> [...]
>
> and in dom2:
>
> Jul 25 21:53:58 mail2 kernel: bad buffer on RX ring!(-1)
> [...]
>
> so something funny is going on. i started my diff and ping scripts at
> about 21:20. At least the above error is detected though.
>
> James