thr3ads.net - Xen devel - [Xen-devel] segfault in VM [Jul 2004]

Derek Glidden

2004-Jul-19 05:22 UTC

[Xen-devel] segfault in VM

Maybe related or maybe not, but it was the same VM getting all the 
scheduling time in my previous post.  (SMP Celeron box with 512M of 
RAM, no himem enabled.)

At the time, four VMs were all compiling, with dom0 copying a Linux
source tree from one place to another with rsync.  Everything was copacetic
until I started the big rsync in dom0; within a minute or so, vm2
bombed.  There were no messages on the dom0 console or in the VM other than
the "Segmentation Fault" in the VM during compilation.

However, the Xen console (Xen compiled with debug=y) spits out:

(XEN) (file=x86_32/emulate.c, line=228) Bailing: not a -ve offset into 4GB segment.

at the time of the segmentation fault.

(There are also lots of these, pretty much any time there is heavy I/O
on the machine, all with the same values:)

(XEN) (file=traps.c, line=466) GPF (0004): fc5277a8 -> fc52a294

Any further activity inside vm2 results in more segmentation faults and 
more "Bailing" messages.  The other VMs and dom0 seem to be ok.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"We all enter this world in the    | Support Electronic Freedom
same way: naked; screaming; soaked |        eff.org
in blood. But if you live your     |  anti-dmca.org
life right, that kind of thing     |---------------------------
doesn't have to stop there." -- Dana Gould



-------------------------------------------------------
This SF.Net email is sponsored by BEA Weblogic Workshop
FREE Java Enterprise J2EE developer tools!
Get your free copy of BEA WebLogic Workshop 8.1 today.
ads.osdn.com/?ad_id=4721&alloc_id=10040&op=click
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.sourceforge.net
lists.sourceforge.net/lists/listinfo/xen-devel

James Harper

2004-Jul-19 05:50 UTC

RE: [Xen-devel] segfault in VM

That sounds like the same sort of errors I'm getting, which appeared to
be filesystem corruption. First the corruption starts, then everything you do
causes a segfault, although I've only seen funny things happen in dom0.

In the limited testing I've done, it looks like dom0 by itself is
stable, but crashes start occurring once I start up other domains and work dom0
hard (with the other domains running under light load). I'm running this script
in dom0:

#!/bin/sh
# Compare the two large files forever; print "okay" each time they match.
while true
do
 diff file3 file4 && echo okay
done

where file3 and file4 are files of around 300 MB each, and the VM has 128 MB
of memory with no swap. This ensures that none of the file is cached, so
there's lots of I/O.
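
A variant of the loop above, as a rough sketch only: instead of diffing two files, it takes a checksum of one file on the first read and compares every later read against it, reporting the pass on which the data first changes. The function name verify_reads and its arguments are hypothetical (they do not appear in the thread); cksum is the POSIX utility. As with the original script, the file has to be larger than the domain's memory for the reads to actually hit the disk.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): re-read FILE up to PASSES
# times and compare each read against a checksum taken on the first read.
# Returns 0 if every pass matched, 1 on a mismatch, 2 if FILE is unreadable.
verify_reads() {
    file=$1
    passes=$2
    # Reading from stdin makes cksum print "<crc> <size>" with no file name.
    baseline=$(cksum < "$file") || return 2
    i=1
    while [ "$i" -le "$passes" ]; do
        current=$(cksum < "$file") || return 2
        if [ "$current" != "$baseline" ]; then
            echo "pass $i: checksum changed ($baseline -> $current)"
            return 1
        fi
        i=$((i + 1))
    done
    echo "okay after $passes passes"
    return 0
}
```

For example, `verify_reads file3 1000` would stop at the first pass whose data differs, which gives a rough idea of how long the system runs before corruption shows up.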

It has crashed most readily when I'm running a few other domains and then
start running dom0 out of memory, but nothing is conclusive yet.

I'll let this test keep running for another hour or so (otherwise idle, no
other domains running), then start my running-out-of-memory program.

I wonder if it is a coincidence that we both have SMP boxes... Each of the
domains only sees 1 CPU, so I wouldn't have thought that would be a problem
unless there's a race in Xen itself.

James

Keir Fraser

2004-Jul-19 07:27 UTC

Re: [Xen-devel] segfault in VM

Clearly there's some fairly random memory corruption going on, which
then causes segfaults (if the corruption hits code pages) and
filesystem corruption (if the corruption hits buffer-cache pages).

The "Bailing: not a -ve offset" and "GPF (0004):" messages are almost
certainly just symptoms of executing a corrupted block of code. i.e.,
the bug has already triggered some time ago - probably corrupted a
page of glibc or the kernel.

It would be interesting to see whether or not this is SMP-related.
It's also interesting that someone said they couldn't reproduce
corruption when using 2.6.7 for the non-privileged guest OSes.

 -- Keir

Chris Andrews

2004-Jul-19 08:28 UTC

Re: [Xen-devel] segfault in VM

Keir Fraser wrote:
> Clearly there's some fairly random memory corruption going on, which
> then causes segfaults (if the corruption hits code pages) and
> filesystem corruption (if the corruption hits buffer-cache pages).
>
> The "Bailing: not a -ve offset" and "GPF (0004):" messages are almost
> certainly just symptoms of executing a corrupted block of code. i.e.,
> the bug has already triggered some time ago - probably corrupted a
> page of glibc or the kernel.
>
> It would be interesting to see whether or not this is SMP-related.
> It's also interesting that someone said they couldn't reproduce
> corruption when using 2.6.7 for the non-privileged guest OSes.

I'm seeing this corruption on a single CPU machine, with a single 2.4
guest running but idle. I only ran one 2.6.7 guest, and I didn't give it
any work, but it didn't take any load in the 2.4 guest to provoke
problems.

The machine uses devicemapper, so I'm going to move some partitions
around and see if I still get corruption without it. I can also build
Xen with debug=y and try that, once I've got the disk sorted.


Chris.


Keir Fraser

2004-Jul-19 08:57 UTC

Re: [Xen-devel] segfault in VM

> Keir Fraser wrote:
> > Clearly there's some fairly random memory corruption going on, which
> > then causes segfaults (if the corruption hits code pages) and
> > filesystem corruption (if the corruption hits buffer-cache pages).
> >
> > The "Bailing: not a -ve offset" and "GPF (0004):" messages are almost
> > certainly just symptoms of executing a corrupted block of code. i.e.,
> > the bug has already triggered some time ago - probably corrupted a
> > page of glibc or the kernel.
> >
> > It would be interesting to see whether or not this is SMP-related.
> > It's also interesting that someone said they couldn't reproduce
> > corruption when using 2.6.7 for the non-privileged guest OSes.
>
> I'm seeing this corruption on a single CPU machine, with a single 2.4
> guest running but idle. I only ran one 2.6.7 guest, and I didn't give it
> any work, but it didn't take any load in the 2.4 guest to provoke
> problems.

Do you mean a single 2.4 or 2.6 guest in addition to your 2.4 DOM0?

 -- Keir


Chris Andrews

2004-Jul-19 09:01 UTC

Re: [Xen-devel] segfault in VM

Keir Fraser wrote:
>> Keir Fraser wrote:
>>> Clearly there's some fairly random memory corruption going on, which
>>> then causes segfaults (if the corruption hits code pages) and
>>> filesystem corruption (if the corruption hits buffer-cache pages).
>>>
>>> The "Bailing: not a -ve offset" and "GPF (0004):" messages are almost
>>> certainly just symptoms of executing a corrupted block of code. i.e.,
>>> the bug has already triggered some time ago - probably corrupted a
>>> page of glibc or the kernel.
>>>
>>> It would be interesting to see whether or not this is SMP-related.
>>> It's also interesting that someone said they couldn't reproduce
>>> corruption when using 2.6.7 for the non-privileged guest OSes.
>>
>> I'm seeing this corruption on a single CPU machine, with a single 2.4
>> guest running but idle. I only ran one 2.6.7 guest, and I didn't give it
>> any work, but it didn't take any load in the 2.4 guest to provoke
>> problems.
>
> Do you mean a single 2.4 or 2.6 guest in addition to your 2.4 DOM0?

Yes, that's right. With just the 2.4 domain0 on its own, everything
seems fine.


Chris.


2004-Jul-19 12:48 UTC

Re: [Xen-devel] segfault in VM

On Mon, Jul 19, 2004 at 10:01:54AM +0100, Chris Andrews wrote:
> Keir Fraser wrote:
> >>Keir Fraser wrote:
> > Do you mean a single 2.4 or 2.6 guest in addition to your 2.4 DOM0?
>
> Yes, that's right. With just the 2.4 domain0 on its own, everything
> seems fine.

OK, using an image of dom1 from Chris, I have been able to semi-reliably
cause all sorts of corruption, including Oopses in dom0 and other domains.

This is on 2 different machines, one of which is thought to be at least
semi-reliable.

I first noticed it when doing a bk pull and having BitKeeper decide
that my tree was rather corrupt (in a dom0), but with other domains
running.

Running:

while :; do tar cpf - . | gzip -3vc | cat >/dev/null; done

yuri.org.uk/~murble/boom has some output, with domain0 deciding to try
to access beyond the end of the device a lot, but this could be caused by
random memory corruption.

Trying to build something in dom0 seems a fairly reliable way
of triggering it.

Also, my dom0 only had 48 MB or so of RAM, but plenty of swap.

Before the crash I noticed user programs that allocated lots of memory
in dom0 randomly segfaulting, including apt-get update and apt-get
build-deps.

Again, the oopsen I get were generally rather random, although I
have noticed another possible XenoLinux bug: when you boot with
panic=30 it takes ages for dom0 to reboot, far longer than 30 seconds,
even though after the panic it says rebooting in 30 seconds.

Keir Fraser

2004-Jul-19 13:22 UTC

Re: [Xen-devel] segfault in VM

It strikes me that many people have started seeing this bug all of a
sudden, so it has probably been introduced in the last week. Perhaps
it is worth someone backing off to an older repository version and
seeing whether they can reproduce the problems?

If we can 'binary chop' the changesets to isolate the bad one, it
would be a much easier bug to fix. ;-) Sounds like it would be a
fairly tedious process though...

The first person to complain I think was Jody Belka, who was using
the changeset with comment 'Fairly major fixes to the network frontend
driver...' (2004-07-13 18:24:48). Perhaps backing off to a day before
that would be a sensible place to start?
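
The 'binary chop' idea can be sketched in shell, under some loudly stated assumptions: the function name first_bad, the changeset labels, and the test command are all hypothetical, and nothing here calls BitKeeper. The caller supplies a command that checks out and tests one changeset, exiting 0 if it is good. The sketch also assumes the newest changeset is bad and that once a changeset is bad, every later one is bad too.

```shell
#!/bin/sh
# first_bad TEST_CMD CHANGESET...   (hypothetical helper, not from the thread)
# CHANGESETs are listed oldest first. TEST_CMD is run with one changeset as
# its argument and must exit 0 if that changeset is good, non-zero if it
# shows the corruption. Prints the first bad changeset.
first_bad() {
    test_cmd=$1
    shift
    [ "$#" -ge 1 ] || return 2
    lo=1     # lowest position that could still be the first bad changeset
    hi=$#    # highest such position; the newest changeset is assumed bad
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        eval "candidate=\${$mid}"      # fetch positional parameter number mid
        if "$test_cmd" "$candidate"; then
            lo=$((mid + 1))            # good: the first bad one is later
        else
            hi=$mid                    # bad: the first bad one is here or earlier
        fi
    done
    eval "printf '%s\n' \"\${$lo}\""
}
```

With N changesets this needs about log2(N) build-and-test rounds, which is what makes the otherwise tedious process tractable; each round is still a full rebuild and a stress run long enough to be confident the changeset is good.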

 -- Keir

Derek Glidden

2004-Jul-19 18:52 UTC

Re: [Xen-devel] segfault in VM

On Jul 19, 2004, at 1:50 AM, James Harper wrote:
> where file3 and file4 are around 300mb files, and the vm has 128mb of
> memory with no swap. This ensures that none of the file is cached so
> there's lots of I/O.
>
> When i've seen it crash most readily has been when i'm running a few
> other domains and then start running dom0 out of memory, but nothing
> conclusive yet.
>
> I'll let this test keep running for another hour (otherwise idle, no
> other domains running) or so then start my running-out-of-memory
> program.

Similarly, I can reproduce it reasonably reliably if I wait until all
the VMs are busy with either I/O or high CPU utilization and then
start dom0 doing lots of I/O, either through an rsync or something along
those lines.  If I let the system run for a little while to "prime" it,
so far I think I can pretty much crash it whenever I want.

> I wonder if it is coincidence that we both have smp boxes... each of
> the domains only sees 1 cpu so I wouldn't have thought that would be a
> problem unless there's a race in xen itself.

I have another, single-CPU box that I can play with; I'll try to
get to building and deploying Xen on it tonight and see if it makes any
difference.

Derek Glidden

2004-Jul-19 18:56 UTC

Re: [Xen-devel] segfault in VM

On Jul 19, 2004, at 3:27 AM, Keir Fraser wrote:
>
> Clearly there's some fairly random memory corruption going on, which
> then causes segfaults (if the corruption hits code pages) and
> filesystem corruption (if the corruption hits buffer-cache pages).
>
> The "Bailing: not a -ve offset" and "GPF (0004):" messages are almost
> certainly just symptoms of executing a corrupted block of code. i.e.,
> the bug has already triggered some time ago - probably corrupted a
> page of glibc or the kernel.
>
> It would be interesting to see whether or not this is SMP-related.
> It's also interesting that someone said they couldn't reproduce
> corruption when using 2.6.7 for the non-privileged guest OSes.

I'll be building Xen on a non-SMP box I also have at home tonight, with
any luck.

I'll also be running memtest on the SMP box that's been seeing the
corruption when I get home as well, probably followed by CTCS.  It was
stable for a week or so under reasonably heavy load before I installed
Xen on it, but you never know...

If it passes all the testing, I'll build a 2.6.7 guest kernel and give
that a try.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I think that's what they mean by    |
"nickels a day can feed a child."   |       eff.org
I thought, "How can food be so      | anti-dmca.org
cheap over there?"  It's not, they   |--------------------------
just eat the nickels." -- Peter Nguyen



Derek Glidden

2004-Jul-19 18:58 UTC

Re: [Xen-devel] segfault in VM

On Jul 19, 2004, at 4:28 AM, Chris Andrews wrote:
>
> I'm seeing this corruption on a single CPU machine, with a single 2.4
> guest running but idle. I only ran one 2.6.7 guest, and I didn't give
> it any work, but it didn't take any load in the 2.4 guest to provoke
> problems.

I've not really tried real hard at not loading the VMs or dom0 OS yet.
I've got too much I want to make them do.   :)

> The machine uses devicemapper, so I'm going to move some partitions
> around and see if I still get corruption without it. I can also build
> Xen with debug=y and try that, once I've got the disk sorted.

This box uses dm as well...  Clue or coincidence?  Probably
coincidence...  I've had the segfaults in different VMs and dom0, so I
doubt it's related to any specific LV or disk sector or anything.

Derek Glidden

2004-Jul-19 19:06 UTC

Re: [Xen-devel] segfault in VM

On Jul 19, 2004, at 9:22 AM, Keir Fraser wrote:
>
> The first person to complain I think was Jody Belka, who was using
> the changeset with comment 'Fairly major fixes to the network frontend
> driver...' (2004-07-13 18:24:48). Perhaps backing off to a day before
> that would be a sensible place to start?

I'm either going to blow your theory out of the water or help a lot
because my first "real" build of all the Xen tools & kernel & linux
kernels where I actually booted into a dom0 kernel from Xen was from a
checkout on either the 12th or 13th.  Prior to that I was working out
getting everything built under gentoo and not actually running it.  And
that's what I've been using until I checked out and rebuilt everything
fresh this sunday afternoon and still have the problem as of last
night.  Although a VM will segfault while dom0 seems to panic, it's
probably the same root problem.

Chris Andrews

2004-Jul-19 19:34 UTC

Re: [Xen-devel] segfault in VM

On 19 Jul 2004, at 19:58, Derek Glidden wrote:
>
> On Jul 19, 2004, at 4:28 AM, Chris Andrews wrote:
>
>> The machine uses devicemapper, so I'm going to move some partitions
>> around and see if I still get corruption without it. I can also build
>> Xen with debug=y and try that, once I've got the disk sorted.
>
> This box uses dm as well...  Clue or coincidence?  Probably
> coincidence...  I've had the segfaults in different VMs and dom0, so I
> doubt it's related to any specific LV or disk sector or anything.

I've moved stuff around my machine's disk so I don't need dm and
recompiled without it, and I've seen the same crashes with the guest fs
on a loop device, and with the guest fs on an ordinary disk partition,
so I guess it's not specific to dm.

Chris.



Derek Glidden

2004-Jul-19 23:06 UTC

Re: [Xen-devel] segfault in VM

On Jul 19, 2004, at 2:56 PM, Derek Glidden wrote:
> I'll also be running memtest on the SMP box that's been seeing the
> corruption when I get home as well, probably followed by CTCS.  It was
> stable for a week or so under reasonably heavy load before I installed
> Xen on it, but you never know...

FWIW - memtest ran for a couple of hours with no trouble.

I've booted Xen with "nosmp" and will do the same things I've been
doing to it to make it break and see what happens.

James Harper

2004-Jul-20 00:01 UTC

RE: [Xen-devel] segfault in VM

I'm pretty sure I've seen it earlier than that, but couldn't be certain.
Initially I more or less expected instabilities and so wasn't really
taking much notice.

so I guess my comments above are of absolutely no help at all. :)

I'll be trying a bk pull and build today (under normal linux - 2 cpus
and max memory = faster builds) then verify that I can still make it
crash, then try nosmp, although I've seen a few posts about single cpu
crashes.

james

James Harper

2004-Jul-20 00:04 UTC

RE: [Xen-devel] segfault in VM

I'm not using dm and see lots of crashes.



Derek Glidden

2004-Jul-20 01:01 UTC

Re: [Xen-devel] segfault in VM

On Jul 19, 2004, at 7:06 PM, Derek Glidden wrote:
>
> I've booted Xen with "nosmp" and will do the same things I've been
> doing to it to make it break and see what happens.

hmm.  Running this same box, same Xen kernel, same linux kernel, but
with "nosmp", just creating a domain gives me about two dozen of these:

(XEN) (file=x86_32/emulate.c, line=228) Bailing: not a -ve offset into 4GB segment.
(XEN) (file=x86_32/emulate.c, line=235) !!!! DISALLOWING UNSAFE ACCESS !!!!

But so far, no crashes.

James Harper

2004-Jul-20 01:04 UTC

RE: [Xen-devel] segfault in VM

bk pull only showed 2 patches, neither of which affected kernels so I
didn't bother recompiling.

I have seen an error (shown by my diff script 'compare' or by xend doing
silly things like crashing), by simply starting another domain and
pinging it with something like:

ping -s 1400 -i 0.001 192.168.200.200

(ping -f might do it but I think it goes a bit fast)

That occurred once after about 5 minutes, but then not again for the 10
or so minutes I left it running.

running it out of memory with this code:

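#include <stdio.h>
#include <stdlib.h>
#include <string.h>     /* memset */

int main(void) {
        char *buf;
        int mem = 0;            /* total MB allocated so far */
        int size = 1;           /* MB per allocation */
        unsigned char rnd = rand() & 255;       /* fill byte */

        while (1) {
                /* never freed: the point is to exhaust memory */
                buf = malloc(size * 1024 * 1024);
                if (buf == NULL)
                        break;  /* stop instead of calling memset(NULL) */
                /* touch every page so the memory is really used */
                memset(buf, rnd, size * 1024 * 1024);
                mem += size;
                printf("%d\n", mem);
        }
        return 0;
}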

causes a crash far more quickly. I guess it's possible that those are
two different errors though...

James

Keir Fraser

2004-Jul-20 06:56 UTC

Re: [Xen-devel] segfault in VM

This could be harmless, or indicate memory corruption.

 -- Keir

> 
> On Jul 19, 2004, at 7:06 PM, Derek Glidden wrote:
>
> > I've booted Xen with "nosmp" and will do the same things I've been
> > doing to it to make it break and see what happens.
>
> hmm.  Running this same box, same Xen kernel, same linux kernel, but
> with "nosmp", just creating a domain gives me about two dozen of these:
>
> (XEN) (file=x86_32/emulate.c, line=228) Bailing: not a -ve offset into 4GB segment.
> (XEN) (file=x86_32/emulate.c, line=235) !!!! DISALLOWING UNSAFE ACCESS !!!!
>
> But so far, no crashes.

Keir Fraser

2004-Jul-20 07:59 UTC

Re: [Xen-devel] segfault in VM

I've just checked in a few networking fixes that should make things
rather more robust in low-memory conditions. I suspect there are still
some bugs lurking somewhere, but hopefully this has thinned out the
bugs somewhat.

 -- Keir

James Harper

2004-Jul-20 10:42 UTC

RE: [Xen-devel] segfault in VM

I still get corruption with these latest patches. In this case I had
started 2 domains and was pinging them both fairly hard; I didn't get as
far as running it out of memory.

hth

James

Keir Fraser

2004-Jul-20 10:52 UTC

Re: [Xen-devel] segfault in VM

I've checked in some more fixes that might entirely solve the problems
that everyone has been seeing.

Unfortunately xen.bkbits.net is down and I'm about to leave for
Canada. :-( Hopefully it will be possible to push to bkbits in a few
hours...

The Changesets that will hopefully fix everything are:
   1.1116 04/07/20 11:32:39 kaf24@scramble.cl.cam.ac.uk +2 -0
   More backend driver fixes and robustifying.

   1.1115 04/07/20 11:14:24 kaf24@scramble.cl.cam.ac.uk +0 -0
   Merge scramble.cl.cam.ac.uk:/auto/groups/xeno/BK/xeno.bk
   into scramble.cl.cam.ac.uk:/local/scratch/kaf24/xeno

So keep an eye out for these when you pull --- we're very interested
to hear of further bugs in builds /with/ these changesets. :-)

 -- Keir

Christian Limpach

2004-Jul-20 13:38 UTC

Re: [Xen-devel] segfault in VM

On Tue, Jul 20, 2004 at 11:52:39AM +0100, Keir Fraser wrote:
> I've checked in some more fixes that might entirely solve the problems
> that everyone has been seeing.
>
> Unfortunately xen.bkbits.net is down and I'm about to leave for
> Canada. :-( Hopefully it will be possible to push to bkbits in a few
> hours...
>
> The Changesets that will hopefully fix everything are:
>    1.1116 04/07/20 11:32:39 kaf24@scramble.cl.cam.ac.uk +2 -0
>    More backend driver fixes and robustifying.
>
>    1.1115 04/07/20 11:14:24 kaf24@scramble.cl.cam.ac.uk +0 -0
>    Merge scramble.cl.cam.ac.uk:/auto/groups/xeno/BK/xeno.bk
>    into scramble.cl.cam.ac.uk:/local/scratch/kaf24/xeno
>
> So keep an eye out for these when you pull --- we're very interested
> to hear of further bugs in builds /with/ these changesets. :-)

I've now pushed these changesets to the xen.bkbits repository.

    christian



Derek Glidden

2004-Jul-20 15:51 UTC

Re: [Xen-devel] segfault in VM

On Jul 19, 2004, at 9:01 PM, Derek Glidden wrote:
> (XEN) (file=x86_32/emulate.c, line=228) Bailing: not a -ve offset into 4GB segment.
> (XEN) (file=x86_32/emulate.c, line=235) !!!! DISALLOWING UNSAFE ACCESS !!!!

After pounding on that box pretty much all evening with "nosmp", I
wasn't able to make it crash, either in dom0 or a VM, like I had been
able to do in SMP mode.

I had some weirdness in dom0 when I woke up and checked on it this
morning - a compile had failed that shouldn't have, but there were no
log messages either from Xen or dom0, so I'm not really sure what that
was.

Tonight I'll pull the latest changes and rebuild everything, reboot it
without "nosmp" (make it SMP again) and see what happens.

Chris Andrews

2004-Jul-20 18:10 UTC

Re: [Xen-devel] segfault in VM

Derek Glidden wrote:
>
> On Jul 19, 2004, at 9:01 PM, Derek Glidden wrote:
>
>> (XEN) (file=x86_32/emulate.c, line=228) Bailing: not a -ve offset into 4GB segment.
>> (XEN) (file=x86_32/emulate.c, line=235) !!!! DISALLOWING UNSAFE ACCESS !!!!
>
> After pounding on that box pretty much all evening with "nosmp", I
> wasn't able to make it crash, either in dom0 or a VM, like I had been
> able to do in SMP mode.

Which revision of the code were you running there? I'd like to give it a
go...

> I had some weirdness in dom0 when I woke up and checked on it this
> morning - a compile had failed that shouldn't have, but there were no
> log messages either from Xen or dom0, so I'm not really sure what that was.
>
> Tonight I'll pull the latest changes and rebuild everything, reboot it
> without "nosmp" (make it SMP again) and see what happens.

I've been trying various old revisions as far back as 1.1068[*] (so
far), and I can't find one that doesn't blow up.

My test is to run James' 'compare' script in domain0 on two large
identical files of randomness, and compile various things continuously
in a 2.4.26 domain1. It usually takes only a few minutes to start
showing differences, and if I leave it I'll get segfaults in domain0,
then (with at least one revision) a panic in domain0 and reboot.

Just now I tried the latest code (post Keir's 1.1116/1.1117 csets) and
I'm seeing much the same results.

Hardware is a Dell 1650, single CPU, 1G RAM, aacraid controller. I've
got rid of the devicemapper stuff I was running before, and domain1's
root is on an ordinary disk partition.

Chris.

[*] is this a suitably precise way of specifying revision? 1.1068 is 
based on the list from:
xen.bkbits.net:8080/xeno-unstable.bk/ChangeSet@-2w?nav=index.html

James Harper

2004-Jul-21 01:14 UTC

RE: [Xen-devel] segfault in VM

I downloaded these (from a tgz that Keir had given me a link to as bk was
down - I assume it's identical to his latest fixes) and started my tests
running and went to bed, but it looks like I got errors within a very
short time.

The tests I was running were my 'compare' script and pinging the two
domains I had running with

ping -q -i 0.01 -s 1400 <ip address>

Lots of oopses in the logs, most are probably as a result of the corruption
and not indicative of the cause. They look similar to Jody's dump so I
won't bother sending them unless someone thinks they might be useful.

btw, can the install be modified to give us a System.map-2.4.26-xen[0U] in
/boot? ksymoops would be much happier.

James

Christian Limpach

2004-Jul-21 10:12 UTC

Re: [Xen-devel] segfault in VM

On Wed, Jul 21, 2004 at 11:14:48AM +1000, James Harper wrote:
> btw, can the install be modified to give us a System.map-2.4.26-xen[0U]
> in /boot? ksymoops would be much happier.

done, the install target will now install the System.map along with
the kernel and config file.

    christian




Keir Fraser

2004-Jul-21 13:30 UTC

Re: [Xen-devel] segfault in VM

Could someone try to isolate this to either the network backend driver
or the blkdev backend driver?

The best way to do this is to disable the frontend drivers so that
they never try to connect to the backend driver...

To disable networking:
Edit arch/xen/drivers/netif/frontend/main.c. Change netif_init() to
always 'return 0;'.

To disable block devices:
Edit arch/xen/drivers/blkif/frontend/main.c. Change xlblk_init() to
always 'return 0;'.

Oh yes -- the 2.4 sparse tree no longer contains the net frontend
driver - you'll find the build tree symlinks to
linux-2.6.7-xen-sparse/drivers/xen/net/network.c. So you might want to
edit that instead...

Obviously, if you disable blkdevs you'll need to boot off a ramdisk
or via a networked mount. :-)
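
For anyone following along, the 'always return 0' change can be pictured with the small sketch below. It is only an illustration under stated assumptions, not the Xen source: frontend_init_sketch(), connect_to_backend(), frontend_disabled and connect_attempts are all made-up names standing in for netif_init() / xlblk_init() and whatever they do to set up a device channel.

```c
#include <assert.h>

/* Hypothetical flag: 1 mimics the debugging edit ("always return 0"). */
int frontend_disabled = 1;

/* Hypothetical counter of attempts to set up a device channel. */
int connect_attempts = 0;

/* Hypothetical stand-in for the real channel setup code. */
int connect_to_backend(void)
{
    connect_attempts++;
    return 0;
}

/* Hypothetical stand-in for netif_init() / xlblk_init(). */
int frontend_init_sketch(void)
{
    if (frontend_disabled)
        return 0;   /* report success, never touch the backend */
    return connect_to_backend();
}

/* Sanity check for a disabled frontend: no channel was ever requested. */
void check_frontend_stayed_quiet(void)
{
    assert(connect_attempts == 0);
}
```

The point of returning 0 rather than an error is that the rest of the guest boot carries on as if the driver had loaded; it simply never asks the backend for anything.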

 Cheers,
 Keir

> I downloaded these (from a tgz that Keir had given me a link to as bk was
> down - I assume it's identical to his latest fixes) and started my
> tests running and went to bed, but it looks like I got errors within a very
> short time.
> The tests I was running were my 'compare' script and
> pinging the two domains I had running with
> ping -q -i 0.01 -s 1400 <ip address>
> 
> Lots of oopses in the logs, most are probably as a result of the corruption
> and not indicative of the cause. They look similar to Jody's dump so I
> won't bother sending them unless someone thinks they might be useful.
> 
> btw, can the install be modified to give us a System.map-2.4.26-xen[0U] in
> /boot? ksymoops would be much happier.
> 
> James


James Harper

2004-Jul-21 13:47 UTC

RE: [Xen-devel] segfault in VM

i'll try this out tomorrow morning (too late tonight - need sleep!)

Keir Fraser

2004-Jul-21 14:17 UTC

Re: [Xen-devel] segfault in VM

That would be extremely helpful! If it turns out to be the net backend
(probably most likely, although I guess it may not be a backend
problem at all, which would be harder to debug), then we can isolate
it to the receive or transmit path as follows:

To disable the receive path for guest OSes:
Edit netif_be_start_xmit in arch/xen/drivers/netif/backend/main.c to
always 'goto drop;'.

To disable the transmit path for guest OSes:
Edit net_tx_action in arch/xen/drivers/netif/backend/main.c. After the
call to netif_schedule_work(), add:
  make_tx_response(netif, txreq.id, NETIF_RSP_OKAY);
  netif_put(netif);
  continue;

With one half of the network path disabled, to load up the remaining
direction you'll need to flood ping from an external machine to the
guest OS (when you disable the guest's transmit path) or flood ping
out from the guest (when you disable its rx path). I guess in both
cases you'll need a broadcast ping (yuk!) since ARP won't work
(needs both tx and rx).
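
The two short-circuits above can be sketched roughly as follows. This is a toy model under stated assumptions, not the backend code: struct tx_request, struct counters, tx_action_sketch() and rx_start_xmit_sketch() are invented for illustration, and the counters stand in for the real response and packet handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request and bookkeeping types, for illustration only. */
struct tx_request { int id; };

struct counters {
    int responses_sent;     /* acks returned to the guest */
    int packets_processed;  /* packets that reached the real data path */
    int packets_dropped;    /* rx packets discarded before delivery */
    int last_acked_id;      /* id of the most recent request acked */
};

/* Guest transmit path. With short_circuit set, every request is
 * acknowledged as OK and skipped, so the packet itself is never
 * looked at (the "respond, put, continue" edit). */
void tx_action_sketch(const struct tx_request *reqs, size_t n,
                      int short_circuit, struct counters *c)
{
    assert(c != NULL);
    for (size_t i = 0; i < n; i++) {
        c->responses_sent++;
        c->last_acked_id = reqs[i].id;
        if (short_circuit)
            continue;
        c->packets_processed++;
    }
}

/* Guest receive path. With always_drop set, the packet is discarded
 * immediately (the "always goto drop" edit) but 0 is still returned,
 * as if the packet had been consumed. */
int rx_start_xmit_sketch(int always_drop, struct counters *c)
{
    assert(c != NULL);
    if (always_drop)
        goto drop;
    c->packets_processed++;
    return 0;
drop:
    c->packets_dropped++;
    return 0;
}
```

In both cases the sender still sees a normal completion, which is what lets the other direction be loaded up without the disabled half stalling.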

 -- Keir

> i'll try this out tomorrow morning (too late tonight - need sleep!)

Derek Glidden

2004-Jul-21 23:39 UTC

Re: [Xen-devel] segfault in VM

FWIW: after coming home last night from work, dom0 crashed right away 
as soon as I logged in.

I rebooted, repaired, and checked out and rebuilt everything and so 
far, so good.  It hasn't generated those same "Bailing" messages when I
create a domain at least.

If I can keep everything up and running tonight, I'll start hammering 
on them using the compare/ping thing and see what I can make break.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"We all enter this world in the    | Support Electronic Freedom
same way: naked; screaming; soaked |        eff.org
in blood. But if you live your     |  anti-dmca.org
life right, that kind of thing     |---------------------------
doesn't have to stop there." -- Dana Gould




Derek Glidden

2004-Jul-22 01:48 UTC

Re: [Xen-devel] segfault in VM

On Jul 21, 2004, at 9:30 AM, Keir Fraser wrote:
>
> Could someone try to isolate this to either the network backend driver
> or the blkdev backend driver?
>
> The best way to do this is to disable the frontend drivers so that
> they never try to connect to the backend driver...

I'll give this a go as well, but, is this for dom0 or domU kernels?

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I think that's what they mean by   |
"nickels a day can feed a child."   |       eff.org
I thought, "How can food be so      | anti-dmca.org
cheap over there?"  It's not, they  |--------------------------
just eat the nickels." -- Peter Nguyen




Keir Fraser

2004-Jul-22 01:54 UTC

Re: [Xen-devel] segfault in VM

> 
> On Jul 21, 2004, at 9:30 AM, Keir Fraser wrote:
> 
> >
> > Could someone try to isolate this to either the network backend driver
> > or the blkdev backend driver?
> >
> > The best way to do this is to disable the frontend drivers so that
> > they never try to connect to the backend driver...
> 
> I'll give this a go as well, but, is this for dom0 or domU kernels?

It's modifying the frontend drivers in the domU kernel so that the
data paths in the dom0 backend drivers do not get executed.

i.e., it's the domU kernel that needs recompiling.

 -- Keir



James Harper

2004-Jul-22 01:57 UTC

RE: [Xen-devel] segfault in VM

i'm building this now, and am just thinking about how to test this... I
was using a ping as my test mechanism. I guess i'll do lots of block
device copies. I guess this lends weight to your thoughts that it probably is a
net problem and not a block problem.

Instead of changing the source code to disable the net stuff, would it work if I
just specified 'nics=0' or is some part of the net subsystem
still activated? I'll test this too anyway.

In order to test disabling send or receive, this might be a bit trickier than
you first make out. Send-only should be easy enough, just start another domain
and then ping it (a manual arp table entry should alleviate the need to
broadcast). Receive-only will be trickier. How do you get a domain to send to it?
This problem of course assumes that corruption is not limited to the domain...
if it is limited to the domain then you should be able to have a send/receive
domain and ignore crashes in there, just focus on the crashes in the
receive-only domain.

i'm almost confused, but am about to start testing - firstly with no
network.

James


Keir Fraser

2004-Jul-22 02:03 UTC

Re: [Xen-devel] segfault in VM

> i'm building this now, and am just thinking about how to test
> this... I was using a ping as my test mechanism. I guess i'll do lots
> of block device copies. I guess this lends weight to your thoughts that it
> probably is a net problem and not a block problem.
> 
> Instead of changing the source code to disable the net stuff, would it work
> if I just specified 'nics=0' or is some part of the net
> subsystem still activated? I'll test this too anyway.

I think the source will need to be changed. In any case, it's a
trivial change and then we can be certain that no device channel is
being set up.

> In order to test disabling send or receive, this might be a bit trickier
> than you first make out. Send-only should be easy enough, just start another
> domain and then ping it (a manual arp table entry should alleviate the need to
> broadcast). Receive-only will be tricker. How do you get a domain to send to it?
> This problem of course assumes that corruption is not limited to the domain...
> if it is limited to the domain then you should be able to have a send/receive
> domain and ignore crashes in there, just focus on the crashes in the
> receive-only domain.

That's the reason for the broadcast ping. Unfortunately I'm not sure
how useful that will turn out to be -- e.g., we may just end up hosing
DOM0.

> i'm almost confused, but am about to start testing - firstly with
> no network.

Stage 1 (isolating blkdev and network) shouldn't be too
hard. Basically we're ensuring the data paths in the backend drivers
do not get executed -- they will only ever execute if there is a
device channel set up to a frontend in another guest, so disabling the
frontend drivers ensures this.

 -- Keir


Derek Glidden

2004-Jul-22 02:39 UTC

Re: [Xen-devel] segfault in VM

On Jul 21, 2004, at 9:54 PM, Keir Fraser wrote:
> It's modifying the frontend drivers in the domU kernel so that the
> data paths in the dom0 backend drivers do not get executed.
>
> i.e., it's the domU kernel that needs recompiling.

Got it.  domU kernel recompiled and now running large amounts of block 
i/o while dom0 gets pung and also does large amounts of block i/o.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I think that's what they mean by   |
"nickels a day can feed a child."   |       eff.org
I thought, "How can food be so      | anti-dmca.org
cheap over there?"  It's not, they  |--------------------------
just eat the nickels." -- Peter Nguyen




James Harper

2004-Jul-22 02:48 UTC

RE: [Xen-devel] segfault in VM

As a first test I have just disabled networking via nics=0 in the config, and
am running this script in dom1:

#!/bin/sh
while [ 1 = 1 ]
do
  dd if=/dev/sda1 of=/dev/null bs=1024 count=128K &
  dd if=/dev/sda1 of=/dev/null bs=1024 skip=256K count=256K
done

it tells me 'ioctl 801c6d02 not supported by XL blkif' but
that doesn't seem to matter. Anyway, there are no crashes so far so
i'm thinking at this stage that the block interface stuff is probably
fine and I should now concentrate on the network. Disabling the block stuff will
be a huge hassle at this stage so i'll have to let it go for the
moment.

I think i need a crash course in how all this hangs together before I can
understand what i'm testing... My understanding is as follows:

packets sent to dom0.vif1.0 appear at dom1.eth0.
packets sent to dom1.eth0 appear at dom0.vif1.0.

and that's about it. Are they symmetrical? Is the transmit code for
dom0.vif1.0 the same as the transmit code for dom1.eth0? Ditto for receive?
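
One way to picture the two directions, going only by Keir's earlier note that netif_be_start_xmit is the guest's receive path and net_tx_action is the guest's transmit path in the backend, is the lookup below. It is a sketch of that reading of the thread, not a statement about how the drivers are actually structured: the enum, the function backend_handler_for() and the use of the interface names as plain strings are all assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

/* Which backend routine (per the earlier description) sees the packet. */
enum handler {
    BACKEND_TX_ACTION,   /* guest transmit: net_tx_action */
    BACKEND_START_XMIT,  /* guest receive: netif_be_start_xmit */
    UNKNOWN_PATH
};

/* Hypothetical helper: map "interface the packet was sent to" onto the
 * backend routine that would carry it across to the other side. */
enum handler backend_handler_for(const char *sent_to)
{
    assert(sent_to != NULL);
    if (strcmp(sent_to, "dom1.eth0") == 0)
        return BACKEND_TX_ACTION;    /* ends up at dom0.vif1.0 */
    if (strcmp(sent_to, "dom0.vif1.0") == 0)
        return BACKEND_START_XMIT;   /* ends up at dom1.eth0 */
    return UNKNOWN_PATH;
}
```

On that reading the two directions go through different backend routines, so a bug could sit in one and not the other, which is why testing them separately is worth the trouble.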

James




From: Keir Fraser
Sent: Thu 22/07/2004 12:03 PM
To: James Harper
Cc: Keir Fraser; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

> i'm building this now, and am just thinking about how to test
> this... I was using a ping as my test mechanism. I guess i'll do lots
> of block device copies. I guess this lends weight to your thoughts that it
> probably is a net problem and not a block problem.
> 
> Instead of changing the source code to disable the net stuff, would it work
> if I just specified 'nics=0' or is some part of the net
> subsystem still activated? I'll test this too anyway.

I think the source will need to be changed. In any case, it's a
trivial change and then we can be certain that no device channel is
being set up.

> In order to test disabling send or receive, this might be a bit trickier
> than you first make out. Send-only should be easy enough, just start another
> domain and then ping it (a manual arp table entry should alleviate the need to
> broadcast). Receive-only will be tricker. How do you get a domain to send to it?
> This problem of course assumes that corruption is not limited to the domain...
> if it is limited to the domain then you should be able to have a send/receive
> domain and ignore crashes in there, just focus on the crashes in the
> receive-only domain.

That's the reason for the broadcast ping. Unfortunately I'm not sure
how useful that will turn out to be -- e.g., we may just end up hosing
DOM0.

> i'm almost confused, but am about to start testing - firstly with
> no network.

Stage 1 (isolating blkdev and network) shouldn't be too
hard. Basically we're ensuring the data paths in the backend drivers
do not get executed -- they will only ever execute if there is a
device channel set up to a frontend in another guest, so disabling the
frontend drivers ensures this.

 -- Keir

> James
> 
> 
> From: Keir Fraser
> Sent: Wed 21/07/2004 11:30 PM
> To: James Harper
> Cc: Keir Fraser; xen-devel@lists.sourceforge.net
> Subject: Re: [Xen-devel] segfault in VM
> 
> 
> Could someone try to isolate this to either the network backend driver
> or the blkdev backend driver?
> 
> The best way to do this is to disable the frontend drivers so that
> they never try to connect to the backend driver...
>
> To disable networking:
> Edit arch/xen/drivers/netif/frontend/main.c. Change netif_init() to
> always 'return 0;'.
>
> To disable block devices:
> Edit arch/xen/drivers/blkif/frontend/main.c. Change xlblk_init() to
> always 'return 0;'.
>
> Oh yes -- the 2.4 sparse tree no longer contains the net frontend
> driver - you'll find the build tree symlinks to
> linux-2.6.7-xen-sparse/drivers/xen/net/network.c. So you might want to
> edit that instead...
>
> Obviously, if you disable blkdevs you'll need to boot off a ramdisk
> or via a networked mount. :-)
> 
>  Cheers,
>  Keir
> 
> 
> > I downloaded these (from a tgz that Keir had given me a link to as bk
> > was down - I assume it's identical to his latest fixes) and started my
> > tests running and went to bed, but it looks like I got errors within a
> > very short time.
> > The tests I was running were my 'compare' script and pinging the two
> > domains I had running with
> > ping -q -i 0.01 -s 1400 <ip address>
> >
> > Lots of oopses in the logs, most are probably as a result of the
> > corruption and not indicative of the cause. They look similar to
> > Jody's dump so I won't bother sending them unless someone thinks they
> > might be useful.
> >
> > btw, can the install be modified to give us a
> > System.map-2.4.26-xen[0U] in /boot? ksymoops would be much happier.
> >
> > James

Keir Fraser

2004-Jul-22 02:56 UTC

Re: [Xen-devel] segfault in VM

> As a first test I have just disabled networking via nics=0 in the config,
> and running this script in dom1:
> #!/bin/sh
> while [ 1 = 1 ]
> do
>   dd if=/dev/sda1 of=/dev/null bs=1024 count=128K &
>   dd if=/dev/sda1 of=/dev/null bs=1024 skip=256K count=256K
> done
> 
> it tells me 'ioctl 801c6d02 not supported by XL blkif'
> but that doesn't seem to matter. Anyway, there are no crashes so far so
> I'm thinking at this stage that the block interface stuff is probably
> fine and I should now concentrate on the network. Disabling the block
> stuff will be a huge hassle at this stage so I'll have to let it go for
> the moment.

It does seem more likely that the network backend driver is to blame
-- it's considerably more complicated than the blkdev driver.

> I think I need a crash course in how all this hangs together before I
> can understand what I'm testing... My understanding is as follows:
>
> packets sent to dom0.vif1.0 appear at dom1.eth0.
> packets sent to dom1.eth0 appear at dom0.vif1.0.

Yes, it's basically a point-to-point link. The transmit side on each
interface is directly linked to the receive side on the other.

> and that's about it. Are they symmetrical? Is the transmit code
> for dom0.vif1.0 the same as the transmit code for dom1.eth0? Ditto for
> receive?

No. dom1.eth0 is implemented by the frontend driver
arch/xen/drivers/netif/frontend/main.c
dom0.vif* is implemented by arch/xen/drivers/netif/backend/main.c

So they look symmetric to users, but the implementation is not
symmetric.
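
The user-visible symmetry of the point-to-point link can be pictured with
a toy model. This is only an illustrative sketch and not the driver code:
struct endpoint, link_endpoints() and endpoint_send() are hypothetical
names, and real packets are reduced to counters. The frontend and backend
implementations differ, as noted above; the model only shows what a user
of either interface observes.

```c
#include <stddef.h>

/* Hypothetical model of one end of a virtual link, for example
 * dom0.vif1.0 or dom1.eth0. Only packet counters are tracked. */
struct endpoint {
    const char *name;
    struct endpoint *peer;      /* the other end of the link */
    unsigned long tx_packets;
    unsigned long rx_packets;
};

/* Join two endpoints so that each one's transmit side feeds the
 * other's receive side. */
static void link_endpoints(struct endpoint *a, struct endpoint *b)
{
    a->peer = b;
    b->peer = a;
}

/* Transmit one packet: it shows up on the peer's receive side.
 * Returns 0 on success, -1 if the endpoint is not linked. */
static int endpoint_send(struct endpoint *from)
{
    if (from->peer == NULL)
        return -1;
    from->tx_packets++;
    from->peer->rx_packets++;
    return 0;
}
```

Sending on one endpoint bumps its own tx counter and the peer's rx
counter, which matches the counter behaviour one would expect to see with
ifconfig on both sides of a healthy link.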

 -- Keir


James Harper

2004-Jul-22 03:49 UTC

RE: [Xen-devel] segfault in VM

Okay, I have made the following change in dom0:

To disable the transmit path for guest OSes:
Edit net_tx_action in arch/xen/drivers/netif/backend/main.c. After the
call to netif_schedule_work(), add:
  make_tx_response(netif, txreq.id, NETIF_RSP_OKAY);
  netif_put(netif);
  continue;

Compiled and rebooted with the new kernel. Booted dom1, removed vif1.0 from
the bridge, gave it its own IP address, added a static arp entry and pinged
away. I could see the packet counters for dom0 and dom1 climbing rapidly,
indicating that dom0 was sending packets and dom1 was receiving packets, but
that a packet sent by dom1 was unable to reach dom0 again. I got the same
sort of crashes after about 10 minutes.

I'm now testing the other half.

James

James Harper

2004-Jul-22 04:36 UTC

RE: [Xen-devel] segfault in VM

At this stage, it looks like disabling the receive path for the guest OS
(i.e. making netif_be_start_xmit always 'goto drop') means that I can ping
from the guest OS all I like with no crashes. I hope that's the right way
around to do it...

I'm just looking at that procedure. How is the ring actually managed -
what do all the _prod and _cons variables actually represent? And how is
synchronisation handled between the domains? I notice there is no spinlock
in there; is this done by the calling function?

james




From: Keir Fraser
Sent: Thu 22/07/2004 12:17 AM
To: James Harper
Cc: Keir Fraser; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM


That would be extremely helpful! If it turns out to be the net backend
(probably most likely, although I guess it may not be a backend
problem at all, which would be harder to debug), then we can isolate
it to the receive or transmit path as follows:

To disable the receive path for guest OSes:
Edit netif_be_start_xmit in arch/xen/drivers/netif/backend/main.c to
always 'goto drop;'.

To disable the transmit path for guest OSes:
Edit net_tx_action in arch/xen/drivers/netif/backend/main.c. After the
call to netif_schedule_work(), add:
  make_tx_response(netif, txreq.id, NETIF_RSP_OKAY);
  netif_put(netif);
  continue;

With one half of the network path disabled, to load up the remaining
direction you'll need to flood ping from an external machine to the
guest OS (when you disable the guest's transmit path) or flood ping
out from the guest (when you disable its rx path). I guess in both
cases you'll need a broadcast ping (yuk!) since ARP won't work (needs
both tx and rx).

 -- Keir
> I'll try this out tomorrow morning (too late tonight - need sleep!)

Derek Glidden

2004-Jul-22 05:28 UTC

Re: [Xen-devel] segfault in VM

On Jul 21, 2004, at 9:30 AM, Keir Fraser wrote:
> To disable networking:
> Edit arch/xen/drivers/netif/frontend/main.c. Change netif_init() to
> always 'return 0;'.

Changing netif_init() so that all the returns are "return 0;" doesn't
seem to do much: the VMs still get network access, everything looks
and acts normal, and there's still corruption after a few minutes of
stress testing and network traffic.  (Although it does seem to be
network related.  It ran for a while with no network traffic and no
corruption, and within a minute or two of starting the pings, it
started to flake out.)

Changing netif_init() so that it immediately does "return 0" runs for a
good long time with no corruption, unless you try to send data to one
of the vifs, which makes dom0 blow up real good. Running it for a while
with just block I/O and ping traffic to dom0 didn't result in any
obvious corruption while running, but I did get these messages when I
rebooted:

(XEN) (file=/opt/src/xeno/xeno-unstable.bk/xen/include/asm/mm.h, line=215) Unexpected type (saw c0000000 != exp e0000000) for pfn 000032db
(XEN) DOM0: (file=memory.c, line=249) Bad page type for pfn 000032db (d0000005)
(XEN) (file=traps.c, line=466) GPF (0004): fc5277c8 -> fc52a094
Kernel panic: Failed to execute MMU updates
  (XEN) Domain 0 shutdown: rebooting machine!

which I've only seen on a reboot when there has been corruption.

Disabling the receive path seems to still let packets through and shows
signs of corruption, even with very little network traffic.  I'm not
sure if that's because I have everything doing NAT instead of bridging,
although that doesn't really make sense since it's still the same
interface and the code looks like it should simply drop the packets...

It's getting late so I'll have to work on disabling the transmit path
and working out how to go about testing the blockdev backend tomorrow.

I'll let it run for a while without even up'ing the vifs on the dom0
side, which should preclude any network traffic at all getting to the
VMs and see if there's any corruption going on running overnight or
longer.

Can anyone else corrupt their systems with no network traffic?

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I think that's what they mean by   |
"nickels a day can feed a child."   |       eff.org
I thought, "How can food be so      | anti-dmca.org
cheap over there?"  It's not, they  |--------------------------
just eat the nickels." -- Peter Nguyen

Keir Fraser

2004-Jul-22 11:22 UTC

Re: [Xen-devel] segfault in VM

> At this stage, it looks like disabling the receive path for the
> guest os eg netif_be_start_xmit 'goto drop' means that I can ping
> from the guest OS all i like with no crashes. I hope that's the
> right way around to do it...

Yep, an unconditional 'goto drop;' at the start of netif_be_start_xmit
will prevent the guest from ever receiving packets.
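
As a rough illustration of that experiment, the shape of the change is
sketched below. This is not the real netif_be_start_xmit: struct packet,
struct if_stats, be_start_xmit_sketch() and the DISABLE_GUEST_RX switch
are all hypothetical stand-ins, and the "normal path" is a placeholder.
The only point is that a jump to the drop label at the very top means
none of the forwarding code can run.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's packet buffer and the
 * per-interface statistics. */
struct packet { size_t len; };
struct if_stats { unsigned long forwarded; unsigned long dropped; };

/* Set to 1 to short-circuit the handler, as in the experiment above. */
#ifndef DISABLE_GUEST_RX
#define DISABLE_GUEST_RX 1
#endif

/* Sketch of a start_xmit-style handler. It takes ownership of pkt and
 * always returns 0 ("packet consumed"). */
static int be_start_xmit_sketch(struct packet *pkt, struct if_stats *stats)
{
    if (DISABLE_GUEST_RX)
        goto drop;              /* unconditional: nothing below runs */

    /* Normal path (placeholder): hand the packet to the guest. */
    stats->forwarded++;
    free(pkt);
    return 0;

drop:
    stats->dropped++;
    free(pkt);
    return 0;
}
```

With the switch enabled every packet is counted as dropped and freed, so
the guest's receive ring is never touched while its transmit path keeps
working.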

How did you send packets from the guest -- did you poke an ARP
entry, or send broadcast packets?

Anyway - currently sounds like the bug resides in the most complex
half of the most complex driver. Who'd've thought it? ;-)

> I'm just looking at that procedure,
> how is the ring actually managed - what do all the _prod and _cons
> variables actually represent? And how is synchronisation handled
> between the domains? i notice there is no spinlock in there, is this
> done by the calling function?

Synchronisation between backend and frontend is lock-free --- for each
ring one guy is producer and the other is consumer so they each update
a disjoint set of ring indexes.

Within the backend, there is implicit per-interface locking on
netif_be_start_xmit so we'll never reenter for the same
interface. Then when we batch stuff up for a tasklet we're still okay
because tasklets are guaranteed non-reentrant also.
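
The producer/consumer split described above is the usual
single-producer/single-consumer ring technique. The sketch below shows
the general idea only; it is not the layout of the Xen netif rings, and
struct ring, ring_put() and ring_get() are hypothetical names. The prod
index counts entries placed on the ring and is written only by the
producer; the cons index counts entries taken off and is written only by
the consumer. Because each side writes only its own index, no lock is
needed between them.

```c
#include <stdint.h>

#define RING_SIZE 8u                    /* must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

/* Hypothetical single-producer/single-consumer ring. In a real shared
 * ring the indexes live in memory shared between domains and need
 * memory barriers; those are only marked by comments in this
 * single-threaded sketch. */
struct ring {
    volatile uint32_t prod;             /* written only by the producer */
    volatile uint32_t cons;             /* written only by the consumer */
    int slots[RING_SIZE];
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int ring_put(struct ring *r, int value)
{
    if (r->prod - r->cons == RING_SIZE)
        return -1;
    r->slots[r->prod & RING_MASK] = value;
    /* a write barrier would go here, before publishing the slot */
    r->prod = r->prod + 1;
    return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
static int ring_get(struct ring *r, int *value)
{
    if (r->cons == r->prod)
        return -1;
    /* a read barrier would go here, before reading the slot */
    *value = r->slots[r->cons & RING_MASK];
    r->cons = r->cons + 1;
    return 0;
}
```

The indexes are free-running counters that are masked only when used to
address a slot, so "full" and "empty" can be told apart without wasting
an entry.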

 -- Keir


Keir Fraser

2004-Jul-22 11:54 UTC

Re: [Xen-devel] segfault in VM

> Okay, I have made the following change in dom0:
> 
> To disable the transmit path for guest OSes:
> Edit net_tx_action in arch/xen/drivers/netif/backend/main.c. After the
> call to netif_schedule_work(), add:
>   make_tx_response(netif, txreq.id, NETIF_RSP_OKAY);
>   netif_put(netif);
>   continue;
> 
> compiled and rebooted with the new kernel. booted dom1, removed vif1.0
> from the bridge, gave it its own ip address, added a static arp entry
> and pinged away. I could see the packet counters for dom0 and dom1
> climbing rapidly indicating that dom0 was sending packets, dom1 was
> receiving packets, but that a packet sent by dom1 was unable to reach
> dom0 again. I got the same sort of crashes after about 10 minutes.

If you do a test with DPRINTK enabled in
linux-2.4.26-xen-sparse/arch/xen/drivers/netif/backend/common.h
and with debugging enabled in Xen ('debug=y make')
then you may get some useful debugging out of the machine when it all
goes horribly wrong. e.g., perhaps something is failing apparently
spuriously... one example would be that a page reassignment (from dom0
to the other guest) is failing for some weird reason.

If we can get some debugging out when things first go wrong, that
would be very useful indeed.
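
For readers unfamiliar with this style of switch, a compile-time debug
macro generally looks something like the sketch below. This is an
assumption-laden illustration, not the contents of common.h: the
NETIF_DEBUG_SKETCH switch and the debug_print() helper are hypothetical,
and the kernel version would use printk rather than stderr.

```c
#include <stdarg.h>
#include <stdio.h>

/* Set to 0 to compile the debug output away. */
#ifndef NETIF_DEBUG_SKETCH
#define NETIF_DEBUG_SKETCH 1
#endif

/* Print a message prefixed with its source location. Returns the number
 * of characters written, assuming both writes succeed. */
static int debug_print(const char *file, int line, const char *fmt, ...)
{
    va_list ap;
    int n = fprintf(stderr, "(file=%s, line=%d) ", file, line);

    va_start(ap, fmt);
    n += vfprintf(stderr, fmt, ap);
    va_end(ap);
    return n;
}

#if NETIF_DEBUG_SKETCH
#define DPRINTK(...) debug_print(__FILE__, __LINE__, __VA_ARGS__)
#else
#define DPRINTK(...) 0
#endif
```

A call such as DPRINTK("page reassignment failed\n") placed on an error
path then costs nothing when the switch is off and pinpoints the failing
line when it is on.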

 Thanks,
 Keir


James Harper

2004-Jul-22 12:53 UTC

RE: [Xen-devel] segfault in VM

I am trying this now. Within a few seconds of starting the flood ping, dom1
rebooted. No messages in the logs to give any hint as to why, though. Trying
again, I didn't get anything useful either once I started getting
noticeable corruption.

Just on the subject of page reassignment, I'm trying to figure out what
the code is doing.

In netif_be_start_xmit, there is a check to make sure that the packet is
entirely on 1 page. What happens if the packet is too big for one page, or if
there is other data on the same page? (It's all black magic to me at
the moment!)

James

Keir Fraser

2004-Jul-22 13:09 UTC

Re: [Xen-devel] segfault in VM

> I am trying this now. Within a few seconds of starting the flood ping,
> dom1 rebooted. no messages in the logs to give any hint as to why
> though. Trying again and I didn't get anything useful either once I
> started getting noticeable corruption.

Hmmm.... I guess maybe there's a race somewhere, rather than the
problem being a broken error-handling path. Which is a shame, as it's
bound to be harder to track down. :-(

> just on the subject of page reassignment, I'm trying to figure out
> what the code is doing.
>
> in netif_be_start_xmit, there is a check to make sure that the packet
> is entirely on 1 page. What happens if the packet is too big for one
> page, or if there is other data on the same page? (it's all black
> magic to me at the moment!)

Unless you're using jumbo Ethernet frames (which you're almost
certainly not) then the packet will certainly fit in a page. We also
check that the packet buffer is at least half a page in size --- since
the slab allocator allocates in powers-of-two, that means the packet
buffer must actually be a full aligned page in size.
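
The arithmetic behind those two checks can be sketched as follows. This
is not the code from netif_be_start_xmit: within_one_page() and
pow2_size_class() are hypothetical helpers, 4 KiB pages are assumed, and
the power-of-two size classes are taken from the description above
rather than from the allocator source.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096u          /* assumed 4 KiB pages (x86) */

/* Nonzero if the len bytes starting at addr do not cross a page
 * boundary. */
static int within_one_page(uintptr_t addr, size_t len)
{
    uintptr_t mask = ~(uintptr_t)(PAGE_SIZE_SKETCH - 1u);

    if (len == 0)
        return 1;
    return (addr & mask) == ((addr + len - 1) & mask);
}

/* Smallest power of two >= size (size must be > 0), modelling an
 * allocator whose size classes are powers of two. */
static size_t pow2_size_class(size_t size)
{
    size_t cls = 1;

    while (cls < size)
        cls <<= 1;
    return cls;
}
```

Under these assumptions a 1500-byte request lands in the 2048-byte class
and could share its page with another buffer, while any request of more
than half a page (2049 bytes and up) is rounded to the full 4096-byte
class.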

If our checks are insufficient and a few packets that are sharing
their data page are getting thru, for example, then we would be pretty
screwed! This might be another area to explore -- whether there are a
few skbuffs coming thru now and then that are of a layout that we 
mishandle. 
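
As a rough illustration of the kind of layout check being discussed, the
sketch below models the decision "can this buffer's page be handed over
as-is, or must the data be copied first?". It is a toy, not the code in
main.c: struct toy_skb and toy_needs_copy are hypothetical, 4 KiB pages are
assumed, and the fields only stand in for the sk_buff properties mentioned in
this thread (shared header, cloned data, buffer size and page placement).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define TOY_PAGE_SIZE ((uintptr_t)4096)

/* Hypothetical stand-in for the few sk_buff properties that matter here. */
struct toy_skb {
    int       users;     /* holders of the skb header (1 = not shared) */
    int       data_refs; /* skbs pointing at the same data (1 = not cloned) */
    uintptr_t head;      /* address of the start of the data buffer */
    uintptr_t end;       /* address one past the end of the data buffer */
};

/* Returns true when the data must be copied into a private page before the
 * page could safely be given away to another domain. */
bool toy_needs_copy(const struct toy_skb *skb)
{
    size_t buf_len;

    assert(skb->end > skb->head);
    buf_len = (size_t)(skb->end - skb->head);

    if (skb->users != 1)              /* header is shared */
        return true;
    if (skb->data_refs != 1)          /* data is cloned */
        return true;
    if (buf_len <= TOY_PAGE_SIZE / 2) /* small buffer may share its page */
        return true;
    if (buf_len > TOY_PAGE_SIZE)      /* cannot fit in a single page */
        return true;
    /* Buffer must not straddle a page boundary. */
    if (skb->head / TOY_PAGE_SIZE != (skb->end - 1) / TOY_PAGE_SIZE)
        return true;

    return false;
}
```

If a buffer that should have failed one of these tests slips through, the
page given to the guest still holds somebody else's data, which would match
the corruption pattern described above.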

 -- Keir



Derek Glidden

2004-Jul-22 15:32 UTC

Re: [Xen-devel] segfault in VM

On Jul 22, 2004, at 8:53 AM, James Harper wrote:
> I am trying this now. Within a few seconds of starting the flood ping,
> dom1 rebooted. no messages in the logs to give any hint as to why
> though. Trying again and I didn't get anything useful either once I
> started getting noticeable corruption.

Just to corroborate, I've been able to pretty reliably induce
corruption and I have my Xen kernel compiled with "debug=y". Xen will
pretty much continuously spit out "GPF (0004)" messages, but I've only
ever seen it output "Bailing" a couple of times on a corruption. Most
of the time there's nothing when the corruption starts.


Derek Glidden

2004-Jul-22 15:38 UTC

Re: [Xen-devel] segfault in VM

On Jul 22, 2004, at 7:22 AM, Keir Fraser wrote:
>
> Anyway - currently sounds like the bug resides in the most complex
> half of the most complex driver. Who'd've thought it? ;-)

At this point this data is surely redundant but...

When I went to sleep last night I let my box run dom0 and four VMs 
doing md5sum checks on a couple of large files, hammering the heck out 
of the block i/o drivers and CPU but with all the ifaces/vifs on the 
machine down.  When I woke up, all compares had been correct for the 
six hours or so it ran.  I re-upped the ifaces and started to ping dom0 
and the VMs and within a minute of the pings starting dom0 started to 
report incorrect md5sums.


Keir Fraser

2004-Jul-22 17:48 UTC

Re: [Xen-devel] segfault in VM

It's useful to have the extra data points -- it adds to our confidence
that it's the network driver that is somehow at fault here.

Quite how to proceed in narrowing down the problem is
unclear. One approach is to perturb the backend driver's data path
(e.g., always copying packets into a known-safe page-sized buffer, as
a check that our current copy-avoidance checks are not at fault; and
replacing the current high-performance but convoluted code for
batching hypercalls with something slower but easier to grok). The
latter is useful because if the bug goes away then we have a smaller
chunk of code to look at; if the bug remains then we end up with a
less complex data path that is easier to instrument and bughunt.
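
For anyone wanting to try the "always copy" experiment, the idea can be
sketched in user space as below. This is only an illustration under stated
assumptions: 4 KiB pages, C11 aligned_alloc standing in for whatever page
allocator the driver would really use, and a hypothetical helper name
(toy_copy_to_private_page). It is not the backend driver's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: 4 KiB pages. */
#define TOY_PAGE_SIZE ((size_t)4096)

/* Copy a packet into a freshly allocated, page-aligned, page-sized buffer
 * that nothing else shares. Returns NULL if the packet does not fit in one
 * page or the allocation fails. The caller releases the page with free(). */
void *toy_copy_to_private_page(const void *pkt, size_t len)
{
    void *page;

    assert(pkt != NULL || len == 0);

    if (len > TOY_PAGE_SIZE)
        return NULL;

    page = aligned_alloc(TOY_PAGE_SIZE, TOY_PAGE_SIZE);
    if (page == NULL)
        return NULL;

    /* Zero the page so no stale data travels along with the packet. */
    memset(page, 0, TOY_PAGE_SIZE);
    if (len > 0)
        memcpy(page, pkt, len);
    return page;
}
```

Because every packet then lives alone in its own page regardless of how the
original buffer was laid out, a bug that persists with this change cannot be
blamed on the copy-avoidance checks alone.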

If anyone is interested in pursuing this bug independently, the
functions most under suspicion are netif_be_start_xmit and
net_rx_action, both in linux/arch/xen/drivers/netif/backend/main.c.
These two form the data path for packets getting sent to guest OSes.

 -- Keir


James Harper

2004-Jul-23 01:03 UTC

RE: [Xen-devel] segfault in VM

I just made a change so that the skbuf is always copied in netif_be_start_xmit
but it still crashes, which most likely means that bit is fine, or at least
isn't the only code containing bugs.

As another test I also put the 'goto done;' after the
'if ( skb_shared(skb) || skb_cloned(skb) || ...' block (still
block the receive but do it later) and there were no crashes, so I'm
comfortable that we've exhausted netif_be_start_xmit as a source of
bugs.

So I guess that leaves net_rx_action. I'm unsure on one thing though:
the pages that get passed from dom0 to domU, how/where/do they get recycled back
to dom0? Is it possible that domU could still write to a page that dom0 thought
it had free to use for something else? If so, where would that be?

Keir: have you been able to reproduce these errors at all?

James


Keir Fraser

2004-Jul-23 01:11 UTC

Re: [Xen-devel] segfault in VM

Yeah, it turns out I can reproduce this bug trivially by md5summing a
file just slightly bigger than dom0's memory allocation, while
floodpinging dom1.

I'm trying out a few things right now, so hopefully I'll be able to
report progress on this evil bug r.s.n. :-)

 -- Keir

James Harper

2004-Jul-23 04:49 UTC

RE: [Xen-devel] segfault in VM

That's comforting. I was starting to think of looking for gcc bugs and
the like.

Even so, it might be useful to collect the gcc versions of anyone who either has
seen the bug or has tried to reproduce it and can't. Mine reports
itself as "gcc (GCC) 3.3.4 (Debian 1:3.3.4-2)" with "gcc --version".

James

From: Keir Fraser
Sent: Fri 23/07/2004 11:11 AM
To: James Harper
Cc: Keir Fraser; Derek Glidden; xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

Yeah, it turns out I can reproduce this bug trivially by md5summing a
file just slightly bigger than dom0''s memory allocation, while
floodpinging dom1.

I''m trying out a few things right now, so hopefully I''ll be
able to
report progress on this evil bug r.s.n. :-)

 -- Keir
> I just made a change so that the skbuf is always copied in
netif_be_start_xmit but it still crashes, which means most likely that bit is
fine or at least isn''t the only code containing bugs.
> 
> As another test I also put the ''goto done;'' after the
''if ( skb_shared(skb) || skb_cloned(skb) || ...'' block, (still
block the receive but do it later) and there were no crashes, so i''m
comfortable that we''ve exhausted netif_be_start_xmit as a source for
bugs.
> 
> So I guess that leaves net_rx_action. I''m unsure on one thing
though, the pages that get passed from dom0 to domU, how/where/do they get
recycled back to dom0? Is it possible that domU could still write to a page that
dom0 thought it had free to use for something else? If so, where would that be?
> 
> Keir: have you been able to reproduce these errors at all?
> 
> James
> 
> 
> 
> 
> From: Keir Fraser
> Sent: Fri 23/07/2004 3:48 AM
> To: Derek Glidden
> Cc: xen-devel@lists.sourceforge.net
> Subject: Re: [Xen-devel] segfault in VM
> 
> 
> It''s useful to have the extra data points -- it adds to our
confidence
> that it''s the network driver that is somehow at fault here.
> 
> Quite how to proceed in narrowing down the problem is
> unclear. One approach is to perturb the backend driver''s data path
> (e.g., always copying packets into a known-safe page-sized buffer, as
> a check that our current copy-avoidancxe checks are not at fault; and
> replacing the current high-performance but convoluted code for
> batching hypercalls with something slower but easier to grok). The
> latter is useful because if the bug goes away then we have a smaller
> chunk of code to look at; if the bug remains then we end up with a
> less complex data path that is easier to instrument and bughunt.
> 
> If anyone is interested in pursuing this bug independently, the
> functions most under suspicion are netif_be_start_xmit and
> net_rx_action, both in linux/arch/xen/drivers/netif/backend/main.c.
> These two form the data path for packets getting sent to guest OSes.
> 
>  -- Keir
> 
> 
> > 
> > On Jul 22, 2004, at 7:22 AM, Keir Fraser wrote:
> > >
> > > Anyway - currently sounds like teh bug resides in the most
complex
> > > half of the most complex driver. Who''d''ve
thought it? ;-)
> > 
> > At this point this data is surely redundant but...
> > 
> > When I went to sleep last night I let my box run dom0 and four VMs 
> > doing md5sum checks on a couple of large files, hammering the heck out
> > of the block i/o drivers and CPU but with all the ifaces/vifs on the 
> > machine down.  When I woke up, all compares had been correct for the 
> > six hours or so it ran.  I re-upped the ifaces and started to ping
dom0
> > and the VMs and within a minute of the pings starting dom0 started to 
> > report incorrect md5sums.
> > 
> > -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> > "We all enter this world in the    | Support Electronic Freedom
> > same way: naked; screaming; soaked |        eff.org
> > in blood. But if you live your     |  anti-dmca.org
> > life right, that kind of thing     |---------------------------
> > doesn''t have to stop there." -- Dana Gould
> > 
> > 
> > 
> > -------------------------------------------------------
> > This SF.Net email is sponsored by BEA Weblogic Workshop
> > FREE Java Enterprise J2EE developer tools!
> > Get your free copy of BEA WebLogic Workshop 8.1 today.
> > ads.osdn.com/?ad_id=4721&alloc_id=10040&op=click
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.sourceforge.net
> > lists.sourceforge.net/lists/listinfo/xen-devel
> 
> 
> 
> -------------------------------------------------------
> This SF.Net email is sponsored by BEA Weblogic Workshop
> FREE Java Enterprise J2EE developer tools!
> Get your free copy of BEA WebLogic Workshop 8.1 today.
> ads.osdn.com/?ad_id=4721&alloc_id=10040&op=click
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.sourceforge.net
> lists.sourceforge.net/lists/listinfo/xen-devel -=- MIME -=- 
--_9E21B0BB-4B74-4723-AD6C-A6A06B6BFD7A_
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

I just made a change so that the skbuf is always copied in netif_be_start_xmit
but it still crashes, which means most likely that bit is fine or at least
isn''t the only code containing bugs.

As another test I also put the ''goto done;'' after the
''if ( skb_shared(skb) || skb_cloned(skb) || ...'' block, (still
block the receive but do it later) and there were no crashes, so i''m
comfortable that we''ve exhausted netif_be_start_xmit as a source for
bugs.

So I guess that leaves net_rx_action. I''m unsure on one thing though,
the pages that get passed from dom0 to domU, how/where/do they get recycled back
to dom0? Is it possible that domU could still write to a page that dom0 thought
it had free to use for something else? If so, where would that be?

Keir: have you been able to reproduce these errors at all?

James

From: Keir Fraser
Sent: Fri 23/07/2004 3:48 AM
To: Derek Glidden
Cc: xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

It''s useful to have the extra data points -- it adds to our confidence
that it''s the network driver that is somehow at fault here.

Quite how to proceed in narrowing down the problem is
unclear. One approach is to perturb the backend driver''s data path
(e.g., always copying packets into a known-safe page-sized buffer, as
a check that our current copy-avoidancxe checks are not at fault; and
replacing the current high-performance but convoluted code for
batching hypercalls with something slower but easier to grok). The
latter is useful because if the bug goes away then we have a smaller
chunk of code to look at; if the bug remains then we end up with a
less complex data path that is easier to instrument and bughunt.

If anyone is interested in pursuing this bug independently, the
functions most under suspicion are netif_be_start_xmit and
net_rx_action, both in linux/arch/xen/drivers/netif/backend/main.c.
These two form the data path for packets getting sent to guest OSes.

 -- Keir

>
> On Jul 22, 2004, at 7:22 AM, Keir Fraser wrote:
> >
> > Anyway - currently sounds like the bug resides in the most complex
> > half of the most complex driver. Who'd've thought it? ;-)
>
> At this point this data is surely redundant but...
>
> When I went to sleep last night I let my box run dom0 and four VMs
> doing md5sum checks on a couple of large files, hammering the heck out
> of the block i/o drivers and CPU but with all the ifaces/vifs on the
> machine down.  When I woke up, all compares had been correct for the
> six hours or so it ran.  I re-upped the ifaces and started to ping dom0
> and the VMs and within a minute of the pings starting dom0 started to
> report incorrect md5sums.


Keir Fraser

2004-Jul-23 16:01 UTC

Re: [Xen-devel] segfault in VM - FIXED!

Okay, so I found that the problem is due to overly-aggressive merging
of block requests in the IDE driver. The code assumes that if buffers
are adjacent in virtual or physical address space then they can be
merged --- this isn't always the case over Xen since those physical
addresses may map to different real machine pages.
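
To make the failure mode concrete, here is a minimal user-space sketch of the idea; it is not the IDE driver or Xen code. The table `p2m`, the helper `to_machine()`, and the two predicates are hypothetical names invented for this illustration, and the frame numbers are made up. It assumes 4 KiB pages and shows two buffers that look mergeable by guest physical address but are not adjacent in machine memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  ((uint64_t)1 << PAGE_SHIFT)

/* Hypothetical table for illustration only: the index is the guest's
 * physical frame number, the value is the real machine frame behind it.
 * Frames 0 and 1 happen to be machine-adjacent; frames 1 and 2 are not. */
static const uint64_t p2m[] = { 0x100, 0x101, 0x2f7, 0x102 };
#define NR_FRAMES (sizeof(p2m) / sizeof(p2m[0]))

/* Translate a guest physical address to a machine address. */
uint64_t to_machine(uint64_t phys)
{
    uint64_t pfn = phys >> PAGE_SHIFT;
    assert(pfn < NR_FRAMES);
    return (p2m[pfn] << PAGE_SHIFT) | (phys & (PAGE_SIZE - 1));
}

/* The assumption described above: buffer b starts where buffer a ends. */
bool phys_adjacent(uint64_t a, uint64_t a_len, uint64_t b)
{
    return a + a_len == b;
}

/* What a single DMA segment actually needs: the last byte of a and the
 * first byte of b must also be neighbours in machine memory.
 * a_len must be non-zero. */
bool machine_adjacent(uint64_t a, uint64_t a_len, uint64_t b)
{
    return phys_adjacent(a, a_len, b) &&
           to_machine(a + a_len - 1) + 1 == to_machine(b);
}
```

With this table, `phys_adjacent(PAGE_SIZE, PAGE_SIZE, 2 * PAGE_SIZE)` is true while `machine_adjacent()` with the same arguments is false, which is the case where a merge would point the hardware at the wrong page.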

I've checked in a fix that I think is safe for IDE --- in the
occasional instances that a merged scatter-gather list is invalid, we
should now cause IDE to fall back to a super-safe mode (basically
PIO). On my system this happens so occasionally that performance
shouldn't be affected.

If this also turns out to be a problem for SCSI then we may need to do
some more work --- our safety check will still trigger and we will
still fail the scatter-gather list, but it doesn't look as though many
SCSI drivers pick up the error return code and do anything sane. This
is a bug in those drivers, but this is small comfort to us in our aim
to work with the full range of Linux SCSI drivers.

What we need now is some more checking, particularly with SCSI block
devices, to see whether there are any more bugs to shake out.

 -- Keir

> 
> Yeah, it turns out I can reproduce this bug trivially by md5summing a
> file just slightly bigger than dom0's memory allocation, while
> floodpinging dom1.
> 
> I'm trying out a few things right now, so hopefully I'll be able to
> report progress on this evil bug r.s.n. :-)
> 
>  -- Keir
> 

Derek Glidden

2004-Jul-23 17:44 UTC

Re: [Xen-devel] segfault in VM - FIXED!

On Jul 23, 2004, at 12:01 PM, Keir Fraser wrote:
>
> Okay, so I found that the problem is due to overly-aggressive merging
> of block requests in the IDE driver. The code assumes that if buffers
> are adjacent in virtual or physical address space then they can be
> merged --- this isn't always the case over Xen since those physical
> addresses may map to different real machine pages.

And there was much rejoicing!

Thanks Keir for working so hard on digging this problem out and getting 
a fix in.

Other than the doms not dying after a halt, for which you said you checked 
in a fix, and the occasional strange unbalanced dom scheduling, which I 
understand is being addressed in the scheduler work, the -unstable branch has 
worked very well for me so far.  (Well, outside of the random 
crashes... :)

I'll do a pull tonight when I get home and rebuild everything and start
hammering on it some more.

> I've checked in a fix that I think is safe for IDE --- in the
> occasional instances that a merged scatter-gather list is invalid, we
> should now cause IDE to fall back to a super-safe mode (basically
> PIO). On my system this happens so occasionally that performance
> shouldn't be affected.

Does it revert back to "normal" behaviour for subsequent operations?  
i.e. is the "basically PIO" mode just for the operation that fails?

> What we need now is some more checking, particularly with SCSI block
> devices, to see whether there are any more bugs to shake out.

Would it help at all for me to set up a box as ide-scsi, or is it 
strictly the data path inside the individual SCSI drivers that could 
cause problems?

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I think that's what they mean by    |
"nickels a day can feed a child."   |       eff.org
I thought, "How can food be so      | anti-dmca.org
cheap over there?"  It's not, they   |--------------------------
just eat the nickels." -- Peter Nguyen



Keir Fraser

2004-Jul-23 17:55 UTC

Re: [Xen-devel] segfault in VM - FIXED!

> > I've checked in a fix that I think is safe for IDE --- in the
> > occasional instances that a merged scatter-gather list is invalid, we
> > should now cause IDE to fall back to a super-safe mode (basically
> > PIO). On my system this happens so occasionally that performance
> > shouldn't be affected.
> 
> Does it revert back to "normal" behaviour for subsequent operations?
> i.e. is the "basically PIO" mode just for the operation that fails?

That is correct -- in practice very very few requests should end up
using PIO.
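
The per-request nature of the fallback can be pictured with a small user-space sketch. This is only an illustration under assumptions, not the checked-in fix: `struct sg_segment`, `segment_is_contiguous()`, `choose_mode()` and the `XFER_*` names are hypothetical, and each segment is modelled simply as the list of machine frames behind it. The point is that the decision is recomputed for every request and no state is kept, so one bad scatter-gather list does not push later requests into PIO.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum xfer_mode { XFER_DMA, XFER_PIO };

/* One merged scatter-gather segment, described by the machine frames
 * backing it (a hypothetical representation for this sketch). */
struct sg_segment {
    const unsigned long *mfns;
    size_t nr_frames;
};

/* A segment is usable for DMA only if its machine frames are consecutive. */
bool segment_is_contiguous(const struct sg_segment *seg)
{
    assert(seg != NULL);
    for (size_t i = 1; i < seg->nr_frames; i++)
        if (seg->mfns[i] != seg->mfns[i - 1] + 1)
            return false;
    return true;
}

/* Pick the transfer mode for ONE request. Nothing is remembered between
 * calls, so a PIO fallback applies only to the request that needed it. */
enum xfer_mode choose_mode(const struct sg_segment *sg, size_t nr_segs)
{
    for (size_t i = 0; i < nr_segs; i++)
        if (!segment_is_contiguous(&sg[i]))
            return XFER_PIO;
    return XFER_DMA;
}
```

Calling `choose_mode()` on a valid list, then an invalid one, then the valid one again yields DMA, PIO, DMA in that order.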

> > What we need now is some more checking, particularly with SCSI block
> > devices, to see whether there are any more bugs to shake out.
> 
> Would it help at all for me to set up a box as ide-scsi, or is it 
> strictly the data path inside the individual SCSI drivers that could 
> cause problems?

Stress-testing in as many environments and setups as possible is very
welcome! 

I've also contacted the linux-kernel mailing list to find out whether
anyone there has a better fix, or would be amenable to some
Xen-friendly patches being sent their way. ;-)

 -- Keir



Chris Andrews

2004-Jul-23 19:14 UTC

Re: [Xen-devel] segfault in VM - FIXED!

On 23 Jul 2004, at 17:01, Keir Fraser wrote:
> What we need now is some more checking, particularly with SCSI block
> devices, to see whether there are any more bugs to shake out.

I've given this change a go on my PE1650 (aacraid driver). 
Unfortunately this seems to be one of the SCSI drivers that doesn't 
correctly handle the error condition.

Running my usual test ('compare' in dom0, compiles in other domains), I
don't see any differences in the compares, but after a few minutes, I 
get the following on the console, and everything is stuck waiting for 
disk.

aacraid: cmd len 00000000 cmd underflow 00010000
aacraid: Host adapter reset request. SCSI hang ?

The latter message repeats every few seconds. I rebooted the box with 
the Xen console after a few lines. I'm going to try the aacraid driver 
from 2.4.27-rc3, which I believe has had some attention recently.

Chris.


James Harper

2004-Jul-24 08:52 UTC

RE: [Xen-devel] segfault in VM - FIXED!

My system doesn't have any IDE devices; it's SCSI only. The
SCSI driver is aic7xxx, and I'm still having crashes even with the
latest checkout. I noticed in the logs for the first time some SCSI errors in
amongst all the others, but given the nature of the crash I don't know
if that means anything.

Is this the same problem that we thought was in the network code? I could not
readily induce the crash without creating lots of network traffic.

James




Chris Andrews

2004-Jul-24 15:54 UTC

Re: [Xen-devel] segfault in VM - FIXED!

On 24 Jul 2004, at 13:47, I wrote:
> I'm just testing a patch which disables merging in the scsi layer when
> it believes it has contiguous requests in different pages. I think 
> this is more pessimistic than it needs to be, as the pages may after 
> all be contiguous, but it does allow some merging to happen and so 
> far seems to be stable.

I've given this a bit more testing, and it seems to be working fine - 
the machine is now running a dom0 kernel built while running the patch. 
As for performance, it's 'not bad' -- I've just done a bonnie++ run,
and some compiles. Based on sticking printks in and watching the 
console, it's allowing merges much more often than not, but still I 
suspect not as much as it could. Probably it should use something with 
more arch-knowledge than page_to_phys().

patch for linux-2.4.26-xen0: 
munky.nodnol.org/~chris/xen_scsi_merge.diff
bonnie++ stats: munky.nodnol.org/~chris/munkyII_stats.txt

Chris.




James Harper

2004-Jul-25 09:27 UTC

RE: [Xen-devel] segfault in VM - FIXED!

I'm building this now.

The way I see it, currently it must be incorrect for only a very very small
number of cases or the system would crash and burn almost instantly. So in
theory, unless these cases are undetectable, or the cost of detecting them is
high for some reason, the performance difference should be almost unnoticeable.

I assume the patch would only affect dom0 and so it shouldn't matter whether
domU is patched or not. Is there a way of installing a patch so that it's
picked up by 'make world'?

I'll follow up with results shortly.

James


James Harper

2004-Jul-25 11:24 UTC

RE: [Xen-devel] segfault in VM - FIXED!

So far so good. It's been running for a while now with no errors, much
longer than it would have survived previously.

James

From: James Harper
Sent: Sun 25/07/2004 7:27 PM
To: Chris Andrews; xen-devel@lists.sourceforge.net
Subject: RE: [Xen-devel] segfault in VM - FIXED!

I''m building this now.

The way I see it, currently it must be incorrect for only a very very small
number of cases or the system would crash and burn almost instantly. So in
theory, unless these cases are undetectable, or the cost of detecting them is
high for some reason, the performance difference should be almost unnoticable

I assume the patch would only affect dom0 and so should matter if domU is
patched or not. Is there a way of installing a patch so that it''s
picked up by ''make world''?

i''ll follow up with results shortly.

James

From: Chris Andrews
Sent: Sun 25/07/2004 1:54 AM
To: xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM - FIXED!

On 24 Jul 2004, at 13:47, I wrote:
> I'm just testing a patch which disables merging in the scsi layer when
> it believes it has contiguous requests in different pages. I think
> this is more pessimistic than it needs to be, as the pages may after
> all be contiguous, but it does allow some merging to happen and so
> far seems to be stable.
I've given this a bit more testing, and it seems to be working fine -
the machine is now running a dom0 kernel built while running the patch.
As for performance, it's 'not bad' -- I've just done a bonnie++ run,
and some compiles. Based on sticking printks in and watching the
console, it's allowing merges much more often than not, but still I
suspect not as much as it could. Probably it should use something with
more arch-knowledge than page_to_phys().

patch for linux-2.4.26-xen0: 
munky.nodnol.org/~chris/xen_scsi_merge.diff
bonnie++ stats: munky.nodnol.org/~chris/munkyII_stats.txt

Chris.

Chris Andrews

2004-Jul-25 15:08 UTC

Re: [Xen-devel] segfault in VM - FIXED!

On 25 Jul 2004, at 12:24, James Harper wrote:
> so far so good. It's been running for a while now with no errors. much
> longer than it would have survived previously.
It's broken for me - I suspect it's that although it checks that
requests to be merged begin in the same page, it doesn't also check
they end in that same page. I'm testing a version now that tries to do
that.

Chris.
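
The begin/end distinction described above can be sketched as follows. This is only an illustration under stated assumptions, not the patch itself: `can_merge_within_page()` is a hypothetical helper, addresses are plain integers, and pages are assumed to be 4 KiB. The rule shown is the conservative one the message is aiming for: merge only when the second request directly follows the first and the combined span never leaves the page it started in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Page number that an address falls in. */
static uint64_t page_of(uint64_t addr)
{
    return addr >> PAGE_SHIFT;
}

/* Conservative merge rule (hypothetical): request b must start exactly
 * where request a ends, and the merged span, from the first byte of a to
 * the last byte of b, must stay inside a single page. Checking only where
 * the requests begin would miss the case where b runs into the next page. */
static int can_merge_within_page(uint64_t a, size_t len_a,
                                 uint64_t b, size_t len_b)
{
    assert(len_a > 0 && len_b > 0);
    if (a + len_a != b)
        return 0;                      /* not adjacent at all */
    return page_of(a) == page_of(b + len_b - 1);
}
```

For example, two adjacent 0x400-byte requests at 0x1000 and 0x1400 pass, while a request at 0x1e00 of length 0x400 following one at 0x1c00 is refused: both begin in the same page, but the second one ends in the next.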



James Harper

2004-Jul-25 23:23 UTC

RE: [Xen-devel] segfault in VM - FIXED!

I was running my diff script all night, which itself reported no errors, but this
morning I have the following in dom0's kern.log:

Jul 25 21:53:58 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 25 23:02:49 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 25 23:31:25 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 01:07:55 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 01:38:59 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 02:35:21 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 02:47:33 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 04:55:37 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 06:32:56 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 06:59:22 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 08:00:19 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
Jul 26 08:24:50 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2

and in dom2:

Jul 25 21:53:58 mail2 kernel: bad buffer on RX ring!(-1)
Jul 25 23:02:49 mail2 kernel: bad buffer on RX ring!(-1)
Jul 25 23:31:25 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 01:07:55 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 01:38:59 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 02:35:21 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 02:47:33 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 04:55:37 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 06:32:56 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 06:59:22 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 08:00:19 mail2 kernel: bad buffer on RX ring!(-1)
Jul 26 08:24:50 mail2 kernel: bad buffer on RX ring!(-1)

So something funny is going on. I started my diff and ping scripts at about
21:20. At least the above error is detected though.

James



From: Chris Andrews
Sent: Mon 26/07/2004 1:08 AM
To: James Harper
Cc: xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM - FIXED!


On 25 Jul 2004, at 12:24, James Harper wrote:
> so far so good. It's been running for a while now with no errors. much
> longer than it would have survived previously.
It's broken for me - I suspect it's that although it checks that
requests to be merged begin in the same page, it doesn't also check
they end in that same page. I'm testing a version now that tries to do
that.

Chris.

Keir Fraser

2004-Jul-26 12:07 UTC

Re: [Xen-devel] segfault in VM - FIXED!

> 
> On 23 Jul 2004, at 17:01, Keir Fraser wrote:
> >
> > What we need now is some more checking, particularly with SCSI block
> > devices, to see whether there are any more bugs to shake out.
> 
> I've given this change a go on my PE1650 (aacraid driver).
> Unfortunately this seems to be one of the SCSI drivers that doesn't
> correctly handle the error condition.
> 
> Running my usual test ('compare' in dom0, compiles in other domains), I
> don't see any differences in the compares, but after a few minutes, I
> get the following on the console, and everything is stuck waiting for
> disk.
> 
> aacraid: cmd len 00000000 cmd underflow 00010000
> aacraid: Host adapter reset request. SCSI hang ?
> 
> The latter message repeats every few seconds. I rebooted the box with
> the Xen console after a few lines. I'm going to try the aacraid driver
> from 2.4.27-rc3, which I believe has had some attention recently.
Looks like the SCSI-merge code will have to be modified. I'll do
IDE-merge code at the same time, so that we just merge less
aggressively rather than falling back to PIO transfers.

I'll leave the check in pci_map_sg(), but it shouldn't ever trigger
after I patch the IDE and SCSI merge routines, so I'll add a warning
message if an invalid scatter-gather list is detected.

 -- Keir
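
As a rough illustration of a "warn on an invalid scatter-gather list" safety net, here is a small sketch. It is not the kernel's pci_map_sg() and does not use its signature or types: `struct toy_sg_entry`, `toy_sg_entry_crosses_page()` and `toy_check_sg()` are hypothetical, and "invalid" is assumed here to mean simply that an entry crosses a 4 KiB page boundary, which is stricter than a real check would need to be if the pages involved are in fact contiguous.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_PAGE_SHIFT 12

/* Hypothetical, simplified scatter-gather entry: start address and length. */
struct toy_sg_entry {
    uint64_t addr;
    size_t   len;
};

/* An entry is treated as invalid if its first and last byte are in
 * different pages (assumption made for this sketch). */
static int toy_sg_entry_crosses_page(const struct toy_sg_entry *e)
{
    assert(e->len > 0);
    return (e->addr >> TOY_PAGE_SHIFT) !=
           ((e->addr + e->len - 1) >> TOY_PAGE_SHIFT);
}

/* Walk the list, print a warning for every invalid entry, and return how
 * many were found. Once the merge routines are fixed this should be 0. */
static size_t toy_check_sg(const struct toy_sg_entry *sg, size_t nents)
{
    size_t bad = 0;

    for (size_t i = 0; i < nents; i++) {
        if (toy_sg_entry_crosses_page(&sg[i])) {
            fprintf(stderr,
                    "warning: sg entry %zu (addr=0x%llx, len=%zu) "
                    "crosses a page boundary\n",
                    i, (unsigned long long)sg[i].addr, sg[i].len);
            bad++;
        }
    }
    return bad;
}
```

The point of keeping such a check after the merge code is patched is that it costs little and turns a silent data-corruption bug into a visible log line if a bad list ever slips through again.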


Keir Fraser

2004-Jul-26 12:12 UTC

Re: [Xen-devel] segfault in VM - FIXED!

Looks like this is a very occasional failure, from the timestamps
between messages. If you make a debug build of Xen then we'll get some
info as to why the page transfer is failing.

 -- Keir
> I was running my diff script all night which itself reported no errors, but
> this morning I have the following in dom0's kern.log:
> 
> Jul 25 21:53:58 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 25 23:02:49 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 25 23:31:25 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 01:07:55 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 01:38:59 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 02:35:21 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 02:47:33 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 04:55:37 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 06:32:56 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 06:59:22 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 08:00:19 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> Jul 26 08:24:50 xen1 kernel: (file=main.c, line=270) Failed MMU update transferring to DOM2
> 
> and in dom2:
> 
> Jul 25 21:53:58 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 25 23:02:49 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 25 23:31:25 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 01:07:55 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 01:38:59 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 02:35:21 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 02:47:33 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 04:55:37 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 06:32:56 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 06:59:22 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 08:00:19 mail2 kernel: bad buffer on RX ring!(-1)
> Jul 26 08:24:50 mail2 kernel: bad buffer on RX ring!(-1)
> 
> so something funny is going on. i started my diff and ping scripts at about
> 21:20. At least the above error is detected though.
> 
> James
> 
> 
> 
> From: Chris Andrews
> Sent: Mon 26/07/2004 1:08 AM
> To: James Harper
> Cc: xen-devel@lists.sourceforge.net
> Subject: Re: [Xen-devel] segfault in VM - FIXED!
> 
> 
> On 25 Jul 2004, at 12:24, James Harper wrote:
> 
> > so far so good. It's been running for a while now with no errors. much
> > longer than it would have survived previously.
> 
> It's broken for me - I suspect it's that although it checks that
> requests to be merged begin in the same page, it doesn't also check
> they end in that same page. I'm testing a version now that tries to do
> that.
> 
> Chris.