thr3ads.net - Xen devel - [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0 [Mar 2010]

Joanna Rutkowska

2010-Mar-06 10:12 UTC

[Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

There is a nasty data corruption problem most likely allowed by a bug in
the Xen 4.0.0-x hypervisors.

The problem occurs with a frequency of "a few chunks per 10 GB of data
copied", and only when running a VM (PV domU) with a specific kernel.
The problem, however, affects not only the VM but also the Dom0, which
is of significant importance.

How to reproduce:

1) Start at least one Xen PV VM with a pvops0 kernel. One kernel known
to demonstrate the problem is the one built by Michael Young, based on
xen/master git from Dec 23. It has recently been replaced by a newer
kernel, which doesn't always show the problem, but I uploaded the
previous one at the URL below, so people can use it for testing:

http://invisiblethingslab.com/pub/kernel-2.6.31.9-1.2.82.xendom0.fc12.x86_64.rpm

Now you can start a dummy VM with this kernel, e.g.:

# xm create -c /dev/null memory=400 kernel=<path/to/kernel> extra="rootdelay=1000"

2) Now, in Dom0, after having started this dummy VM, create a big test
file, filled all with zeros. Make sure to choose a size bigger than your
DRAM size, to avoid fs caching effect, e.g.:

$ dd if=/dev/zero of=test bs=1M count=10000

That should create a 10GB file. Make sure to use /dev/zero and not
/dev/null!

3) Once the test file has been created, check whether it really consists
of zeros only:

$ xxd test | grep -v "0000 0000 0000 0000 0000 0000 0000 0000"

Normally you should not get any output. However, I consistently get
something like this:

4593a000: 940d 0000 0000 0000 2d40 d6fc c803 0000  ........-@......
4593a010: 00f6 1f52 b301 0000 b620 dcd5 ff00 0000  ...R..... ......
a5df0000: e542 712c 77da c9f9 a429 4b85 ecc4 9395  .Bq,w....)K.....
a5df0010: d9d6 971f 0d58 5c70 aba6 387d 805f 09e2  .....X\p..8}._..
ceecb000: f80d 0000 0000 0000 096e 1cdc e403 0000  .........n......
ceecb010: 2460 7ef6 be01 0000 b620 dcd5 ff00 0000  $`~...... ......
148432000: 580e 0000 0000 0000 5665 ed9d ff03 0000  X.......Ve......
148432010: 7bcc a023 ca01 0000 b620 dcd5 ff00 0000  {..#..... ......
1c548b000: bc0e 0000 0000 0000 6942 387d 1b04 0000  ........iB8}....
1c548b010: 872b 01c8 d501 0000 b620 dcd5 ff00 0000  .+....... ......
225d45000: 4448 27cd b966 b37e 1f0c e9e3 c2db b6ee  DH'..f.~........
225d45010: d2b2 55b8 9ef1 e818 a7e3 364d 2322 dc75  ..U.......6M#".u
242056000: 140f 0000 0000 0000 0bb0 3704 3404 0000  ..........7.4...
242056010: 9601 b606 e001 0000 b620 dcd5 ff00 0000  ......... ......

The actual data vary between tests; however, the "dcd5 ff00 0000"
pattern seems to be repeatable on a given system with a given hypervisor
binary (the above numbers are for Xen-4.0.0-rc5 built from Michael
Young's SRPM). The errors always occur in chunks of 32 bytes.
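
The check in step 3 can also be scripted so that it reports chunk
offsets and lengths directly, which makes the "32 bytes" observation
easy to confirm. The following is only a sketch, not something used in
the original tests: the function name find_corrupt_chunks and the
16-byte row size (chosen to mirror xxd's default row width) are
assumptions made for illustration.

```python
ROW = 16  # bytes per row, mirroring xxd's default output width


def find_corrupt_chunks(path, block_size=1 << 20):
    """Return (offset, length) for each run of adjacent 16-byte rows
    that contain at least one non-zero byte.

    block_size should be a multiple of ROW so that rows stay aligned.
    """
    chunks = []
    offset = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            # Fast path: skip blocks that are entirely zero.
            if block.count(0) != len(block):
                for i in range(0, len(block), ROW):
                    row = block[i:i + ROW]
                    if any(row):
                        start = offset + i
                        # Merge with the previous chunk if directly adjacent.
                        if chunks and chunks[-1][0] + chunks[-1][1] == start:
                            chunks[-1][1] += len(row)
                        else:
                            chunks.append([start, len(row)])
            offset += len(block)
    return [tuple(c) for c in chunks]


# Example use against the file written by dd in step 2:
#   for off, length in find_corrupt_chunks("test"):
#       print(f"{off:#x}: {length} bytes")
```

Rows are merged only when they are directly adjacent, so two separate
32-byte chunks in the same page would still be reported separately.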

We have tested this in our lab on three different machines, with various
Dom0 kernels -- based on xen/master (AKA xen/stable-2.6.31) and
xen/stable (AKA xen/stable-2.6.32) -- and with a few Xen 4 hypervisors
(rc2, rc4, rc5). Not every kernel allows for reproducing the error with
such a simple "dummy" VM as the one given above -- e.g. the 2.6.32-based
kernels required some more regular VMs to be started for the problem to
be noticeable. However, with the previously mentioned kernel (M. Young
Dec23), the problem has been 100% reproducible for us.

When we downgraded to Xen 3.4.2, the problem went away.

Of course this problem cannot be attributed to a buggy VM kernel, as the
hypervisor should be resistant to any kind of "wrong" software (buggy or
malicious) that executes in a VM.

It's really interesting how much control does the VM have over the data
(and location) that are corrupted in Dom0 -- if it has any control, then
it might allow for an interesting VM escape attack perhaps :)

Unfortunately we don't have time to investigate this problem any further
in our lab.

Regards,
joanna.



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Joanna Rutkowska

2010-Mar-06 11:53 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 03/06/2010 07:02 AM, Keir Fraser wrote:
> On 06/03/2010 10:12, "Joanna Rutkowska" <joanna@invisiblethingslab.com>
> wrote:
>
>> It's really interesting how much control does the VM have over the data
>> (and location) that are corrupted in Dom0 -- if it has any control, then
>> it might allow for an interesting VM escape attack perhaps :)
>>
>> Unfortunately we don't have time to investigate this problem any further
>> in our lab.
>
> Thanks, I'll see if I can repro with your simple setup. It's an interesting
> one since presumably the domU is not doing much other than waiting on its
> rootdelay timeout when the corruption manifests. Sounds like the dom0 kernel
> version doesn't matter at all?

Yes, I tried at least a few different Dom0 kernels (based on 2.6.31 and
2.6.32 git).

One correction to the report: I think I actually haven't tried a
2.6.32-based kernel in the VM -- only in Dom0, and Rafal tried 2.6.32
in a VM and it didn't show the corruption in that case. So, it is
something specific to the xen/master kernel branch (and 4.0 hypervisors).

joanna.




Keir Fraser

2010-Mar-06 12:02 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 06/03/2010 10:12, "Joanna Rutkowska" <joanna@invisiblethingslab.com>
wrote:

> It's really interesting how much control does the VM have over the data
> (and location) that are corrupted in Dom0 -- if it has any control, then
> it might allow for an interesting VM escape attack perhaps :)
>
> Unfortunately we don't have time to investigate this problem any further
> in our lab.

Thanks, I'll see if I can repro with your simple setup. It's an interesting
one since presumably the domU is not doing much other than waiting on its
rootdelay timeout when the corruption manifests. Sounds like the dom0 kernel
version doesn't matter at all?

 Regards,
 Keir




Keir Fraser

2010-Mar-06 13:36 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 06/03/2010 12:02, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

> On 06/03/2010 10:12, "Joanna Rutkowska" <joanna@invisiblethingslab.com>
> wrote:
>
>> It's really interesting how much control does the VM have over the data
>> (and location) that are corrupted in Dom0 -- if it has any control, then
>> it might allow for an interesting VM escape attack perhaps :)
>>
>> Unfortunately we don't have time to investigate this problem any further
>> in our lab.
>
> Thanks, I'll see if I can repro with your simple setup. It's an interesting
> one since presumably the domU is not doing much other than waiting on its
> rootdelay timeout when the corruption manifests. Sounds like the dom0 kernel
> version doesn't matter at all?

Tried a few times and no luck reproducing so far. I hope some other people
on the list also will give it a go, since it's so easy to try it out.

 -- Keir




Joanna Rutkowska

2010-Mar-06 13:37 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 03/06/2010 08:56 AM, Keir Fraser wrote:
> On 06/03/2010 13:25, "Joanna Rutkowska" <joanna@invisiblethingslab.com>
> wrote:
>
>>> Tried a few times and no luck reproducing so far. I hope some other people
>>> on the list also will give it a go, since it's so easy to try it out.
>>>
>> Which versions of the hypervisor and Dom0 have you tried? Perhaps some
>> custom builds? This problem should be reproducible for sure with the
>> Dom0 based on the kernel I already mentioned:
>>
>> http://invisiblethingslab.com/pub/kernel-2.6.31.9-1.2.82.xendom0.fc12.x86_64.rpm
>
> I'll see if I can find time to upgrade my dom0 kernel next week. I currently
> run my own non-modular 2.6.18 dom0.

We never tested this on a non-pvops kernel in Dom0, so perhaps this is
why you got no symptoms...

j.




Keir Fraser

2010-Mar-06 17:18 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 06/03/2010 13:37, "Joanna Rutkowska" <joanna@invisiblethingslab.com>
wrote:

>>> http://invisiblethingslab.com/pub/kernel-2.6.31.9-1.2.82.xendom0.fc12.x86_64.rpm
>>
>> I'll see if I can find time to upgrade my dom0 kernel next week. I currently
>> run my own non-modular 2.6.18 dom0.
>
> We never tested this on a non-pvops kernel in Dom0, so perhaps this is
> why you got no symptoms...

I'll try with tip of xen-unstable (basically Xen 4.0.0 latest RC), and tip
of our pv_ops development repo
(git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git, branch
xen/master) for dom0. Those are what will form the upcoming release, so they
are the primary scope within which we care about reproducibility.

 -- Keir




Pasi Kärkkäinen

2010-Mar-07 14:36 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On Sat, Mar 06, 2010 at 01:36:15PM +0000, Keir Fraser wrote:
> On 06/03/2010 12:02, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:
>
> > On 06/03/2010 10:12, "Joanna Rutkowska" <joanna@invisiblethingslab.com>
> > wrote:
> >
> >> It's really interesting how much control does the VM have over the data
> >> (and location) that are corrupted in Dom0 -- if it has any control, then
> >> it might allow for an interesting VM escape attack perhaps :)
> >>
> >> Unfortunately we don't have time to investigate this problem any further
> >> in our lab.
> >
> > Thanks, I'll see if I can repro with your simple setup. It's an interesting
> > one since presumably the domU is not doing much other than waiting on its
> > rootdelay timeout when the corruption manifests. Sounds like the dom0 kernel
> > version doesn't matter at all?
>
> Tried a few times and no luck reproducing so far. I hope some other people
> on the list also will give it a go, since it's so easy to try it out.

I'm able to reproduce this with xen/master 2.6.31.6 dom0 kernel (from
2010-02-20), but I'm not able to reproduce it with the current xen/stable
2.6.32.9.

I'll try with the most recent 2.6.31.6 dom0 kernel as well..

-- Pasi



Keir Fraser

2010-Mar-07 14:39 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 07/03/2010 14:36, "Pasi Kärkkäinen" <pasik@iki.fi> wrote:

>> Tried a few times and no luck reproducing so far. I hope some other people
>> on the list also will give it a go, since it's so easy to try it out.
>>
>
> I'm able to reproduce this with xen/master 2.6.31.6 dom0 kernel (from
> 2010-02-20),
> but I'm not able to reproduce it with the current xen/stable 2.6.32.9.
>
> I'll try with the most recent 2.6.31.6 dom0 kernel as well..

Thanks Pasi!

 K.




Pasi Kärkkäinen

2010-Mar-07 16:12 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On Sun, Mar 07, 2010 at 02:39:09PM +0000, Keir Fraser wrote:
> On 07/03/2010 14:36, "Pasi Kärkkäinen" <pasik@iki.fi> wrote:
>
> >> Tried a few times and no luck reproducing so far. I hope some other people
> >> on the list also will give it a go, since it's so easy to try it out.
> >>
> >
> > I'm able to reproduce this with xen/master 2.6.31.6 dom0 kernel (from
> > 2010-02-20),
> > but I'm not able to reproduce it with the current xen/stable 2.6.32.9.
> >
> > I'll try with the most recent 2.6.31.6 dom0 kernel as well..
>
> Thanks Pasi!

It seems to happen with the latest xen/master 2.6.31.6 as well!

-- Pasi



Jeremy Fitzhardinge

2010-Mar-08 22:24 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 03/06/2010 02:12 AM, Joanna Rutkowska wrote:
> [...]
>
> Of course this problem cannot be attributed to a buggy VM kernel, as the
> hypervisor should be resistant to any kind of "wrong" software (buggy or
> malicious) that executes in a VM.

Why "of course"?  Your report looks to me like a bug in dom0 which is
causing data corruption when there's another domain running.  I don't
see anything that specifically implicates Xen.  The fact that the
symptoms change with a different Xen version could mean a kernel bug is
affected by the Xen version (different memory layout, for example, or
different paths in the kernel caused by different feature availability).

     J


Joanna Rutkowska

2010-Mar-08 22:34 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 03/08/2010 11:24 PM, Jeremy Fitzhardinge wrote:
> On 03/06/2010 02:12 AM, Joanna Rutkowska wrote:
>> [...]
>>
>> Of course this problem cannot be attributed to a buggy VM kernel, as the
>> hypervisor should be resistant to any kind of "wrong" software (buggy or
>> malicious) that executes in a VM.
>
> Why "of course"?  Your report looks to me like a bug in dom0 which is
> causing data corruption when there's another domain running.

Please note that the "of course" sentence refers to the *VM* kernel, not
Dom0.

> I don't see anything that specifically implicates Xen.  The fact that
> the symptoms change with a different Xen version could mean a kernel
> bug is affected by the Xen version (different memory layout, for
> example, or different paths in the kernel caused by different feature
> availability).

Sure, it can theoretically be anything, perhaps even a generic bug in
IA32 just accidentally triggered by some magic value in a register ;) As
I said in the first sentence it seems (to me) "most likely" to be a bug
in the hypervisor, but there is only one way to find out where it is for
sure... (to nail it down (and I'm very sorry that I cannot help with the
quest right now))

joanna.




Jeremy Fitzhardinge

2010-Mar-08 23:12 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 03/08/2010 02:34 PM, Joanna Rutkowska wrote:
> On 03/08/2010 11:24 PM, Jeremy Fitzhardinge wrote:
>> On 03/06/2010 02:12 AM, Joanna Rutkowska wrote:
>>> [...]
>>>
>>> Of course this problem cannot be attributed to a buggy VM kernel, as the
>>> hypervisor should be resistant to any kind of "wrong" software (buggy or
>>> malicious) that executes in a VM.
>>
>> Why "of course"?  Your report looks to me like a bug in dom0 which is
>> causing data corruption when there's another domain running.
>
> Please note that the "of course" sentence refers to the *VM* kernel, not
> Dom0.

OK, but your terminology is imprecise, since dom0 is a "VM" as well.
Yes, the domU kernel must be blameless.

>> I don't see anything that specifically implicates Xen.  The fact that
>> the symptoms change with a different Xen version could mean a kernel
>> bug is affected by the Xen version (different memory layout, for
>> example, or different paths in the kernel caused by different feature
>> availability).
>
> Sure, it can theoretically be anything, perhaps even a generic bug in
> IA32 just accidentally triggered by some magic value in a register ;) As
> I said in the first sentence it seems (to me) "most likely" to be a bug
> in the hypervisor, but there is only one way to find out where it is for
> sure...

I think it's most likely to be a dom0 bug, specifically a bug in one of
the backend drivers.  The common failure mode which causes symptoms like
this is when a granted page (=a domU page mapped into dom0) is released
back into dom0's heap and reused as general memory while still being
under the control of the domU.

However, given that the domU hasn't got any devices assigned to it aside
from the console, none of the backends should be coming into play.  It
might be a more general problem with the privcmd interface.

Alternatively, I suppose, the domain builder could end up using some of
dom0's pages to construct the domU without properly freeing them, which
would suggest a bug in the balloon driver.

I can't think of a Xen failure-mode which would cause these symptoms
without also being massively obvious in other cases.  (But "I can't
think of..." is where all the best bugs hide.)
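
The "granted page released while still mapped" failure mode described
above can be pictured with a toy model. This is purely illustrative and
assumes nothing about the real grant-table or allocator code: the
ToyAllocator class and simulate_stale_grant function are hypothetical
names, and a Python bytearray stands in for a 4 KiB page.

```python
PAGE_SIZE = 4096


class ToyAllocator:
    """Hypothetical stand-in for dom0's page allocator."""

    def __init__(self):
        self.free_pages = []

    def alloc(self):
        # Reuse a freed page if one exists, otherwise create a new one.
        page = self.free_pages.pop() if self.free_pages else bytearray(PAGE_SIZE)
        page[:] = bytes(PAGE_SIZE)  # zero the page in place on allocation
        return page

    def free(self, page):
        self.free_pages.append(page)


def simulate_stale_grant():
    heap = ToyAllocator()
    granted = heap.alloc()    # page shared between dom0 and the guest
    guest_view = granted      # the guest keeps this mapping (an alias)
    heap.free(granted)        # the bug: released while still mapped by the guest
    io_buffer = heap.alloc()  # dom0 reuses the same page as a zeroed I/O buffer
    guest_view[0:4] = b"\xde\xad\xbe\xef"  # a late guest write lands in dom0's buffer
    return bytes(io_buffer[:8])


print(simulate_stale_grant())
```

In this model the buffer that dom0 believes to be all zeros ends up
holding the guest's bytes, which is the kind of symptom the report shows.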

     J


Daniel Stodden

2010-Mar-08 23:22 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On Sun, 2010-03-07 at 11:12 -0500, Pasi Kärkkäinen wrote:
> On Sun, Mar 07, 2010 at 02:39:09PM +0000, Keir Fraser wrote:
> > On 07/03/2010 14:36, "Pasi Kärkkäinen" <pasik@iki.fi> wrote:
> >
> > >> Tried a few times and no luck reproducing so far. I hope some other people
> > >> on the list also will give it a go, since it's so easy to try it out.
> > >>
> > >
> > > I'm able to reproduce this with xen/master 2.6.31.6 dom0 kernel (from
> > > 2010-02-20),
> > > but I'm not able to reproduce it with the current xen/stable 2.6.32.9.
> > >
> > > I'll try with the most recent 2.6.31.6 dom0 kernel as well..
> >
> > Thanks Pasi!
> >
>
> It seems to happen with the latest xen/master 2.6.31.6 as well!

Does this look to you like we're corrupting memory or on-disk storage?

E.g. does a
$ dd if=/dev/zero bs=1M | hexdump -C 
have the same issue?

I have some initial trouble with the idea that zero.read() in a PV domU
somehow unlearned to scrub a 1M user buffer.

Thanks,
Daniel






Joanna Rutkowska

2010-Mar-08 23:23 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 03/09/2010 12:12 AM, Jeremy Fitzhardinge wrote:
> I think it's most likely to be a dom0 bug, specifically a bug in one of
> the backend drivers.  The common failure mode which causes symptoms like
> this is when a granted page (=a domU page mapped into dom0) is released
> back into dom0's heap and reused as general memory while still being
> under the control of the domU.
>
> However, given that the domU hasn't got any devices assigned to it aside
> from the console, none of the backends should be coming into play.  It
> might be a more general problem with the privcmd interface.
>
> Alternatively, I suppose, the domain builder could end up using some of
> dom0's pages to construct the domU without properly freeing them, which
> would suggest a bug in the balloon driver.
>
> I can't think of a Xen failure-mode which would cause these symptoms
> without also being massively obvious in other cases.  (But "I can't
> think of..." is where all the best bugs hide.)

But the corruptions always happen in 32-byte chunks, which might
suggest it's not a page-related problem (e.g. wrongly re-used page), as
in that case we would be observing (at least sometimes) much bigger
chunks of corrupted data, I think.

The reason why I still believe it's a hypervisor related thing is that
I'm currently using the very *same* Dom0 kernel (very recent
xen/stable-2.6.31) with Xen 3.4.2 and the system is damn stable. And I
really mean extensive use with 5-7 VMs running all the time doing
various things from Web browsing to kernel building.

If I was to make an educated guess I would say it's something related to
some interrupt handling, i.e. Xen mishandling it, e.g. the handler is
writing out-of-buffer somewhere and it just happens to land in the Dom0
fs buffer used by e.g. dd operation.
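
One property of the posted dump that bears on the page-versus-chunk
question can be checked with a little arithmetic: where each corrupted
chunk starts relative to a 4 KiB boundary in the file. The sketch below
only restates the offsets from the xxd output in the original report;
the 4096-byte page size is an assumption, and the result is an
observation about that one dump, not a diagnosis of which layer is at
fault.

```python
PAGE_SIZE = 4096  # assumed page size

# Start offsets of the seven corrupted chunks in the original report's dump.
reported_offsets = [
    0x4593a000, 0xa5df0000, 0xceecb000, 0x148432000,
    0x1c548b000, 0x225d45000, 0x242056000,
]


def offset_within_page(offset, page_size=PAGE_SIZE):
    """Return how far into its 4 KiB span of the file an offset falls."""
    return offset % page_size


for off in reported_offsets:
    print(f"{off:#x}: page {off // PAGE_SIZE}, "
          f"offset within page {offset_within_page(off)}")
```

Every offset in the list ends in 0x000, so each reported chunk begins
exactly on a 4 KiB boundary of the file and covers its first 32 bytes.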

joanna.




Joanna Rutkowska

2010-Mar-08 23:30 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 03/09/2010 12:22 AM, Daniel Stodden wrote:
> On Sun, 2010-03-07 at 11:12 -0500, Pasi Kärkkäinen wrote:
>> On Sun, Mar 07, 2010 at 02:39:09PM +0000, Keir Fraser wrote:
>>> On 07/03/2010 14:36, "Pasi Kärkkäinen" <pasik@iki.fi> wrote:
>>>
>>>>> Tried a few times and no luck reproducing so far. I hope some other people
>>>>> on the list also will give it a go, since it's so easy to try it out.
>>>>>
>>>>
>>>> I'm able to reproduce this with xen/master 2.6.31.6 dom0 kernel (from
>>>> 2010-02-20),
>>>> but I'm not able to reproduce it with the current xen/stable 2.6.32.9.
>>>>
>>>> I'll try with the most recent 2.6.31.6 dom0 kernel aswell..
>>>
>>> Thanks Pasi!
>>>
>>
>> It seems to happen with the latest xen/master 2.6.31.6 aswell!
>
> Does this look to you like we're corrupting memory or on-disk storage?
>
> E.g. does a
> $ dd if=/dev/zero bs=1M | hexdump -C
> have the same issue?

I think there might be a chance that the above executes correctly, even
if we have memory corruption -- this might be e.g. because the actual
"dest" buffer here would be much smaller than the fs cache buffer used
when we copy onto disk. And so our small dest buffer might just not be
so likely to be hit with this presumably random corruption.

Perhaps dd'ing onto /dev/shm would be a better way to check this?

joanna.




Daniel Stodden

2010-Mar-08 23:32 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On Sat, 2010-03-06 at 05:12 -0500, Joanna Rutkowska wrote:
> There is a nasty data corruption problem most likely allowed by a bug in
> the Xen 4.0.0-x hypervisors.

Joanna, Pasi, which storage backend is hosting your fs's /tmp?

Daniel



Jeremy Fitzhardinge

2010-Mar-08 23:41 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 03/08/2010 03:23 PM, Joanna Rutkowska wrote:
> But the corruptions always happen in 32-byte chunks, which might
> suggest it's not a page-related problem (e.g. wrongly re-used page), as
> in that case we would be observing (at least sometimes) much bigger
> chunks of corrupted data, I think.
>

Given that the domU doesn't have any devices or much going on, it could
easily be corrupting memory in only small amounts.

> The reason why I still believe it's a hypervisor-related thing is that
> I'm currently using the very *same* Dom0 kernel (very recent
> xen/stable-2.6.31) with Xen 3.4.2 and the system is damn stable. And I
> really mean extensive use with 5-7 VMs running all the time doing
> various things from Web browsing to kernel building.
>

OK, it's always good to get some positive feedback.

> If I was to make an educated guess I would say it's something related to
> some interrupt handling, i.e. Xen mishandling it, e.g. the handler is
> writing out-of-buffer somewhere and it just happens to land in the Dom0
> fs buffer used by e.g. dd operation.
>

It would be interesting to see what happens if you write the file with 
the test domain paused (xm pause ...).  If the corruption continues, 
then it is almost certainly Xen.  If it stops, then it either means the 
corruption was caused by pages inappropriately shared between dom0 and 
domU, or something like vcpu context switch is corrupting memory (which 
would be very sad).

     J


Daniel Stodden

2010-Mar-08 23:46 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On Mon, 2010-03-08 at 18:37 -0500, Joanna Rutkowska wrote:
> On 03/09/2010 12:32 AM, Daniel Stodden wrote:
> > On Sat, 2010-03-06 at 05:12 -0500, Joanna Rutkowska wrote:
> >> There is a nasty data corruption problem most likely allowed by a bug in
> >> the Xen 4.0.0-x hypervisors.
> >
> > Joanna, Pasi, which storage backend is hosting your fs's /tmp?
> >
>
> What exactly do you mean? I run the tests in Dom0. Are you asking about
> the fs?

Uh, oh, I missed the part in your description where you switched back to
dom0 after the domU thing.

Daniel



Joanna Rutkowska

2010-Mar-08 23:48 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 03/09/2010 12:41 AM, Jeremy Fitzhardinge wrote:
> On 03/08/2010 03:23 PM, Joanna Rutkowska wrote:
>> But the corruptions always happen in 32-byte chunks, which might
>> suggest it's not a page-related problem (e.g. wrongly re-used page), as
>> in that case we would be observing (at least sometimes) much bigger
>> chunks of corrupted data, I think.
>>
>
> Given that the domU doesn't have any devices or much going on, it could
> easily be corrupting memory in only small amounts.
>

But see, before I tried this with such a small dummy do-nothing DomU
(which I did for the purpose of reporting to xen-devel), I experienced
very similar corruption when running regular VMs, i.e. with normal Linux
and all the usual apps inside them. Same pattern of corruption.

>> The reason why I still believe it's a hypervisor-related thing is that
>> I'm currently using the very *same* Dom0 kernel (very recent
>> xen/stable-2.6.31) with Xen 3.4.2 and the system is damn stable. And I
>> really mean extensive use with 5-7 VMs running all the time doing
>> various things from Web browsing to kernel building.
>>
>
> OK, it's always good to get some positive feedback.
>

At least one full-time user of the pvops kernel ;)

>> If I was to make an educated guess I would say it's something related to
>> some interrupt handling, i.e. Xen mishandling it, e.g. the handler is
>> writing out-of-buffer somewhere and it just happens to land in the Dom0
>> fs buffer used by e.g. dd operation.
>>
>
> It would be interesting to see what happens if you write the file with
> the test domain paused (xm pause ...).  If the corruption continues,
> then it is almost certainly Xen.

Right.

> If it stops, then it either means the
> corruption was caused by pages inappropriately shared between dom0 and
> domU, or something like vcpu context switch is corrupting memory (which
> would be very sad).
>
Unfortunately, I cannot do any more tests. We have downgraded all our
test machines to Xen 3.4.2 and are using them for other things now. Sorry.

joanna.




Daniel Stodden

2010-Mar-08 23:52 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On Mon, 2010-03-08 at 18:30 -0500, Joanna Rutkowska wrote:
> On 03/09/2010 12:22 AM, Daniel Stodden wrote:
> > On Sun, 2010-03-07 at 11:12 -0500, Pasi Kärkkäinen wrote:
> >> On Sun, Mar 07, 2010 at 02:39:09PM +0000, Keir Fraser wrote:
> >>> On 07/03/2010 14:36, "Pasi Kärkkäinen" <pasik@iki.fi> wrote:
> >>>
> >>>>> Tried a few times and no luck reproducing so far. I hope some other people
> >>>>> on the list also will give it a go, since it's so easy to try it out.
> >>>>>
> >>>>
> >>>> I'm able to reproduce this with xen/master 2.6.31.6 dom0 kernel (from
> >>>> 2010-02-20),
> >>>> but I'm not able to reproduce it with the current xen/stable 2.6.32.9.
> >>>>
> >>>> I'll try with the most recent 2.6.31.6 dom0 kernel as well..
> >>>
> >>> Thanks Pasi!
> >>>
> >>
> >> It seems to happen with the latest xen/master 2.6.31.6 as well!
> >
> > Does this look to you like we're corrupting memory or on-disk storage?
> >
> > E.g. does a
> > $ dd if=/dev/zero bs=1M | hexdump -C
> > have the same issue?
> >
>
> I think there might be a chance that the above executes correctly, even
> if we have memory corruption -- this might be e.g. because the actual
> "dest" buffer here would be much smaller than the fs cache buffer used
> when we copy onto disk. And so our small dest buffer might just not be
> so likely to be hit with this presumably random corruption.
>
> Perhaps dd'ing onto /dev/shm would be a better way to check this?

I agree that a negative doesn't mean much. I'm just poking around there
because the positive would have mattered: If we still get to see it,
we're out of the storage discussion and can focus on memory corruption.

Daniel



Joanna Rutkowska

2010-Mar-08 23:56 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 03/09/2010 12:52 AM, Daniel Stodden wrote:
> On Mon, 2010-03-08 at 18:30 -0500, Joanna Rutkowska wrote:
>> On 03/09/2010 12:22 AM, Daniel Stodden wrote:
>>> On Sun, 2010-03-07 at 11:12 -0500, Pasi Kärkkäinen wrote:
>>>> On Sun, Mar 07, 2010 at 02:39:09PM +0000, Keir Fraser wrote:
>>>>> On 07/03/2010 14:36, "Pasi Kärkkäinen" <pasik@iki.fi> wrote:
>>>>>
>>>>>>> Tried a few times and no luck reproducing so far. I hope some other people
>>>>>>> on the list also will give it a go, since it's so easy to try it out.
>>>>>>>
>>>>>>
>>>>>> I'm able to reproduce this with xen/master 2.6.31.6 dom0 kernel (from
>>>>>> 2010-02-20),
>>>>>> but I'm not able to reproduce it with the current xen/stable 2.6.32.9.
>>>>>>
>>>>>> I'll try with the most recent 2.6.31.6 dom0 kernel as well..
>>>>>
>>>>> Thanks Pasi!
>>>>>
>>>>
>>>> It seems to happen with the latest xen/master 2.6.31.6 as well!
>>>
>>> Does this look to you like we're corrupting memory or on-disk storage?
>>>
>>> E.g. does a
>>> $ dd if=/dev/zero bs=1M | hexdump -C
>>> have the same issue?
>>>
>>
>> I think there might be a chance that the above executes correctly, even
>> if we have memory corruption -- this might be e.g. because the actual
>> "dest" buffer here would be much smaller than the fs cache buffer used
>> when we copy onto disk. And so our small dest buffer might just not be
>> so likely to be hit with this presumably random corruption.
>>
>> Perhaps dd'ing onto /dev/shm would be a better way to check this?
>
> I agree that a negative doesn't mean much. I'm just poking around there
> because the positive would have mattered: If we still get to see it,
> we're out of the storage discussion and can focus on memory corruption.
>
If you're thinking about a potential Dom0 disk-driver problem, then I
think we can rule this out. This is because I have tried this on both
encrypted and non-encrypted filesystems, but the pattern of corruptions
was exactly the same. If the disk driver was feeding LUKS (the crypto
driver) with wrong data, the corruptions would definitely look
different.

I also tried ext4 and ext3 filesystems, but same results.

j.





James Harper

2010-Mar-09 00:18 UTC

RE: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

> > I can't think of a Xen failure-mode which would cause these symptoms
> > without also being massively obvious in other cases.  (But "I can't
> > think of..." is where all the best bugs hide.)
> >
>
> But the corruptions always happen in 32-byte chunks, which might
> suggest it's not a page-related problem (e.g. wrongly re-used page), as
> in that case we would be observing (at least sometimes) much bigger
> chunks of corrupted data, I think.

Based on your hex dump output, it appears to be the first 32 bytes of a
page, which does lend itself to the idea that it's a page allocated for
something with only the first 32 bytes used.
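
The alignment observation can be checked mechanically. The sketch below is only an illustration: it assumes 4 KiB pages, and the two offsets are the chunk start addresses from the xxd output in the original report. Since the page cache indexes file data in page-sized units, a chunk starting at a multiple of the page size in the file also sits at the start of a page-cache page.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB x86 pages

# Start offsets of the corrupted chunks, copied from the xxd output
# in the original report.
offsets = [0x4593a000, 0xa5df0000]


def page_offset(file_offset, page_size=PAGE_SIZE):
    """Return the position of file_offset within its page."""
    return file_offset % page_size


for off in offsets:
    print(f"{off:#010x}: offset within page = {page_offset(off)}")
```

An offset within the page of 0 for every chunk is what "the first 32 bytes of a page" predicts.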

You've stated that you are no longer set up to reproduce it, which is
unfortunate. If you find yourself in a position to try it again there
are a bunch of things you could try to figure out on which end of the copy
the problem lies.

James



Joanna Rutkowska

2010-Mar-09 00:20 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 03/09/2010 01:18 AM, James Harper wrote:
>>> I can't think of a Xen failure-mode which would cause these symptoms
>>> without also being massively obvious in other cases.  (But "I can't
>>> think of..." is where all the best bugs hide.)
>>>
>>
>> But the corruptions always happen in 32-byte chunks, which might
>> suggest it's not a page-related problem (e.g. wrongly re-used page), as
>> in that case we would be observing (at least sometimes) much bigger
>> chunks of corrupted data, I think.
>
> Based on your hex dump output, it appears to be the first 32 bytes of a
> page, which does lend itself to the idea that it's a page allocated for
> something with only the first 32 bytes used.
>
> You've stated that you are no longer set up to reproduce it, which is
> unfortunate. If you find yourself in a position to try it again there
> are a bunch of things you could try to figure out on which end of the copy
> the problem lies.
>

But everybody can try it with the kernels I provided, right? I'm not the
only person who can do this...

j.




Daniel Stodden

2010-Mar-09 00:33 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On Mon, 2010-03-08 at 18:56 -0500, Joanna Rutkowska wrote:
> On 03/09/2010 12:52 AM, Daniel Stodden wrote:
> > On Mon, 2010-03-08 at 18:30 -0500, Joanna Rutkowska wrote:
> >> On 03/09/2010 12:22 AM, Daniel Stodden wrote:
> >>> On Sun, 2010-03-07 at 11:12 -0500, Pasi Kärkkäinen wrote:
> >>>> On Sun, Mar 07, 2010 at 02:39:09PM +0000, Keir Fraser wrote:
> >>>>> On 07/03/2010 14:36, "Pasi Kärkkäinen" <pasik@iki.fi> wrote:
> >>>>>
> >>>>>>> Tried a few times and no luck reproducing so far. I hope some other people
> >>>>>>> on the list also will give it a go, since it's so easy to try it out.
> >>>>>>>
> >>>>>>
> >>>>>> I'm able to reproduce this with xen/master 2.6.31.6 dom0 kernel (from
> >>>>>> 2010-02-20),
> >>>>>> but I'm not able to reproduce it with the current xen/stable 2.6.32.9.
> >>>>>>
> >>>>>> I'll try with the most recent 2.6.31.6 dom0 kernel as well..
> >>>>>
> >>>>> Thanks Pasi!
> >>>>>
> >>>>
> >>>> It seems to happen with the latest xen/master 2.6.31.6 as well!
> >>>
> >>> Does this look to you like we're corrupting memory or on-disk storage?
> >>>
> >>> E.g. does a
> >>> $ dd if=/dev/zero bs=1M | hexdump -C
> >>> have the same issue?
> >>>
> >>
> >> I think there might be a chance that the above executes correctly, even
> >> if we have memory corruption -- this might be e.g. because the actual
> >> "dest" buffer here would be much smaller than the fs cache buffer used
> >> when we copy onto disk. And so our small dest buffer might just not be
> >> so likely to be hit with this presumably random corruption.
> >>
> >> Perhaps dd'ing onto /dev/shm would be a better way to check this?
> >
> > I agree that a negative doesn't mean much. I'm just poking around there
> > because the positive would have mattered: If we still get to see it,
> > we're out of the storage discussion and can focus on memory corruption.
> >
>
> If you're thinking about a potential Dom0 disk-driver problem, then I
> think we can rule this out. This is because I have tried this on both
> encrypted and non-encrypted filesystems, but the pattern of corruptions
> was exactly the same. If the disk driver was feeding LUKS (the crypto
> driver) with wrong data, the corruptions would definitely look
> different.

I'm not considering the device drivers, rather everything on top of
that. Also I didn't understand the issue is present in dom0 by the time
I wrote that.

Still, it'd help to figure out if the corruption came in on the way up
from /dev/zero or down to the disk.

That dd crossing the page cache means it's still got a long way to go.

For now I'm most of all glad to hear it's not in the backends, so far
thanks for that :o)

Daniel



Pasi Kärkkäinen

2010-Mar-09 08:25 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On Mon, Mar 08, 2010 at 03:22:32PM -0800, Daniel Stodden wrote:
> On Sun, 2010-03-07 at 11:12 -0500, Pasi Kärkkäinen wrote:
> > On Sun, Mar 07, 2010 at 02:39:09PM +0000, Keir Fraser wrote:
> > > On 07/03/2010 14:36, "Pasi Kärkkäinen" <pasik@iki.fi> wrote:
> > >
> > > >> Tried a few times and no luck reproducing so far. I hope some other people
> > > >> on the list also will give it a go, since it's so easy to try it out.
> > > >>
> > > >
> > > > I'm able to reproduce this with xen/master 2.6.31.6 dom0 kernel (from
> > > > 2010-02-20),
> > > > but I'm not able to reproduce it with the current xen/stable 2.6.32.9.
> > > >
> > > > I'll try with the most recent 2.6.31.6 dom0 kernel as well..
> > >
> > > Thanks Pasi!
> > >
> >
> > It seems to happen with the latest xen/master 2.6.31.6 as well!
>
> Does this look to you like we're corrupting memory or on-disk storage?
>
> E.g. does a
> $ dd if=/dev/zero bs=1M | hexdump -C
> have the same issue?
>
> I have some initial trouble with the idea that zero.read() in a PV domU
> somehow unlearned to scrub a 1M user buffer.
>

My setup:

Dom0 distro: Fedora 12
Xen hypervisor: 4.0.0-rc5 x86_64
Dom0 kernel: latest xen/master 2.6.31.6 x86_64

Xen hypervisor boot options in grub.conf: dom0_mem=1G loglvl=all guest_loglvl=all
Dom0 kernel boot options in grub.conf: ro root=/dev/mapper/vg_f12test-lv01 SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=fi nomodeset

Steps to reproduce the bug:

1. Reboot the system
2. Start a dummy guest using the domU kernel (rpm) provided in the original
bug report:

# xm create -c /dev/null memory=400 kernel="vmlinuz-2.6.31.9-1.2.82.xendom0.fc12.x86_64" extra="rootdelay=1000"

3. run in dom0:

# dd if=/dev/zero of=test bs=1M count=10000 && sync && sync && xxd test | grep -v "0000 0000 0000 0000 0000 0000 0000 0000"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 233.621 s, 44.9 MB/s

6039000: 1000 0000 0000 0000 c0b0 ffff 0300 0000  ................
6039010: 1d5e 06ab b502 0000 1eb2 27b5 ff00 0000  .^........'.....
2dfe9000:3000 0000 0000 0000 a43c 7687 0e00 0000  0........<v.....
2dfe9010:cfc1 ba64 b902 0000 1eb2 27b5 ff00 0000  ...d......'.....
50685000:4800 0000 0000 0000 f954 0f6d 1600 0000  H........T.m....
50685010:5b1d 0230 bc02 0000 1eb2 27b5 ff00 0000  [..0......'.....
743f9000:6200 0000 0000 0000 e0e2 1ffb 1e00 0000  b...............
743f9010:acc3 e436 bf02 0000 1eb2 27b5 ff00 0000  ...6......'.....

As you can see, very easy to reproduce.
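
For anyone repeating the check on a box without xxd, the same scan can be sketched in Python. This is only an illustration of the xxd + grep step above, not part of the original test; the function name `nonzero_rows` and the 1 MiB read size are assumptions made for the example.

```python
ROW = 16  # bytes per reported row, matching one line of xxd output


def nonzero_rows(path, block_size=1 << 20):
    """Yield (file_offset, row_bytes) for every 16-byte row that is not all zeros.

    block_size must be a multiple of ROW so that rows stay aligned.
    """
    zero_block = bytes(block_size)
    offset = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            # Fast path: skip blocks that are entirely zero.
            if block != zero_block[:len(block)]:
                for i in range(0, len(block), ROW):
                    row = block[i:i + ROW]
                    if any(row):
                        yield offset + i, row
            offset += len(block)


# Example use against the file written by dd:
#   for off, row in nonzero_rows("test"):
#       print(f"{off:08x}: {row.hex()}")
```

On a clean file the loop prints nothing, just like the grep.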

Now, I "xm destroy" the domU, run "sync" and "echo 3 > /proc/sys/vm/drop_caches"
in dom0, and then re-start the dummy domU, and try the other method as
requested by Daniel:

# dd if=/dev/zero bs=1M | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
^C20984+0 records in
20983+0 records out
22002270208 bytes (22 GB) copied, 206.353 s, 107 MB/s

So that method didn't show the corruption..
Now immediately after (no domU restart) let's try to reproduce again
with the dd + xxd method:

# dd if=/dev/zero of=test bs=1M count=10000 && sync && sync && xxd test | grep -v "0000 0000 0000 0000 0000 0000 0000 0000"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 258.85 s, 40.5 MB/s
7dc2000: 5a02 0000 0000 0000 760d d90c c500 0000  Z.......v.......
7dc2010: 3785 8def 8003 0000 1eb2 27b5 ff00 0000  7.........'.....
2dc0d000:7802 0000 0000 0000 ec70 d8eb ce00 0000  x........p......
2dc0d010:6fb9 a66d 8403 0000 1eb2 27b5 ff00 0000  o..m......'.....

So it seems to be related to disk IO in dom0? 

-- Pasi



Jan Beulich

2010-Mar-09 09:37 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

>>> Pasi Kärkkäinen <pasik@iki.fi> 09.03.10 09:25 >>>
>6039000: 1000 0000 0000 0000 c0b0 ffff 0300 0000  ................
>6039010: 1d5e 06ab b502 0000 1eb2 27b5 ff00 0000  .^........'.....
>2dfe9000:3000 0000 0000 0000 a43c 7687 0e00 0000  0........<v.....
>2dfe9010:cfc1 ba64 b902 0000 1eb2 27b5 ff00 0000  ...d......'.....
>50685000:4800 0000 0000 0000 f954 0f6d 1600 0000  H........T.m....
>50685010:5b1d 0230 bc02 0000 1eb2 27b5 ff00 0000  [..0......'.....
>743f9000:6200 0000 0000 0000 e0e2 1ffb 1e00 0000  b...............
>743f9010:acc3 e436 bf02 0000 1eb2 27b5 ff00 0000  ...6......'.....
>...
>7dc2000: 5a02 0000 0000 0000 760d d90c c500 0000  Z.......v.......
>7dc2010: 3785 8def 8003 0000 1eb2 27b5 ff00 0000  7.........'.....
>2dc0d000:7802 0000 0000 0000 ec70 d8eb ce00 0000  x........p......
>2dc0d010:6fb9 a66d 8403 0000 1eb2 27b5 ff00 0000  o..m......'.....

How about these being vcpu_time_info structures? The fields
appear to all make sense. The only thing not matching this would
be a few differently looking corruption entries sent earlier by Joanna,
so this may not be the only thing. But it would explain why with 3.4.2
the issue is not present.
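
To make the field-by-field reading concrete, here is a minimal sketch that unpacks one of the 32-byte chunks quoted above. It assumes the vcpu_time_info layout from Xen's public headers (version, pad0, tsc_timestamp, system_time, tsc_to_system_mul, tsc_shift, 3 bytes of padding) and a little-endian x86 host; the helper names `xxd_bytes` and `decode_chunk` are made up for this example.

```python
import struct

# Field layout of struct vcpu_time_info as declared in Xen's public headers:
#   uint32_t version; uint32_t pad0; uint64_t tsc_timestamp;
#   uint64_t system_time; uint32_t tsc_to_system_mul;
#   int8_t tsc_shift; int8_t pad1[3];   (32 bytes, little-endian on x86)
VCPU_TIME_INFO = struct.Struct("<IIQQIb3x")


def xxd_bytes(line):
    """Extract the 16 data bytes from one line of xxd output."""
    _, _, rest = line.partition(":")
    # The first eight whitespace-separated tokens are the hex groups.
    return bytes.fromhex("".join(rest.split()[:8]))


def decode_chunk(first_line, second_line):
    """Interpret two consecutive xxd lines (32 bytes) as a vcpu_time_info."""
    raw = xxd_bytes(first_line) + xxd_bytes(second_line)
    version, _pad0, tsc, system_time, mul, shift = VCPU_TIME_INFO.unpack(raw)
    return {
        "version": version,
        "tsc_timestamp": tsc,
        "system_time_ns": system_time,
        "tsc_to_system_mul": mul,
        "tsc_shift": shift,
        # TSC deltas are scaled as ((delta << shift) * mul) >> 32
        "ns_per_tsc_tick": mul * 2.0 ** shift / 2 ** 32,
    }


chunk = decode_chunk(
    "6039000: 1000 0000 0000 0000 c0b0 ffff 0300 0000  ................",
    "6039010: 1d5e 06ab b502 0000 1eb2 27b5 ff00 0000  .^........'.....",
)
for name, value in chunk.items():
    print(name, value)
```

Under that assumed layout the trailing "1eb2 27b5 ff00 0000" shared by every chunk from Pasi's machine would be tsc_to_system_mul followed by a tsc_shift of -1, which are constant for a given CPU frequency, while the leading fields change from chunk to chunk.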

Jan


Jan Beulich

2010-Mar-09 10:15 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

>>> "Jan Beulich" <JBeulich@novell.com> 09.03.10 10:37 >>>
>How about these being vcpu_time_info structures? The fields
>appear to all make sense. The only thing not matching this would
>be a few differently looking corruption entries sent earlier by Joanna,
>so this may not be the only thing. But it would explain why with 3.4.2
>the issue is not present.

In particular I think the update_vcpu_system_time() invocation
in schedule() isn't right: since VCPUOP_register_vcpu_time_memory_area
takes a virtual address, this call must not happen before
context_switch().

And btw., 32-on-64 also seems to be broken for
VCPUOP_register_vcpu_time_memory_area (since 64-bit Xen reads
the full 64-bit field).

Jan



Keir Fraser

2010-Mar-09 10:15 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 09/03/2010 09:37, "Jan Beulich" <JBeulich@novell.com> wrote:

>> 7dc2000: 5a02 0000 0000 0000 760d d90c c500 0000  Z.......v.......
>> 7dc2010: 3785 8def 8003 0000 1eb2 27b5 ff00 0000  7.........'.....
>> 2dc0d000:7802 0000 0000 0000 ec70 d8eb ce00 0000  x........p......
>> 2dc0d010:6fb9 a66d 8403 0000 1eb2 27b5 ff00 0000  o..m......'.....
>
> How about these being vcpu_time_info structures? The fields
> appear to all make sense. The only thing not matching this would
> be a few differently looking corruption entries sent earlier by Joanna,
> so this may not be the only thing. But it would explain why with 3.4.2
> the issue is not present.

Pasi, can you try the attached patch (which simply stubs out the new
VCPUOP_register_vcpu_time_memory_area hypercall)? I'm pretty sure this is
it: just look at the implementation of __update_vcpu_system_time: when
v!=current it will write to a virtual address in v, using current's page
tables. This will happen on context switch dom0->domU for example.
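
The failure mode can be illustrated with a toy model. This is not Xen code: the `VCpu` class, the dictionary-based page tables and the frame numbers are all invented for the sketch, which only shows what goes wrong when an address registered by one vCPU is translated through another vCPU's page tables.

```python
PAGE = 4096  # assumed page size for this toy model


class VCpu:
    """Toy vCPU: a name, a page table, and an optional registered time area."""

    def __init__(self, name, page_table, time_area_va=None):
        self.name = name
        self.page_table = page_table      # virtual page number -> physical frame
        self.time_area_va = time_area_va  # guest-virtual address, as registered


def write_virtual(phys_mem, page_table, va, data):
    """Write data at virtual address va, translating through page_table."""
    frame = phys_mem[page_table[va // PAGE]]
    start = va % PAGE
    frame[start:start + len(data)] = data


def buggy_update(phys_mem, v, current, payload):
    """Mimics the suspected bug: v's address, but current's page tables."""
    write_virtual(phys_mem, current.page_table, v.time_area_va, payload)


def correct_update(phys_mem, v, payload):
    """Translates v's address through v's own page tables."""
    write_virtual(phys_mem, v.page_table, v.time_area_va, payload)


# Frame 1 stands in for a dom0 page-cache page, frame 2 for the domU's time area.
phys_mem = {1: bytearray(PAGE), 2: bytearray(PAGE)}
dom0 = VCpu("dom0", {5: 1})
domU = VCpu("domU", {5: 2}, time_area_va=5 * PAGE)

payload = bytes(range(1, 33))  # stand-in for a 32-byte vcpu_time_info
buggy_update(phys_mem, domU, dom0, payload)  # dom0 is still "current"

print("dom0 page corrupted:", phys_mem[1][:32] == payload)
print("domU time area untouched:", not any(phys_mem[2]))
```

In this model the 32 bytes land in whatever dom0 page happens to be mapped at the same virtual address, which would fit the page-cache corruption reported in dom0.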

A quite suitable fix for 4.0.0 is to leave the hypercall stubbed out imo.

 -- Keir




Keir Fraser

2010-Mar-09 10:17 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 09/03/2010 10:15, "Jan Beulich" <JBeulich@novell.com> wrote:

>>>> "Jan Beulich" <JBeulich@novell.com> 09.03.10 10:37 >>>
>> How about these being vcpu_time_info structures? The fields
>> appear to all make sense. The only thing not matching this would
>> be a few differently looking corruption entries sent earlier by Joanna,
>> so this may not be the only thing. But it would explain why with 3.4.2
>> the issue is not present.
>
> In particular I think the update_vcpu_system_time() invocation
> in schedule() isn't right - VCPUOP_register_vcpu_time_memory_area
> taking a virtual address, this call must not happen before
> context_switch().
>
> And btw., 32-on-64 also seems to be broken for
> VCPUOP_register_vcpu_time_memory_area (since 64-bit Xen reads
> the full 64-bit field).

Yeah, agreed. We'll stub it out for 4.0.0 I think. Things work quite okay
without it.

 -- Keir




Pasi Kärkkäinen

2010-Mar-09 10:25 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On Tue, Mar 09, 2010 at 10:15:45AM +0000, Keir Fraser wrote:
> On 09/03/2010 09:37, "Jan Beulich" <JBeulich@novell.com> wrote:
>
> >> 7dc2000: 5a02 0000 0000 0000 760d d90c c500 0000  Z.......v.......
> >> 7dc2010: 3785 8def 8003 0000 1eb2 27b5 ff00 0000  7.........'.....
> >> 2dc0d000:7802 0000 0000 0000 ec70 d8eb ce00 0000  x........p......
> >> 2dc0d010:6fb9 a66d 8403 0000 1eb2 27b5 ff00 0000  o..m......'.....
> >
> > How about these being vcpu_time_info structures? The fields
> > appear to all make sense. The only thing not matching this would
> > be a few differently looking corruption entries sent earlier by Joanna,
> > so this may not be the only thing. But it would explain why with 3.4.2
> > the issue is not present.
>
> Pasi, can you try the attached patch (which simply stubs out the new
> VCPUOP_register_vcpu_time_memory_area hypercall)?

Yeah, but there's no patch attached ;)

-- Pasi
 


Jan Beulich

2010-Mar-09 10:42 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

>>> Keir Fraser <keir.fraser@eu.citrix.com> 09.03.10 11:15 >>>
>Pasi, can you try the attached patch (which simply stubs out the new
>VCPUOP_register_vcpu_time_memory_area hypercall)?

No patch attached?

Jan



Keir Fraser

2010-Mar-09 10:43 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 09/03/2010 10:25, "Pasi Kärkkäinen" <pasik@iki.fi> wrote:

>>> How about these being vcpu_time_info structures? The fields
>>> appear to all make sense. The only thing not matching this would
>>> be a few differently looking corruption entries sent earlier by Joanna,
>>> so this may not be the only thing. But it would explain why with 3.4.2
>>> the issue is not present.
>>
>> Pasi, can you try the attached patch (which simply stubs out the new
>> VCPUOP_register_vcpu_time_memory_area hypercall)?
>
> Yeah, but there's no patch attached ;)

Good point. Attached now!

 K.




Pasi Kärkkäinen

2010-Mar-09 12:03 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On Tue, Mar 09, 2010 at 10:43:41AM +0000, Keir Fraser wrote:
> On 09/03/2010 10:25, "Pasi Kärkkäinen" <pasik@iki.fi> wrote:
>
> >>> How about these being vcpu_time_info structures? The fields
> >>> appear to all make sense. The only thing not matching this would
> >>> be a few differently looking corruption entries sent earlier by Joanna,
> >>> so this may not be the only thing. But it would explain why with 3.4.2
> >>> the issue is not present.
> >>
> >> Pasi, can you try the attached patch (which simply stubs out the new
> >> VCPUOP_register_vcpu_time_memory_area hypercall)?
> >
> > Yeah, but there's no patch attached ;)
>
> Good point. Attached now!
>

I grabbed the latest rc6-pre xen-unstable.hg (*), applied your patch, compiled,
copied xen-4.0.0-rc6-pre.gz to /boot and booted using it. I didn't
change anything else.

It seems the bug went away. No more corruption. I repeated the test three times.

-- Pasi

(*) earlier on this system I was using 4.0.0-rc5 built from src.rpm.



Jeremy Fitzhardinge

2010-Mar-09 23:28 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 03/09/2010 02:15 AM, Keir Fraser wrote:
> A quite suitable fix for 4.0.0 is to leave the hypercall stubbed out imo.
>

Agreed.

     J


Dan Magenheimer

2010-Mar-10 01:33 UTC

RE: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

Am I correct in reading between the lines that this means
that Xen 4.0 will not support fast (vsyscall) gettimeofday?

> -----Original Message-----
> From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
> Sent: Tuesday, March 09, 2010 4:29 PM
> To: Keir Fraser
> Cc: xen-devel@lists.xensource.com; Joanna Rutkowska; Pasi Kärkkäinen;
> Jan Beulich; Daniel Stodden
> Subject: Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0
>
> On 03/09/2010 02:15 AM, Keir Fraser wrote:
> > A quite suitable fix for 4.0.0 is to leave the hypercall stubbed out
> > imo.
> >
>
> Agreed.
>
>      J
>

Jeremy Fitzhardinge

2010-Mar-10 18:02 UTC

Re: [Xen-devel] Xen 4.0.0x allows for data corruption in Dom0

On 03/09/2010 05:33 PM, Dan Magenheimer wrote:
> Am I correct in reading between the lines that this means
> that Xen 4.0 will not support fast (vsyscall) gettimeofday?
>

Yep, unless someone comes up with a proper fix before release.

     J
