thr3ads.net - Xen devel - BUG: ext3 corruption in domU [Apr 2013]


Anthony Sheetz

2013-Apr-16 17:39 UTC

BUG: ext3 corruption in domU

(re-sending, first message seems to have gotten lost)

I was referred here by Ian Campbell ijc@hellion.org.uk from bugs.debian.org.

First, I'm happy to provide more information about this bug as
requested. I recognize not all relevant data has been collected yet.

Detailed information about this bug can be found at
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=705124.

The executive summary is: using a Debian Testing (7.0, Wheezy) dom0 with
LVM and full disk encryption and a Debian Stable (6.0, Squeeze) domU,
transferring large files via scp or rsync over Openswan results in data
corruption and, eventually, file system corruption. The culprit appears
to be full disk encryption, although that evidence may not be conclusive.

While I don't mind providing additional information, I'd hate to have
to repeat the information I've already provided to the Debian bug
hunting folks.

Thanks in advance for any help you can provide.

Ian Campbell

2013-Apr-17 13:00 UTC


Re: BUG: ext3 corruption in domU

On Tue, 2013-04-16 at 18:39 +0100, Anthony Sheetz wrote:
> (re-sending, first message seems to have gotten lost)
>
> I was referred here by Ian Campbell ijc@hellion.org.uk from bugs.debian.org.

I'm here too (different hat ;-)), thanks for posting it here. I've added
some people who know about the block stuff to the CC.

Guys, my suspicion is that barriers issued by ext3 inside the guest
aren't making it all the way down the
ext3->blkfront->blkback->lvm->dm-crypt->disk chain, leading the
filesystem to eventually corrupt itself.

The issue seems to relate to the use of dm-crypt, since
ext3->blkfront->blkback->lvm->disk is reported to work fine.

However, there is no problem with the local dom0 ext3 root filesystem,
which is also in the same LVM VG on the crypt device (i.e.
ext3->lvm->dm-crypt->disk), so it's not purely a dm-crypt issue. I
figure something is up at the blkfront->blkback link which causes the
barriers that blkback injects into the block subsystem either not to
make it to the dm-crypt layer or not to DTRT once they arrive.

I'm not really sure how to proceed (or how to ask Anthony to proceed)
with verifying any part of that hypothesis, though.

ISTR issues with old vs new style barriers, or barriers with no data in
them or something; could this be related to that? (Or am I thinking of
DISCARD?)

The issue was initially reported with a Squeeze (Jeremy 2.6.32 tree) domU
on a Wheezy (mainline 3.2) dom0, but IIRC it has also been repeated with
Wheezy on Wheezy now, so this isn't cross-version confusion about barrier
semantics AFAICT.

Ian.
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

Anthony Sheetz

2013-Apr-22 12:22 UTC


Re: BUG: ext3 corruption in domU

I realize folks are pretty busy, but we're still interested in getting
this problem solved, and I want to be sure it's not lost in the shuffle.
Any chance of getting some attention for it?

On Wed, Apr 17, 2013 at 9:00 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> [...]

Ian Campbell

2013-Apr-22 12:26 UTC


Re: BUG: ext3 corruption in domU

Konrad is on vacation this week, so it'll probably be next week before
this gets looked at by him.

Ian.

On Mon, 2013-04-22 at 13:22 +0100, Anthony Sheetz wrote:
> [...]

Anthony Sheetz

2013-May-06 12:46 UTC


Re: BUG: ext3 corruption in domU

I would once again like to request help with a bug in Xen. Repeating my
message from April 16th:

First, I'm happy to provide more information about this bug as
requested. I recognize not all relevant data has been collected yet.

Detailed information about this bug can be found at
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=705124.

The executive summary is: using a Debian Testing (7.0, Wheezy) dom0 with
LVM and full disk encryption and a Debian Stable (6.0, Squeeze) domU,
transferring large files via scp or rsync over Openswan results in data
corruption and, eventually, file system corruption. The culprit appears
to be full disk encryption, although that evidence may not be conclusive.

While I don't mind providing additional information, I'd hate to have
to repeat the information I've already provided to the Debian bug
hunting folks.

Thanks in advance for any help you can provide.

On Tue, Apr 16, 2013 at 1:39 PM, Anthony Sheetz <sheetzam@inspire.com> wrote:
> [...]

Konrad Rzeszutek Wilk

2013-May-22 20:10 UTC


Re: BUG: ext3 corruption in domU

On Mon, Apr 22, 2013 at 01:26:34PM +0100, Ian Campbell wrote:
> Konrad is on vacation this week, so it'll probably be next week before
> this gets looked at by him.

And I finally got to this email in my 'vacation-mbox'.

> [...]
> > > ISTR issues with old vs new style barriers or barriers with no data in
> > > them or something, could this be related to that? (or am I thinking of
> > > DISCARD?)

You are using two different kernel versions. The 2.6.32 domU only uses
WRITE_BARRIERs, which have been completely eliminated in the 3.2 kernels;
the mechanism they use instead is called 'WRITE_FLUSH'. The 3.2 kernel
has a patch:

commit 29bde093787f3bdf7b9b4270ada6be7c8076e36b
Author: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date:   Mon Oct 10 00:42:22 2011 -0400

    xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.

which emulates the barrier request by draining all of the outstanding
I/Os and then sending the WRITE_FLUSH.

But it looks like you are hitting an issue here. Just to make sure that
is the case, what happens if you use the _same_ kernel in both dom0 and
domU? Does it work then?
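The drain-then-flush ordering described above can be pictured with a small userspace analogy. This is a sketch only, not blkback code: background shell writes stand in for in-flight requests, `wait` for draining them, and `sync` for the final flush; the function name `emulate_barrier_write` is hypothetical.

```shell
#!/bin/sh
# Userspace analogy only, not the blkback implementation.
# emulate_barrier_write is a hypothetical name used for illustration.
# Usage: emulate_barrier_write DIR NAME...
emulate_barrier_write() {
    dir=$1
    shift
    # Issue each write in the background, like requests still in flight.
    for name in "$@"; do
        printf '%s\n' "$name" > "$dir/$name" &
    done
    # Drain: do not continue until every outstanding write has finished.
    wait
    # Flush: ask the kernel to write dirty data back to storage.
    sync
}
```

The ordering is the point of the emulation: a flush issued before the drain would say nothing about writes that were still in flight at that moment.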

Anthony Sheetz

2013-May-23 18:19 UTC


Re: BUG: ext3 corruption in domU

On Wed, May 22, 2013 at 4:10 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> [...]
> But it looks like you are hitting an issue here. Just to make sure
> that is the case, what happens if you use the _same_ kernel in both dom0 and
> domU? Does it work then?
>
First, thank you so much for getting back to me; it's really appreciated.
At this point I've forgotten whether I did this with Wheezy on Wheezy, and
what the result was. I'll have to test using the 3.2 kernel on the Debian
Squeeze domU and get back to you. I should be able to do that early next
week.

Konrad Rzeszutek Wilk

2013-May-24 14:20 UTC


Re: BUG: ext3 corruption in domU

On Thu, May 23, 2013 at 02:19:50PM -0400, Anthony Sheetz wrote:
> [...]
> I'll have to test using the 3.2 kernel on the domU Debian Squeeze and
> get back to you. I should be able to do that early next week.

Thank you. When you do this test, could you also provide the
'xenstore-ls' output from dom0, and the 'dmesg' output from the guest
(or at least the output of 'xl console <guest> | tee /tmp/log')? That
would give me an idea of whether the frontend and backend have the right
negotiation parameters.

Have a good weekend!
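For the negotiation check requested above, the backend keys of interest are `feature-barrier` and `feature-flush-cache`. The filter below is a hedged sketch that assumes the indented `key = "value"` layout printed by plain `xenstore-ls`; the function name `show_blk_features` is hypothetical, and because indentation is discarded it does not say which vbd each key belongs to.

```shell
#!/bin/sh
# Hypothetical helper: read xenstore-ls output on stdin and print the
# block-device cache-control features as "key=value", one per line.
# Assumes lines of the form:   <indent>feature-barrier = "1"
show_blk_features() {
    sed -nE 's/^[[:space:]]*(feature-barrier|feature-flush-cache) = "([^"]*)"$/\1=\2/p'
}
```

In dom0 this would be run as `xenstore-ls | show_blk_features`; using only sed keeps it usable on a minimal dom0 without extra tools.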

Roger Pau Monné

2013-May-24 17:48 UTC


Re: BUG: ext3 corruption in domU

On 17/04/13 15:00, Ian Campbell wrote:
> [...]

Hello,

I've been trying to reproduce this issue, but so far I haven't been able
to. I guess I'm missing something, so here are the steps I followed:

First, I created a primary partition on my HDD, sda3, and then I
executed the following in order to encrypt it and set up LVM:

# cryptsetup luksFormat /dev/sda3
# cryptsetup luksOpen /dev/sda3 crypt
# pvcreate /dev/mapper/crypt
# vgcreate crypt /dev/mapper/crypt
# lvcreate -L 20G crypt -n debian

That gives me a block device /dev/crypt/debian, which I'm attaching to a
Debian DomU as xvdb. I created a partition filling the whole disk and
formatted it inside the guest using mkfs.ext3.

Then, inside the guest, I scp'ed a 10G file from a remote host and
checked the checksum: everything OK. So far, I've tested with a Dom0
kernel 3.2.0-0.bpo.4-amd64 and DomU kernels 3.2.0-0.bpo.4-amd64 and
2.6.32-5-xen-amd64; both tests were OK.

Regards, Roger.
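The reproduction above hinges on comparing checksums before and after the transfer. A minimal sketch of that check, assuming GNU coreutils `sha256sum` is available in the guest; `verify_checksum` is a hypothetical helper name, and the expected hash is whatever `sha256sum` printed for the file on the sending host.

```shell
#!/bin/sh
# Hypothetical helper: compare a file's SHA-256 with a known-good value.
# Usage: verify_checksum EXPECTED_HASH FILE
verify_checksum() {
    expected=$1
    file=$2
    # Hash from stdin so the output is "<hash>  -"; keep only the hash field.
    actual=$(sha256sum < "$file" | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
        echo "OK $file"
    else
        echo "MISMATCH $file: expected $expected, got $actual" >&2
        return 1
    fi
}
```

On the sending host, `sha256sum < bigfile` gives the reference hash; after the scp or rsync, running `verify_checksum <hash> /path/to/bigfile` in the guest reports OK or MISMATCH and returns non-zero on a mismatch, so it can be used in a loop over many transfers.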

Anthony Sheetz

2013-May-28 12:10 UTC


Re: BUG: ext3 corruption in domU

Missed a reply-all...

I would guess the difference is that I am using LVM with full disk
encryption. Take a look at
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=705124 for the
details on exactly how I am able to recreate this bug.
In other words, I used the installer and chose the option to use full
disk encryption with LVM.
I'll be starting on the rest of the requested testing and data
collection shortly.

On Fri, May 24, 2013 at 1:48 PM, Roger Pau Monné <roger.pau@citrix.com> wrote:
> [...]

Roger Pau Monné

2013-May-28 12:14 UTC


Re: BUG: ext3 corruption in domU

On 28/05/13 14:10, Anthony Sheetz wrote:
> [...]
> In other words, I use the installer and chose the option to use full
> disk encryption and LVM.
> I''ll be starting with the rest of the testing and data collection
> which was requested shortly.
I would like to avoid reinstalling my whole OS, and I don't have a spare
HDD, so isn't there any way I can reproduce the full disk encryption
using a partition?
> 
> On Fri, May 24, 2013 at 1:48 PM, Roger Pau Monné <roger.pau@citrix.com> wrote:
>> On 17/04/13 15:00, Ian Campbell wrote:
>>> On Tue, 2013-04-16 at 18:39 +0100, Anthony Sheetz wrote:
>>>> (re-sending, first message seems to have gotten lost)
>>>>
>>>> I was referred here by Ian Campbell ijc@hellion.org.uk from bugs.debian.org.
>>>
>>> I'm here too (different hat ;-)), thanks for posting it here. I've added
>>> some people who know about the block stuff to the CC.
>>>
>>> Guys, my suspicion is that the issue is that barriers issued by ext3
>>> inside the guest aren't making it all the way down the
>>> ext3->blkfront->blkback->lvm->dm-crypt->disk chain, leading the
>>> filesystem to eventually corrupt itself.
>>>
>>> The issue seems to relate to the use of dm-crypt since
>>> ext3->blkfront->blkback->lvm->disk is reported to work fine.
>>>
>>> However there is no problem with the local dom0 ext3 root filesystem
>>> which is also in the same lvm VG on the crypt device (i.e.
>>> ext3->lvm->dm-crypt->disk), so it's not purely a dm-crypt issue. I figure
>>> something is up at the blkfront->back link which causes the barriers
>>> which blkback is injecting into the block subsystem either don't make it
>>> to the dm-crypt layer or do not DTRT once they arrive.
>>>
>>> I'm not really sure how to proceed (or how to ask Anthony to
>>> proceed) with verifying any part of that hypothesis though.
>>>
>>> ISTR issues with old vs new style barriers or barriers with no data in
>>> them or something, could this be related to that? (or am I thinking of
>>> DISCARD?)
>>>
>>> The issue was initially reported with Squeeze (Jeremy 2.6.32 tree) domU
>>> on a Wheezy (mainline 3.2) dom0 but IIRC has also been repeated with
>>> Wheezy on Wheezy now so this isn't cross version confusion about barrier
>>> semantics AFAICT.
>>
>> Hello,
>>
>> I've been trying to reproduce this issue, but so far I haven't been able
>> to. I guess I'm missing something, so here are the steps I followed:
>>
>> First, I've created a primary partition in my HDD, it's sda3, and then
>> I've executed the following in order to encrypt it and set up the lvm:
>>
>> # cryptsetup luksFormat /dev/sda3
>> # cryptsetup luksOpen /dev/sda3 crypt
>> # pvcreate /dev/mapper/crypt
>> # vgcreate crypt /dev/mapper/crypt
>> # lvcreate -L 20G crypt -n debian
>>
>> That gives me a block device /dev/crypt/debian, that I'm attaching to a
>> Debian DomU as xvdb, I've created a partition to fill the whole disk and
>> formatted it inside the guest using mkfs.ext3.
>>
>> Then, inside the guest, I've scp'ed a 10G file from a remote host, and
>> checked the checksum, everything OK. So far, I've tested with a Dom0
>> kernel 3.2.0-0.bpo.4-amd64 and a DomU kernel 3.2.0-0.bpo.4-amd64 and
>> 2.6.32-5-xen-amd64, both tests were OK.
>>
>> Regards, Roger.
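Roger's check (scp a large file into the guest, then compare checksums) is easy to script so it can be repeated across kernel combinations. A minimal Python sketch, assuming the sender's md5 hex digest is available (for example from running md5sum on the remote host); the function names and the 1 MiB chunk size are illustrative choices, not anything used in this thread:

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the md5 hex digest of a file, read in chunks so a
    multi-gigabyte transfer never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_ok(path, expected_md5):
    """True if the received file matches the digest computed on the sender.
    The expected digest is normalised (whitespace stripped, lowercased)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in fixed-size chunks keeps memory use flat even for a 10G file, and normalising the expected digest tolerates a value pasted with a trailing newline or in upper case.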

Anthony Sheetz

2013-May-28 14:27 UTC

Re: BUG: ext3 corruption in domU

> Thank you. Also when you do this test, could you also provide the 'xenstore-ls'
> output from dom0? And the 'dmesg' output from the guest (or at least
> the 'xl console <guest> | tee /tmp/log')? That would give me an idea if
> the frontend/backend have the right negotiation parameters.

Attached is the output of xenstore-ls from dom0, and dmesg from a domU
with kernel 2.6.32-5-xen-amd64.
Will be working on putting a 3.2 kernel in place next, testing file
transfer, and adding the output of dmesg from that.
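Konrad asked for the xenstore-ls output to see whether frontend and backend agreed on barrier/flush support. A rough Python sketch for pulling those flags out of a saved xenstore-ls dump. It assumes the tool's usual layout of one leading space per tree level with `key = "value"` lines, and assumes the flags of interest are `feature-barrier` and `feature-flush-cache` under a `backend/vbd/...` node; both the layout and the key names should be checked against the actual attachment:

```python
import re

# Assumed xenstore-ls line shape: <indent spaces><key> = "<value>"
LINE_RE = re.compile(r'^( *)(\S+) = "(.*)"$')


def parse_xenstore_ls(text):
    """Turn indented xenstore-ls output into {'/a/b/c': 'value'}.
    Depth is taken from the number of leading spaces (assumption)."""
    stack = []
    paths = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that is not a key = "value" line
        depth = len(m.group(1))
        del stack[depth:]          # climb back up to this node's parent
        stack.append(m.group(2))
        paths["/" + "/".join(stack)] = m.group(3)
    return paths


def vbd_features(paths):
    """Collect the barrier/flush flags advertised under each backend vbd node."""
    wanted = ("feature-barrier", "feature-flush-cache")
    result = {}
    for path, value in paths.items():
        parent, _, key = path.rpartition("/")
        if "/backend/vbd/" in parent and key in wanted:
            result.setdefault(parent, {})[key] = value
    return result
```

A vbd whose node shows neither flag, or shows them as "0", would be the first thing worth comparing against the "flush diskcache: enabled" lines blkfront prints in the guest dmesg.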



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Anthony Sheetz

2013-May-28 18:02 UTC

Re: BUG: ext3 corruption in domU

On Tue, May 28, 2013 at 10:27 AM, Anthony Sheetz <sheetzam@inspire.com> wrote:
>> Thank you. Also when you do this test, could you also provide the 'xenstore-ls'
>> output from dom0? And the 'dmesg' output from the guest (or at least
>> the 'xl console <guest> | tee /tmp/log')? That would give me an idea if
>> the frontend/backend have the right negotiation parameters.
>
> Attached is the output of xenstore-ls from dom0, and dmesg from a domU
> with kernel 2.6.32-5-xen-amd64
> Will be working on putting a 3.2 kernel in place next, testing file
> transfer, and adding the output of dmesg from that.

Updated to 3.2 using
http://www.cyberciti.biz/faq/debian-linux-6-apt-get-install-linux-kernel-3-2/
for instructions.
During transfer of data I saw this: BUG: scheduling while atomic:
kworker/0:2/10421/0x10000002
The transfer test resulted in a file which did not match its md5sum. Attached
is the dmesg output from the domU.
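Because it is not yet clear which log the BUG line came from, filtering each captured log (domU dmesg, dom0 dmesg, the xl console capture) for kernel-problem markers would show where it actually appears. A small Python sketch; the marker list is an assumption, chosen to catch "BUG:" lines like the one above and ext3 error messages:

```python
import re

# Illustrative marker set (assumption): kernel BUG/Oops/WARNING lines and ext3 errors.
PROBLEM_RE = re.compile(r"\b(BUG:|Oops|WARNING:|EXT3-fs error)")


def find_kernel_problems(log_text):
    """Return the log lines that look like kernel bugs or ext3 errors."""
    return [line for line in log_text.splitlines() if PROBLEM_RE.search(line)]
```

Running it over each saved log separately, rather than over one concatenated file, keeps the answer to "which side printed this?" unambiguous.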



Anthony Sheetz

2013-May-28 18:15 UTC

Re: BUG: ext3 corruption in domU

>> I would guess the difference is I am using LVM with full disk
>> encryption. Take a look at
>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=705124 for the
>> details on exactly how I am able to recreate this bug.
>> In other words, I use the installer and chose the option to use full
>> disk encryption and LVM.
>> I'll be starting with the rest of the testing and data collection
>> which was requested shortly.
>
> I would like to avoid reinstalling my whole OS, and I don't have a spare
> HDD, so isn't there any way I can reproduce the full disk encryption
> using a partition?

As my colleague points out, the setup you have misses that a single
encrypted object is in use by both dom0 and domU. Without having your
dom0 on the same encrypted device as your domU (even though they use
different logical volumes) I'm not sure how to test it.

Konrad Rzeszutek Wilk

2013-May-28 18:18 UTC

Re: BUG: ext3 corruption in domU

On Tue, May 28, 2013 at 02:02:41PM -0400, Anthony Sheetz wrote:
> On Tue, May 28, 2013 at 10:27 AM, Anthony Sheetz <sheetzam@inspire.com> wrote:
> >> Thank you. Also when you do this test, could you also provide the 'xenstore-ls'
> >> output from dom0? And the 'dmesg' output from the guest (or at least
> >> the 'xl console <guest> | tee /tmp/log')? That would give me an idea if
> >> the frontend/backend have the right negotiation parameters.
> >
> > Attached is the output of xenstore-ls from dom0, and dmesg from a domU
> > with kernel 2.6.32-5-xen-amd64
> > Will be working on putting a 3.2 kernel in place next, testing file
> > transfer, and adding the output of dmesg from that.
> 
> updated to 3.2 using
> http://www.cyberciti.biz/faq/debian-linux-6-apt-get-install-linux-kernel-3-2/
> for instructions.
> During transfer of data saw this: BUG: scheduling while atomic:
> kworker/0:2/10421/0x10000002

? I don't see it here?

> Transfer test resulted in a file which did not match md5sum. Attached
> is the dmesg output from the domU.

Shouldn't the BUG be present here?

> [    0.000000] Initializing cgroup subsys cpuset
> [    0.000000] Initializing cgroup subsys cpu
> [    0.000000] Linux version 3.2.0-0.bpo.4-amd64 (debian-kernel@lists.debian.org) (gcc version 4.4.5 (Debian 4.4.5-8) ) #1 SMP Debian 3.2.41-2+deb7u2~bpo60+1
> [    0.000000] Command line:  root=/dev/xvda2 ro 
> [    0.000000] ACPI in unprivileged domain disabled
> [    0.000000] Released 0 pages of unused memory
> [    0.000000] Set 0 page(s) to 1-1 mapping
> [    0.000000] BIOS-provided physical RAM map:
> [    0.000000]  Xen: 0000000000000000 - 00000000000a0000 (usable)
> [    0.000000]  Xen: 00000000000a0000 - 0000000000100000 (reserved)
> [    0.000000]  Xen: 0000000000100000 - 0000000060800000 (usable)
> [    0.000000] NX (Execute Disable) protection: active
> [    0.000000] DMI not present or invalid.
> [    0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
> [    0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
> [    0.000000] No AGP bridge found
> [    0.000000] last_pfn = 0x60800 max_arch_pfn = 0x400000000
> [    0.000000] initial memory mapped : 0 - 03639000
> [    0.000000] Base memory trampoline at [ffff88000009b000] 9b000 size 20480
> [    0.000000] init_memory_mapping: 0000000000000000-0000000060800000
> [    0.000000]  0000000000 - 0060800000 page 4k
> [    0.000000] kernel direct mapping tables up to 60800000 @ cf9000-1000000
> [    0.000000] xen: setting RW the range fdc000 - 1000000
> [    0.000000] RAMDISK: 01949000 - 03639000
> [    0.000000] NUMA turned off
> [    0.000000] Faking a node at 0000000000000000-0000000060800000
> [    0.000000] Initmem setup node 0 0000000000000000-0000000060800000
> [    0.000000]   NODE_DATA [000000005fffb000 - 000000005fffffff]
> [    0.000000] Zone PFN ranges:
> [    0.000000]   DMA      0x00000010 -> 0x00001000
> [    0.000000]   DMA32    0x00001000 -> 0x00100000
> [    0.000000]   Normal   empty
> [    0.000000] Movable zone start PFN for each node
> [    0.000000] early_node_map[2] active PFN ranges
> [    0.000000]     0: 0x00000010 -> 0x000000a0
> [    0.000000]     0: 0x00000100 -> 0x00060800
> [    0.000000] On node 0 totalpages: 395152
> [    0.000000]   DMA zone: 56 pages used for memmap
> [    0.000000]   DMA zone: 744 pages reserved
> [    0.000000]   DMA zone: 3184 pages, LIFO batch:0
> [    0.000000]   DMA32 zone: 5348 pages used for memmap
> [    0.000000]   DMA32 zone: 385820 pages, LIFO batch:31
> [    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
> [    0.000000] SMP: Allowing 1 CPUs, 0 hotplug CPUs
> [    0.000000] No local APIC present
> [    0.000000] APIC: disable apic facility
> [    0.000000] APIC: switched to apic NOOP
> [    0.000000] nr_irqs_gsi: 16
> [    0.000000] PM: Registered nosave memory: 00000000000a0000 - 0000000000100000
> [    0.000000] Allocating PCI resources starting at 60800000 (gap: 60800000:9f800000)
> [    0.000000] Booting paravirtualized kernel on Xen
> [    0.000000] Xen version: 4.1.4 (preserve-AD)
> [    0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:1 nr_node_ids:1
> [    0.000000] PERCPU: Embedded 28 pages/cpu @ffff88005fc00000 s82880 r8192 d23616 u2097152
> [    0.000000] pcpu-alloc: s82880 r8192 d23616 u2097152 alloc=1*2097152
> [    0.000000] pcpu-alloc: [0] 0 
> [    0.000000] Built 1 zonelists in Node order, mobility grouping on.  Total pages: 389004
> [    0.000000] Policy zone: DMA32
> [    0.000000] Kernel command line:  root=/dev/xvda2 ro 
> [    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
> [    0.000000] Checking aperture...
> [    0.000000] No AGP bridge found
> [    0.000000] Calgary: detecting Calgary via BIOS EBDA area
> [    0.000000] Calgary: Unable to locate Rio Grande table in EBDA - bailing!
> [    0.000000] Memory: 1504508k/1581056k available (3531k kernel code, 448k absent, 76100k reserved, 3208k data, 616k init)
> [    0.000000] Hierarchical RCU implementation.
> [    0.000000] 	RCU dyntick-idle grace-period acceleration is enabled.
> [    0.000000] NR_IRQS:33024 nr_irqs:256 16
> [    0.000000] Console: colour dummy device 80x25
> [    0.000000] console [tty0] enabled
> [    0.000000] console [hvc0] enabled
> [    0.000000] Xen: using vcpuop timer interface
> [    0.000000] installing Xen timer for CPU 0
> [    0.000000] Detected 2294.848 MHz processor.
> [    0.004000] Calibrating delay loop (skipped), value calculated using timer frequency.. 4589.69 BogoMIPS (lpj=9179392)
> [    0.004000] pid_max: default: 32768 minimum: 301
> [    0.004000] Security Framework initialized
> [    0.004000] AppArmor: AppArmor disabled by boot time parameter
> [    0.004000] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
> [    0.004000] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
> [    0.004000] Mount-cache hash table entries: 256
> [    0.004000] Initializing cgroup subsys cpuacct
> [    0.004000] Initializing cgroup subsys memory
> [    0.004000] Initializing cgroup subsys devices
> [    0.004000] Initializing cgroup subsys freezer
> [    0.004000] Initializing cgroup subsys net_cls
> [    0.004000] Initializing cgroup subsys blkio
> [    0.004000] Initializing cgroup subsys perf_event
> [    0.004000] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
> [    0.004000] ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)
> [    0.004000] CPU: Physical Processor ID: 0
> [    0.004000] CPU: Processor Core ID: 0
> [    0.004000] SMP alternatives: switching to UP code
> [    0.029088] Freeing SMP alternatives: 16k freed
> [    0.029163] Performance Events: unsupported p6 CPU model 58 no PMU driver, software events only.
> [    0.029293] NMI watchdog disabled (cpu0): hardware events not enabled
> [    0.029318] Brought up 1 CPUs
> [    0.029448] devtmpfs: initialized
> [    0.032173] Grant table initialized
> [    0.032244] print_constraints: dummy: 
> [    0.032305] NET: Registered protocol family 16
> [    0.032510] PCI: setting up Xen PCI frontend stub
> [    0.032517] PCI: pci_cache_line_size set to 64 bytes
> [    0.033015] bio: create slab <bio-0> at 0
> [    0.033078] ACPI: Interpreter disabled.
> [    0.033098] xen/balloon: Initialising balloon driver.
> [    0.033098] xen-balloon: Initialising balloon driver.
> [    0.033098] vgaarb: loaded
> [    0.033098] PCI: System does not support PCI
> [    0.033098] PCI: System does not support PCI
> [    0.033098] Switching to clocksource xen
> [    0.033194] pnp: PnP ACPI: disabled
> [    0.034979] PCI: max bus depth: 0 pci_try_num: 1
> [    0.035010] NET: Registered protocol family 2
> [    0.035175] IP route cache hash table entries: 65536 (order: 7, 524288 bytes)
> [    0.036322] TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
> [    0.037073] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
> [    0.037188] TCP: Hash tables configured (established 262144 bind 65536)
> [    0.037193] TCP reno registered
> [    0.037207] UDP hash table entries: 1024 (order: 3, 32768 bytes)
> [    0.037225] UDP-Lite hash table entries: 1024 (order: 3, 32768 bytes)
> [    0.037284] NET: Registered protocol family 1
> [    0.037292] PCI: CLS 0 bytes, default 64
> [    0.037327] Unpacking initramfs...
> [    0.061808] Freeing initrd memory: 29632k freed
> [    0.067281] platform rtc_cmos: registered platform RTC device (no PNP device found)
> [    0.067460] audit: initializing netlink socket (disabled)
> [    0.067471] type=2000 audit(1369752979.409:1): initialized
> [    0.080739] HugeTLB registered 2 MB page size, pre-allocated 0 pages
> [    0.080910] VFS: Disk quotas dquot_6.5.2
> [    0.080931] Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> [    0.080980] msgmni has been set to 2996
> [    0.081099] alg: No test for stdrng (krng)
> [    0.081120] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253)
> [    0.081126] io scheduler noop registered
> [    0.081129] io scheduler deadline registered
> [    0.081140] io scheduler cfq registered (default)
> [    0.081183] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> [    0.081202] pciehp: PCI Express Hot Plug Controller Driver version: 0.4
> [    0.081206] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
> [    0.197788] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> [    0.198048] Linux agpgart interface v0.103
> [    0.198133] i8042: PNP: No PS/2 controller found. Probing ports directly.
> [    1.200733] i8042: No controller found
> [    1.200830] mousedev: PS/2 mouse device common for all mice
> [    1.240666] rtc_cmos rtc_cmos: rtc core: registered rtc_cmos as rtc0
> [    1.240721] rtc_cmos: probe of rtc_cmos failed with error -38
> [    1.240885] TCP cubic registered
> [    1.240933] NET: Registered protocol family 10
> [    1.241267] Mobile IPv6
> [    1.241274] NET: Registered protocol family 17
> [    1.241283] Registering the dns_resolver key type
> [    1.241388] PM: Hibernation image not present or could not be loaded.
> [    1.241395] registered taskstats version 1
> [    1.241410] XENBUS: Device with no driver: device/vbd/51714
> [    1.241416] XENBUS: Device with no driver: device/vbd/51713
> [    1.241420] XENBUS: Device with no driver: device/vif/0
> [    1.241425] XENBUS: Device with no driver: device/console/0
> [    1.241442] /build/buildd-linux_3.2.41-2+deb7u2~bpo60+1-amd64-mnypfK/linux-3.2.41/drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
> [    1.241476] Initializing network drop monitor service
> [    1.241791] Freeing unused kernel memory: 616k freed
> [    1.241910] Write protecting the kernel read-only data: 6144k
> [    1.244660] Freeing unused kernel memory: 548k freed
> [    1.245050] Freeing unused kernel memory: 708k freed
> [    1.276238] udev[45]: starting version 164
> [    1.312147] Initialising Xen virtual ethernet driver.
> [    1.327497] blkfront: xvda2: flush diskcache: enabled
> [    1.331984] blkfront: xvda1: flush diskcache: enabled
> [    1.667213] kjournald starting.  Commit interval 5 seconds
> [    1.667240] EXT3-fs (xvda2): mounted filesystem with ordered data mode
> [    2.738037] udev[140]: starting version 164
> [    3.172340] input: PC Speaker as /devices/platform/pcspkr/input/input0
> [    3.296421] alg: No test for __gcm-aes-aesni (__driver-gcm-aes-aesni)
> [    3.660850] Error: Driver 'pcspkr' is already registered, aborting...
> [    3.965481] Adding 262140k swap on /dev/xvda1.  Priority:-1 extents:1 across:262140k SS
> [    4.075322] EXT3-fs (xvda2): using internal journal
> [    5.839480] sshd (534): /proc/534/oom_adj is deprecated, please use /proc/534/oom_score_adj instead.
> [   15.408035] eth0: no IPv6 routers present

Anthony Sheetz

2013-May-28 18:19 UTC

Re: BUG: ext3 corruption in domU

I'd have thought so as well. It's possible that was console output
from dom0, come to think of it.

On Tue, May 28, 2013 at 2:18 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Tue, May 28, 2013 at 02:02:41PM -0400, Anthony Sheetz wrote:
>> On Tue, May 28, 2013 at 10:27 AM, Anthony Sheetz <sheetzam@inspire.com> wrote:
>> >> Thank you. Also when you do this test, could you also provide the 'xenstore-ls'
>> >> output from dom0? And the 'dmesg' output from the guest (or at least
>> >> the 'xl console <guest> | tee /tmp/log')? That would give me an idea if
>> >> the frontend/backend have the right negotiation parameters.
>> >
>> > Attached is the output of xenstore-ls from dom0, and dmesg from a domU
>> > with kernel 2.6.32-5-xen-amd64
>> > Will be working on putting a 3.2 kernel in place next, testing file
>> > transfer, and adding the output of dmesg from that.
>>
>> updated to 3.2 using
>> http://www.cyberciti.biz/faq/debian-linux-6-apt-get-install-linux-kernel-3-2/
>> for instructions.
>> During transfer of data saw this: BUG: scheduling while atomic:
>> kworker/0:2/10421/0x10000002
>
> ? I don't see it here?
>> Transfer test resulted in a file which did not match md5sum. Attached
>> is the dmesg output from the domU.
>
> Shouldn't the BUG be present here?
>

Ian Campbell

2013-May-29 08:39 UTC

Re: BUG: ext3 corruption in domU

On Tue, 2013-05-28 at 14:15 -0400, Anthony Sheetz wrote:
> >> I would guess the difference is I am using LVM with full disk
> >> encryption. Take a look at
> >> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=705124 for the
> >> details on exactly how I am able to recreate this bug.
> >> In other words, I use the installer and chose the option to use full
> >> disk encryption and LVM.
> >> I'll be starting with the rest of the testing and data collection
> >> which was requested shortly.
> >
> > I would like to avoid reinstalling my whole OS, and I don't have a spare
> > HDD, so isn't there any way I can reproduce the full disk encryption
> > using a partition?
> 
> As my colleague points out, the setup you have misses that a single
> encrypted object is in use by both dom0 and domU. Without having your
> dom0 on the same encrypted device as your domU (even though they use
> different logical volumes) I'm not sure how to test it.

Perhaps you could install a second dom0 rootfs on the LVM partition and
use that for testing. This would at least avoid blowing away the
original "primary" dom0 rootfs, which I suppose is what Roger would like
to avoid.

Ian.

Anthony Sheetz

2013-May-29 11:53 UTC

Re: BUG: ext3 corruption in domU

Is there anything else I can get you at this time to help troubleshoot this?

On Fri, May 24, 2013 at 10:20 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Thu, May 23, 2013 at 02:19:50PM -0400, Anthony Sheetz wrote:
>> On Wed, May 22, 2013 at 4:10 PM, Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com> wrote:
>> > On Mon, Apr 22, 2013 at 01:26:34PM +0100, Ian Campbell wrote:
>> >> Konrad is on vacation this week, so it'll probably be next week before
>> >> this gets looked at by him.
>> >
>> > And I finally got to this email in my 'vacation-mbox'
>> >>
>> >> Ian.
>> >>
>> >> On Mon, 2013-04-22 at 13:22 +0100, Anthony Sheetz wrote:
>> >> > I realize folks are pretty busy, but we're still interested in getting
>> >> > this problem solved, and I want to be sure it's not lost in the
>> >> > shuffle.
>> >> > Any chance of getting some attention for it?
>> >> >
>> >> > On Wed, Apr 17, 2013 at 9:00 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>> >> > > On Tue, 2013-04-16 at 18:39 +0100, Anthony Sheetz wrote:
>> >> > >> (re-sending, first message seems to have gotten lost)
>> >> > >>
>> >> > >> I was referred here by Ian Campbell ijc@hellion.org.uk from bugs.debian.org.
>> >> > >
>> >> > > I'm here too (different hat ;-)), thanks for posting it here. I've added
>> >> > > some people who know about the block stuff to the CC.
>> >> > >
>> >> > > Guys, my suspicion is that the issue is that barriers issued by ext3
>> >> > > inside the guest aren't making it all the way down the
>> >> > > ext3->blkfront->blkback->lvm->dm-crypt->disk chain, leading the
>> >> > > filesystem to eventually corrupt itself.
>> >> > >
>> >> > > The issue seems to relate to the use of dm-crypt since
>> >> > > ext3->blkfront->blkback->lvm->disk is reported to work fine.
>> >> > >
>> >> > > However there is no problem with the local dom0 ext3 root filesystem
>> >> > > which is also in the same lvm VG on the crypt device (i.e.
>> >> > > ext3->lvm->dm-crypt->disk), so it's not purely a dm-crypt issue. I figure
>> >> > > something is up at the blkfront->back link which causes the barriers
>> >> > > which blkback is injecting into the block subsystem either don't make it
>> >> > > to the dm-crypt layer or do not DTRT once they arrive.
>> >> > >
>> >> > > I'm not really sure how to proceed (or how to ask Anthony to
>> >> > > proceed) with verifying any part of that hypothesis though.
>> >> > >
>> >> > > ISTR issues with old vs new style barriers or barriers with no data in
>> >> > > them or something, could this be related to that? (or am I thinking of
>> >> > > DISCARD?)
>> >
>> > You are using two different kernel versions. The 2.6.32 domU is only using
>> > WRITE_BARRIERs, while in the 3.2 kernels those have been completely eliminated.
>> > The mechanism they use is called 'WRITE_FLUSH'. The 3.2 kernel has a patch:
>> > commit 29bde093787f3bdf7b9b4270ada6be7c8076e36b
>> > Author: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> > Date:   Mon Oct 10 00:42:22 2011 -0400
>> >
>> >     xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
>> >
>> >
>> > which emulates the barrier request by draining all of the outstanding I/Os and then
>> > sending the WRITE_FLUSH.
>> >
>> > But it looks like you are hitting an issue here. Just to make sure
>> > that is the case, what happens if you use the _same_ kernel in both dom0 and
>> > domU? Does it work then?
>> >
>>
>> First, thank you so much for getting back to me, it's really appreciated.
>> At this point I've forgotten if I did this with Wheezy on Wheezy, and
>> what the result was.
>> I'll have to test using the 3.2 kernel on the domU Debian Squeeze and
>> get back to you. I should be able to do that early next week.
>
> Thank you. Also when you do this test, could you also provide the 'xenstore-ls'
> output from dom0? And the 'dmesg' output from the guest (or at least
> the 'xl console <guest> | tee /tmp/log')? That would give me an idea if
> the frontend/backend have the right negotiation parameters.
>
> Have a good weekend!
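The barrier emulation described for commit 29bde093 (drain every outstanding I/O, then send WRITE_FLUSH) can be illustrated with a toy model. This is a hypothetical Python 3 sketch, not blkback's kernel code: `ToyBackend`, `submit_write` and `barrier` are invented names, device latency is simulated with sleeps in a thread pool, and the barrier is modelled as carrying no data. The point it shows is that a flush only guarantees that writes which have already completed reach stable storage, so an emulated old-style barrier must first wait for everything still in flight.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait


class ToyBackend:
    """Hypothetical model of emulating an old-style barrier:
    drain all outstanding writes, then issue a cache flush."""

    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._inflight = set()
        self._lock = threading.Lock()
        self.log = []  # order in which the simulated disk saw events

    def _do_write(self, sector, delay):
        time.sleep(delay)  # simulate variable device latency
        with self._lock:
            self.log.append(("WRITE", sector))

    def _retire(self, fut):
        with self._lock:
            self._inflight.discard(fut)

    def submit_write(self, sector, delay=0.0):
        fut = self._pool.submit(self._do_write, sector, delay)
        with self._lock:
            self._inflight.add(fut)
        fut.add_done_callback(self._retire)
        return fut

    def barrier(self):
        # 1. Drain: wait until every write submitted so far has completed.
        with self._lock:
            pending = list(self._inflight)
        wait(pending)
        # 2. Flush: only now ask the device to commit its volatile cache.
        with self._lock:
            self.log.append(("FLUSH", None))

    def shutdown(self):
        self._pool.shutdown(wait=True)


be = ToyBackend()
be.submit_write(1, delay=0.05)
be.submit_write(2, delay=0.01)
be.barrier()          # blocks until writes 1 and 2 are done
be.submit_write(3)    # submitted only after the flush
be.shutdown()
print(be.log)
```

In this single-threaded usage the caller blocks in `barrier()`, so no new write can overtake the flush; a real backend would also have to hold off newly arriving requests while it drains.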

Konrad Rzeszutek Wilk

2013-May-29 15:15 UTC

Re: BUG: ext3 corruption in domU

On Tue, May 28, 2013 at 02:19:17PM -0400, Anthony Sheetz wrote:
> I'd have thought so as well. It's possible that was console output
> from dom0, come to think of it.

OK, any chance you could capture that? Some questions below:
> 
> On Tue, May 28, 2013 at 2:18 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> > On Tue, May 28, 2013 at 02:02:41PM -0400, Anthony Sheetz wrote:
> >> On Tue, May 28, 2013 at 10:27 AM, Anthony Sheetz <sheetzam@inspire.com> wrote:
> >> >> Thank you. Also when you do this test, could you also provide the 'xenstore-ls'
> >> >> output from dom0? And the 'dmesg' output from the guest (or at least
> >> >> the 'xl console <guest> | tee /tmp/log')? That would give me an idea if
> >> >> the frontend/backend have the right negotiation parameters.
> >> >
> >> > Attached is the output of xenstore-ls from dom0, and dmesg from a domU
> >> > with kernel 2.6.32-5-xen-amd64.
> >> > Will be working on putting a 3.2 kernel in place next, testing file
> >> > transfer, and adding the output of dmesg from that.
> >>
> >> Updated to 3.2 using
> >> http://www.cyberciti.biz/faq/debian-linux-6-apt-get-install-linux-kernel-3-2/
> >> for instructions.
> >> During transfer of data saw this: BUG: scheduling while atomic:
> >> kworker/0:2/10421/0x10000002
> >
> > ? I don't see it here?
> >> Transfer test resulted in a file which did not match md5sum. Attached
> >> is the dmesg output from the domU.
So what exactly is the transfer you are speaking of? Are you using 'scp' to a
disk in the guest? Can you describe to me how your disk in the guest is set
up? When you do the 'md5sum', do you do it after you have dropped the cache?

Is the storage on a USB stick/disk?
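Dropping the cache before running md5sum matters, presumably, because a freshly written file is normally read back from the guest's page cache, so the checksum can reflect the data held in RAM rather than what actually reached the disk. A minimal Python 3 sketch of the check, assuming a Linux guest and root privileges for the cache drop; the function names are hypothetical, and it is roughly the equivalent of running `sync`, then `echo 3 > /proc/sys/vm/drop_caches`, then `md5sum`:

```python
import hashlib
import os


def drop_page_cache():
    """Drop clean page-cache pages so the next read hits the block device.
    Linux-only; requires root."""
    os.sync()  # flush dirty pages first; drop_caches only discards clean ones
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")  # 3 = page cache plus dentries and inodes


def md5_of(path, chunk_size=1 << 20):
    """Hex MD5 of a file, read in chunks so large transfers fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path, expected_md5, drop_cache=True):
    """Compare the on-disk contents of path against the sender's checksum."""
    if drop_cache:
        drop_page_cache()
    return md5_of(path) == expected_md5
```

The expected checksum would come from running the same hash on the sending side before the scp/rsync transfer.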

Konrad Rzeszutek Wilk

2013-May-30 18:36 UTC

Re: BUG: ext3 corruption in domU

On Wed, May 29, 2013 at 07:53:39AM -0400, Anthony Sheetz wrote:
> Is there anything else I can get you at this time to help troubleshoot
> this?

Well, this reminds me of an ext3 bug in the 2.6.32 stable tree for which
the maintainer of ext3 did not want to backport the fix. It was a
bug that caused corruption.

If I could just remember the email thread about it.
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>

Anthony Sheetz

2013-Jun-04 12:55 UTC

Re: BUG: ext3 corruption in domU

On Thu, May 30, 2013 at 2:36 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Wed, May 29, 2013 at 07:53:39AM -0400, Anthony Sheetz wrote:
>> Is there anything else I can get you at this time to help troubleshoot
>> this?
>
> Well, this reminds me of an ext3 bug in the 2.6.32 stable tree for which
> the maintainer of ext3 did not want to backport the fix. It was a
> bug that caused corruption.
>
> If I could just remember the email thread about it.
Is there anything I can do at this point to help with this bug?

Konrad Rzeszutek Wilk

2013-Jun-04 13:41 UTC

Re: BUG: ext3 corruption in domU

On Tue, Jun 04, 2013 at 08:55:26AM -0400, Anthony Sheetz wrote:
> On Thu, May 30, 2013 at 2:36 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> > On Wed, May 29, 2013 at 07:53:39AM -0400, Anthony Sheetz wrote:
> >> Is there anything else I can get you at this time to help troubleshoot this?
> >
> > Well, this reminds me of an ext3 bug in the 2.6.32 stable tree for which
> > the maintainer of ext3 did not want to backport the fix. It was a
> > bug that caused corruption.
> >
> > If I could just remember the email thread about it.

Can't recall it, but maybe Teck can?
> 
> Is there anything I can do at this point to help with this bug?

Konrad Rzeszutek Wilk

2013-Jun-07 17:10 UTC

Re: BUG: ext3 corruption in domU

On Tue, Jun 04, 2013 at 09:41:10AM -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Jun 04, 2013 at 08:55:26AM -0400, Anthony Sheetz wrote:
> > On Thu, May 30, 2013 at 2:36 PM, Konrad Rzeszutek Wilk
> > <konrad.wilk@oracle.com> wrote:
> > > On Wed, May 29, 2013 at 07:53:39AM -0400, Anthony Sheetz wrote:
> > >> Is there anything else I can get you at this time to help troubleshoot this?
> > >
> > > Well, this reminds me of an ext3 bug in the 2.6.32 stable tree for which
> > > the maintainer of ext3 did not want to backport the fix. It was a
> > > bug that caused corruption.
> > >
> > > If I could just remember the email thread about it.
>
> Can't recall it, but maybe Teck can?

He doesn't seem to respond.

Anthony, I have this on my queue to look at, so I will get to it.
Sadly that is not going to happen this week :-(

Anthony Sheetz

2013-Jun-07 18:43 UTC

Re: BUG: ext3 corruption in domU

Not a problem. Just wanted to be sure we weren't a dependency. Thanks
for your attention!

On Fri, Jun 7, 2013 at 1:10 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Tue, Jun 04, 2013 at 09:41:10AM -0400, Konrad Rzeszutek Wilk wrote:
>> On Tue, Jun 04, 2013 at 08:55:26AM -0400, Anthony Sheetz wrote:
>> > On Thu, May 30, 2013 at 2:36 PM, Konrad Rzeszutek Wilk
>> > <konrad.wilk@oracle.com> wrote:
>> > > On Wed, May 29, 2013 at 07:53:39AM -0400, Anthony Sheetz wrote:
>> > >> Is there anything else I can get you at this time to help troubleshoot this?
>> > >
>> > > Well, this reminds me of an ext3 bug in the 2.6.32 stable tree for which
>> > > the maintainer of ext3 did not want to backport the fix. It was a
>> > > bug that caused corruption.
>> > >
>> > > If I could just remember the email thread about it.
>>
>> Can't recall it, but maybe Teck can?
>
>
> He doesn't seem to respond.
>
> Anthony, I have this on my queue to look at, so I will get to it.
> Sadly that is not going to happen this week :-(

Konrad Rzeszutek Wilk

2013-Jul-02 18:10 UTC

Re: BUG: ext3 corruption in domU

On Fri, Jun 07, 2013 at 02:43:06PM -0400, Anthony Sheetz wrote:
> Not a problem. Just wanted to be sure we weren't a dependency. Thanks
> for your attention!
Installing a new box with Wheezy to try this out. The one thing I could
not find in the thread or in the bug was the guest config. Could you
please reply with it? Thanks.

Xen devel - Apr 2013 - BUG: ext3 corruption in domU
